Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

Closed
howardtr opened this issue Apr 25, 2023 · 13 comments
Closed

verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

howardtr opened this issue Apr 25, 2023 · 13 comments
Assignees

Comments

@howardtr
Copy link
Collaborator

I ran into the following issue while running some smoke tests. I did a few git bisect to see where things broke.

----------------------------------------
Running SHA512 Smoke Test             !!
----------------------------------------
Hit max cycle count (20000000) .. stopping
- /home/runner/work/caliptra-rtl/caliptra-rtl/src/integration/tb/caliptra_top_tb_services.sv:495: Verilog $finish
testname commit
smoke_test_sha512, smoke_test_sha256, iccm_lock, smoke_test_sha_accel 51bf68d

Wondering if these have been fixed internally?

@calebofearth
Copy link
Collaborator

Hi @howardtr - I'm taking a look, thanks for your patience.
While I'm trying to reproduce, a tangential question: the linked verilator regression looks like it passed, you can only see the timeout by manually popping open the specific smoke_test run and checking out the logfile. Have you identified a more reliable way to identify failing runs?

@calebofearth
Copy link
Collaborator

I was unable to reproduce failures on smoke_test_sha512 with VCS or Verilator simulations in our internal repo. I can look into reproducing on your branch tomorrow.

@howardtr
Copy link
Collaborator Author

Hi @calebofearth

While I'm trying to reproduce, a tangential question: the linked verilator regression looks like it passed, you can only see the timeout by manually popping open the specific smoke_test run and checking out the logfile. Have you identified a more reliable way to identify failing runs?

The only way that I've seen this fail so far is that the test takes a really long time. From the summary page, you can see that the failed runs completed around ~2h40m when I expect them all to finish < 5m. I suspect changing $display to $error here might force this to fail properly since the $finish after the $display indicates a normal exit:

$display ("Hit max cycle count (%0d) .. stopping",cycleCnt);
dump_memory_contents(MEMTYPE_LMEM, 32'h8000_0110, 32'h8000_0180);
dump_memory_contents(MEMTYPE_DCCM, `RV_DCCM_SADR, `RV_DCCM_EADR);
dump_memory_contents(MEMTYPE_ICCM, `RV_ICCM_SADR, `RV_ICCM_EADR);
$finish;

I was unable to reproduce failures on smoke_test_sha512 with VCS or Verilator simulations in our internal repo. I can look into reproducing on your branch tomorrow.

FYI - I checked out 51bf68d from the dev-msft branch directly and was able to reproduce. The commit prior from #51 passes.

@calebofearth
Copy link
Collaborator

calebofearth commented Apr 28, 2023

I've run all 4 of these smoke tests against the caliptra-rtl repository (GitHub, not MSFT internal version) at commit 51bf68d3c2e3a1499f53d31b1b45f144726041aa and still see no failures or timeouts.
What are your steps to reproduce this locally?
FWIW I'm using Verilator 4.228 and GCC 8.2.0, and this is the RISCV GCC we have installed:
riscv64-unknown-elf-gcc-8.2.0-2019.02.0-x86_64-linux-centos6

@howardtr
Copy link
Collaborator Author

howardtr commented Apr 28, 2023

Tool Caleb Howard CI
Verilator 4.228 5.002 5.002
gcc 8.2.0 12.2.0 12.1.0
riscv64-unknown-elf-gcc 8.2.0 12.2.0 12.1.0

Steps to reproduce:

git clone git@github.com:chipsalliance/caliptra-rtl.git
cd caliptra-rtl
git checkout dev-msft
git checkout 51bf68d3c2e3a1499f53d31b1b45f144726041aa
make verilator CALIPTRA_ROOT=../.. TESTNAME=smoke_test_sha_512

I don't have a quick/easy way to install gcc 8.2.0 on CI / local machine, but I'm trying verilator 4.228 here

(Update: --timing, which was added in 5.xx is needed for QSPI)

I'll give a try to 10.2.0 next, but I think we should upgrade since 8.2.0 was released ~2019

@calebofearth
Copy link
Collaborator

I will try to reproduce with 5.002, but there's an issue with our installation. I'm hunting down a solution for that, then I'll update here with my results.

@calebofearth
Copy link
Collaborator

Quick Update: still waiting on a good tool installation for our server.

@calebofearth
Copy link
Collaborator

@howardtr I reproduced the hang in smoke_test_sha_accel using Verilator 5.002.
The issue appears to be a verilator bug.
This code is the source of the issue:

always_comb begin
    for (int i = 0; i < WDT_TIMEOUT_PERIOD_NUM_DWORDS; i++) begin
        timer1_timeout_period[i][31:0] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[i].timer1_timeout_period.value;
        timer2_timeout_period[i][31:0] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[i].timer2_timeout_period.value;
    end
end

I ran Verilator with waves and dumped the .vcd file. The RHS signal from this assignment initializes to 0, then acquires the reset value of FFFF_FFFF after the first clock cycle, which it holds for the remainder of the sim.
The LHS of the assignment timer1_timeout_period begins an endless loop of toggling between the value 0 and FFFF_FFFF on every clock edge (on the rising clock edge it goes to FFFF_FFFF, on the falling clock edge it goes back to 0).

image

The 0 value of the timeout period results in the timer calculating an expiration immediately, which fires an unexpected interrupt, which the uC is never able to clear.

This behavior does not occur for the same logic in VCS.
Looking at the revision history for Verilator is a bit concerning.
Version 5.002 has this note:

Introduce a new combinational logic optimizer (DFG), that can yield significant performance improvements on some designs. [Geza Lore, Shunyao CAD]

Versions 5.004+ have a very large number of bug fixes.
Are you able to run 5.010 to see if this issue persists?

@howardtr
Copy link
Collaborator Author

howardtr commented May 3, 2023

Hey @calebofearth: Thanks for narrowing it down!

tldr - I agree this looks like a verilator bug, but I'd like to propose a (temp?) verilog fix :)

Can we replace that always block with a generate/assign instead so that we can get CI back up and running?

for (genvar i = 0; i < WDT_TIMEOUT_PERIOD_NUM_DWORDS; i++) begin
  assign timer1_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[i].timer1_timeout_period.value;
  assign timer2_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[i].timer2_timeout_period.value;
 end

We can file a bug against verilator in the mean time and add a TODO to revert the code if the always_comb is preferred.

Thoughts?

More details:

The above changes work with 5.002, however there are more changes required to get it to work with > 5.002.

For my own reference, I had to suppress the lint warnings for the following in src/integration/tb/dasm.svi

%Warning-IMPLICITSTATIC: ../..//src/integration/tb/dasm.svi:393:17: Function/task's lifetime implicitly set to static
                                                                  : ... Suggest use 'function automatic' or 'function static'
  393 | function string dasm32_amo( input[31:0] opcode, input tid=0);

I tried both setting the -Wno-IMPLICITSTATIC + replacing with function automatic, but once I got past that I then ran into another error:

/usr/local/bin/verilator-v5.010/share/verilator/include/verilated_timing.h:417:68: error: ‘struct VlCoroutine::VlPromise’ has no member named ‘m_continuation’
  417 |     void await_suspend(std::coroutine_handle<> coro) { m_promisep->m_continuation = coro; }

To get past this, I then replaced -std=c++11 -> -std=c++17 in the Makefile. Seems like the coroutines associated with the --timing flag aren't happy with c++11. (At this point I thought about removing the --timing flag altogether, but it would require a rewrite of the spiflash.sv model. )

@calebofearth
Copy link
Collaborator

Interesting, thanks for running the experiment! The fix looks fine to me, reality is this will be permanent :) But it's functionally equivalent, I have no complaints.

We probably can't file this bug against Verilator until we've reproduced in the latest (5.010), but it would be handy to track the above issues with upgrading in an Issue against caliptra-rtl.

Last comment: The verilator CI workflow reports a "pass" for this test, even when it has timed out per visual inspection.
My recommendation: All of our Firmware-based smoke tests will print out the string "* TESTCASE PASSED" if the test works correctly, "* TESTCASE FAILED" if an explicit error is detected, and probably neither string in a timeout scenario. Can you update the Verilator Workflow to parse output for these strings to compute pass/fail?

@howardtr
Copy link
Collaborator Author

howardtr commented May 3, 2023

Ack - I've tested against 5.002, 5.004, 5.006, and 5.010, which all exhibit the same issue with the always_comb. So we can file a bug against verilator latest. I tried to reproduce a small test case w/o luck (see below).

In regards to the pass for a timeout where there is no string, I'll update the $display -> $error, which should provide a an non-zero error exit code to the workflow.

PASSED Condition:

* TESTCASE PASSED

Finished : minstret = 5452, mcycle = 32720
See "exec.log" for execution trace with register updates..

- /usr/local/google/home/howardtr/chips/dev-goog-ci/caliptra-rtl/tools/scripts/../../src/integration/tb/caliptra_top_tb_services.sv:524: Verilog $finish

VerilatorTB: End of sim
$ echo $?
0

FAILED Condition:

* TESTCASE FAILED
 -- Extending simulation for 100 clock cycles to capture ending waveform
[1322000] %Error: caliptra_top_tb_services.sv:515: Assertion failed in TOP.caliptra_top_tb.tb_services_i: ERROR! FW attempted to end the simulation with SUCCESS after previously requesting to end the sim with FAILURE!
%Error: src/integration/tb/caliptra_top_tb_services.sv:515: Verilog $stop
Aborting...
make: *** [Makefile:183: verilator] Aborted
$ echo $?
2

TIMEOUT Condition (forcing MAX_CYCLE to a small value)

[20000] %Error: caliptra_top_tb_services.sv:491: Assertion failed in TOP.caliptra_top_tb.tb_services_i: Hit max cycle count (2000) .. stopping
%Error: src/integration/tb/caliptra_top_tb_services.sv:491: Verilog $stop
Aborting...
make: *** [Makefile:183: verilator] Aborted
$ echo $?
2

More details:
I tried to create a small test to reproduce without any luck: I think this should be the bulk of the changes that affect this timeout, but I might be missing other subtle things happening in the background.

  // Parameters
  parameter WDT_TIMEOUT_PERIOD_NUM_DWORDS = 2;

  // Struct Definitions
  typedef struct packed{
    logic [31:0] value;
  } soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__timer1_timeout_period__out_t;

  typedef struct packed{
    soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__timer1_timeout_period__out_t timer1_timeout_period;
  } soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__out_t;

   typedef struct packed{
      logic [31:0] value;
  } soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__timer2_timeout_period__out_t;
 
  typedef struct packed{
      soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__timer2_timeout_period__out_t timer2_timeout_period;
  } soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__out_t;

 typedef struct packed{
    soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__out_t [2-1:0]CPTRA_WDT_TIMER1_TIMEOUT_PERIOD; 
    soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__out_t [2-1:0]CPTRA_WDT_TIMER2_TIMEOUT_PERIOD;
  }soc_ifc_reg__out_t;

  soc_ifc_reg__out_t soc_ifc_reg_hwif_out;

  // 
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer1_timeout_period;
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer2_timeout_period;
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer1_timeout_period_good;
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer2_timeout_period_good;

// Notworking
always_comb begin
  for (int i = 0; i < WDT_TIMEOUT_PERIOD_NUM_DWORDS; i++) begin
    timer1_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[i].timer1_timeout_period.value;
    timer2_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[i].timer2_timeout_period.value;
  end
end

// Working
for (genvar j = 0; j < WDT_TIMEOUT_PERIOD_NUM_DWORDS; j++) begin
  assign timer1_timeout_period_good[j] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[j].timer1_timeout_period.value;
  assign timer2_timeout_period_good[j] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[j].timer2_timeout_period.value;
end

  always @(posedge clk_tb) begin
    if(!reset_n_tb) begin
      soc_ifc_reg_hwif_out <= '0; // all zeros
    end else begin
      soc_ifc_reg_hwif_out <= '1; // all ones
    end
  end


Screenshot 2023-05-03 at 2 53 42 PM

@calebofearth
Copy link
Collaborator

Ah, great! Didn't realize you had gotten past all the friction to actually run 5.010 - that's very helpful. Filing to the Verilator folks is a good idea then :)

The Verilator bug has to be some perfect convergence of issues - several other locations use similar always_comb + for-loop syntax with no problem. I couldn't narrow it down further either. Maybe the other branch cases in the driver FF cause the confusion.

Changing $display to $error looks good to me!
It might also make sense to generate an output log, and fail the test if the TESTCASE PASSED string is not present at all - I'm not sure when we'd ever see a test end without timing out or printing TESTCASE FAILED, but it seems like a possible failure mode to see at some point (infinite loop somewhere prevents timestep advancing?). Requiring TESTCASE PASSED would be more of a stringent checker.

@howardtr
Copy link
Collaborator Author

howardtr commented May 4, 2023

Ack - will generate an output log and check the output. Closing this out for now since this is at least passing within dev-goog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants