verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

howardtr · 2023-04-25T23:59:10Z

I ran into the following issue while running some smoke tests. I did a few git bisect to see where things broke.

----------------------------------------
Running SHA512 Smoke Test             !!
----------------------------------------
Hit max cycle count (20000000) .. stopping
- /home/runner/work/caliptra-rtl/caliptra-rtl/src/integration/tb/caliptra_top_tb_services.sv:495: Verilog $finish

testname	commit
smoke_test_sha512, smoke_test_sha256, iccm_lock, smoke_test_sha_accel	`51bf68d`

Wondering if these have been fixed internally?

The text was updated successfully, but these errors were encountered:

calebofearth · 2023-04-28T00:22:44Z

Hi @howardtr - I'm taking a look, thanks for your patience.
While I'm trying to reproduce, a tangential question: the linked verilator regression looks like it passed, you can only see the timeout by manually popping open the specific smoke_test run and checking out the logfile. Have you identified a more reliable way to identify failing runs?

calebofearth · 2023-04-28T01:12:42Z

I was unable to reproduce failures on smoke_test_sha512 with VCS or Verilator simulations in our internal repo. I can look into reproducing on your branch tomorrow.

howardtr · 2023-04-28T14:01:47Z

Hi @calebofearth

While I'm trying to reproduce, a tangential question: the linked verilator regression looks like it passed, you can only see the timeout by manually popping open the specific smoke_test run and checking out the logfile. Have you identified a more reliable way to identify failing runs?

The only way that I've seen this fail so far is that the test takes a really long time. From the summary page, you can see that the failed runs completed around ~2h40m when I expect them all to finish < 5m. I suspect changing $display to $error here might force this to fail properly since the $finish after the $display indicates a normal exit:

caliptra-rtl/src/integration/tb/caliptra_top_tb_services.sv

Lines 491 to 495 in 3e0d8ee

    
           $display ("Hit max cycle count (%0d) .. stopping",cycleCnt); 
        
           dump_memory_contents(MEMTYPE_LMEM, 32'h8000_0110, 32'h8000_0180); 
        
           dump_memory_contents(MEMTYPE_DCCM, `RV_DCCM_SADR, `RV_DCCM_EADR); 
        
           dump_memory_contents(MEMTYPE_ICCM, `RV_ICCM_SADR, `RV_ICCM_EADR); 
        
           $finish;

I was unable to reproduce failures on smoke_test_sha512 with VCS or Verilator simulations in our internal repo. I can look into reproducing on your branch tomorrow.

FYI - I checked out 51bf68d from the dev-msft branch directly and was able to reproduce. The commit prior from #51 passes.

calebofearth · 2023-04-28T17:07:52Z

I've run all 4 of these smoke tests against the caliptra-rtl repository (GitHub, not MSFT internal version) at commit 51bf68d3c2e3a1499f53d31b1b45f144726041aa and still see no failures or timeouts.
What are your steps to reproduce this locally?
FWIW I'm using Verilator 4.228 and GCC 8.2.0, and this is the RISCV GCC we have installed:
riscv64-unknown-elf-gcc-8.2.0-2019.02.0-x86_64-linux-centos6

howardtr · 2023-04-28T19:07:34Z

Tool	Caleb	Howard	CI
Verilator	4.228	5.002	5.002
gcc	8.2.0	12.2.0	12.1.0
riscv64-unknown-elf-gcc	8.2.0	12.2.0	12.1.0

Steps to reproduce:

git clone git@github.com:chipsalliance/caliptra-rtl.git
cd caliptra-rtl
git checkout dev-msft
git checkout 51bf68d3c2e3a1499f53d31b1b45f144726041aa
make verilator CALIPTRA_ROOT=../.. TESTNAME=smoke_test_sha_512

I don't have a quick/easy way to install gcc 8.2.0 on CI / local machine, but I'm trying verilator 4.228 here

(Update: --timing, which was added in 5.xx is needed for QSPI)

I'll give a try to 10.2.0 next, but I think we should upgrade since 8.2.0 was released ~2019

calebofearth · 2023-04-28T20:38:02Z

I will try to reproduce with 5.002, but there's an issue with our installation. I'm hunting down a solution for that, then I'll update here with my results.

calebofearth · 2023-05-02T17:06:33Z

Quick Update: still waiting on a good tool installation for our server.

calebofearth · 2023-05-03T00:49:25Z

@howardtr I reproduced the hang in smoke_test_sha_accel using Verilator 5.002.
The issue appears to be a verilator bug.
This code is the source of the issue:

always_comb begin
    for (int i = 0; i < WDT_TIMEOUT_PERIOD_NUM_DWORDS; i++) begin
        timer1_timeout_period[i][31:0] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[i].timer1_timeout_period.value;
        timer2_timeout_period[i][31:0] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[i].timer2_timeout_period.value;
    end
end

I ran Verilator with waves and dumped the .vcd file. The RHS signal from this assignment initializes to 0, then acquires the reset value of FFFF_FFFF after the first clock cycle, which it holds for the remainder of the sim.
The LHS of the assignment timer1_timeout_period begins an endless loop of toggling between the value 0 and FFFF_FFFF on every clock edge (on the rising clock edge it goes to FFFF_FFFF, on the falling clock edge it goes back to 0).

The 0 value of the timeout period results in the timer calculating an expiration immediately, which fires an unexpected interrupt, which the uC is never able to clear.

This behavior does not occur for the same logic in VCS.
Looking at the revision history for Verilator is a bit concerning.
Version 5.002 has this note:

Introduce a new combinational logic optimizer (DFG), that can yield significant performance improvements on some designs. [Geza Lore, Shunyao CAD]

Versions 5.004+ have a very large number of bug fixes.
Are you able to run 5.010 to see if this issue persists?

howardtr · 2023-05-03T19:20:14Z

Hey @calebofearth: Thanks for narrowing it down!

tldr - I agree this looks like a verilator bug, but I'd like to propose a (temp?) verilog fix :)

Can we replace that always block with a generate/assign instead so that we can get CI back up and running?

for (genvar i = 0; i < WDT_TIMEOUT_PERIOD_NUM_DWORDS; i++) begin
  assign timer1_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[i].timer1_timeout_period.value;
  assign timer2_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[i].timer2_timeout_period.value;
 end

We can file a bug against verilator in the mean time and add a TODO to revert the code if the always_comb is preferred.

Thoughts?

More details:

The above changes work with 5.002, however there are more changes required to get it to work with > 5.002.

For my own reference, I had to suppress the lint warnings for the following in src/integration/tb/dasm.svi

%Warning-IMPLICITSTATIC: ../..//src/integration/tb/dasm.svi:393:17: Function/task's lifetime implicitly set to static
                                                                  : ... Suggest use 'function automatic' or 'function static'
  393 | function string dasm32_amo( input[31:0] opcode, input tid=0);

I tried both setting the -Wno-IMPLICITSTATIC + replacing with function automatic, but once I got past that I then ran into another error:

/usr/local/bin/verilator-v5.010/share/verilator/include/verilated_timing.h:417:68: error: ‘struct VlCoroutine::VlPromise’ has no member named ‘m_continuation’
  417 |     void await_suspend(std::coroutine_handle<> coro) { m_promisep->m_continuation = coro; }

To get past this, I then replaced -std=c++11 -> -std=c++17 in the Makefile. Seems like the coroutines associated with the --timing flag aren't happy with c++11. (At this point I thought about removing the --timing flag altogether, but it would require a rewrite of the spiflash.sv model. )

calebofearth · 2023-05-03T20:11:58Z

Interesting, thanks for running the experiment! The fix looks fine to me, reality is this will be permanent :) But it's functionally equivalent, I have no complaints.

We probably can't file this bug against Verilator until we've reproduced in the latest (5.010), but it would be handy to track the above issues with upgrading in an Issue against caliptra-rtl.

Last comment: The verilator CI workflow reports a "pass" for this test, even when it has timed out per visual inspection.
My recommendation: All of our Firmware-based smoke tests will print out the string "* TESTCASE PASSED" if the test works correctly, "* TESTCASE FAILED" if an explicit error is detected, and probably neither string in a timeout scenario. Can you update the Verilator Workflow to parse output for these strings to compute pass/fail?

howardtr · 2023-05-03T22:30:26Z

Ack - I've tested against 5.002, 5.004, 5.006, and 5.010, which all exhibit the same issue with the always_comb. So we can file a bug against verilator latest. I tried to reproduce a small test case w/o luck (see below).

In regards to the pass for a timeout where there is no string, I'll update the $display -> $error, which should provide a an non-zero error exit code to the workflow.

PASSED Condition:

* TESTCASE PASSED

Finished : minstret = 5452, mcycle = 32720
See "exec.log" for execution trace with register updates..

- /usr/local/google/home/howardtr/chips/dev-goog-ci/caliptra-rtl/tools/scripts/../../src/integration/tb/caliptra_top_tb_services.sv:524: Verilog $finish

VerilatorTB: End of sim
$ echo $?
0

FAILED Condition:

* TESTCASE FAILED
 -- Extending simulation for 100 clock cycles to capture ending waveform
[1322000] %Error: caliptra_top_tb_services.sv:515: Assertion failed in TOP.caliptra_top_tb.tb_services_i: ERROR! FW attempted to end the simulation with SUCCESS after previously requesting to end the sim with FAILURE!
%Error: src/integration/tb/caliptra_top_tb_services.sv:515: Verilog $stop
Aborting...
make: *** [Makefile:183: verilator] Aborted
$ echo $?
2

TIMEOUT Condition (forcing MAX_CYCLE to a small value)

[20000] %Error: caliptra_top_tb_services.sv:491: Assertion failed in TOP.caliptra_top_tb.tb_services_i: Hit max cycle count (2000) .. stopping
%Error: src/integration/tb/caliptra_top_tb_services.sv:491: Verilog $stop
Aborting...
make: *** [Makefile:183: verilator] Aborted
$ echo $?
2

More details:
I tried to create a small test to reproduce without any luck: I think this should be the bulk of the changes that affect this timeout, but I might be missing other subtle things happening in the background.

  // Parameters
  parameter WDT_TIMEOUT_PERIOD_NUM_DWORDS = 2;

  // Struct Definitions
  typedef struct packed{
    logic [31:0] value;
  } soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__timer1_timeout_period__out_t;

  typedef struct packed{
    soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__timer1_timeout_period__out_t timer1_timeout_period;
  } soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__out_t;

   typedef struct packed{
      logic [31:0] value;
  } soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__timer2_timeout_period__out_t;
 
  typedef struct packed{
      soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__timer2_timeout_period__out_t timer2_timeout_period;
  } soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__out_t;

 typedef struct packed{
    soc_ifc_reg__CPTRA_WDT_TIMER1_TIMEOUT_PERIOD__out_t [2-1:0]CPTRA_WDT_TIMER1_TIMEOUT_PERIOD; 
    soc_ifc_reg__CPTRA_WDT_TIMER2_TIMEOUT_PERIOD__out_t [2-1:0]CPTRA_WDT_TIMER2_TIMEOUT_PERIOD;
  }soc_ifc_reg__out_t;

  soc_ifc_reg__out_t soc_ifc_reg_hwif_out;

  // 
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer1_timeout_period;
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer2_timeout_period;
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer1_timeout_period_good;
  logic [WDT_TIMEOUT_PERIOD_NUM_DWORDS-1:0][31:0] timer2_timeout_period_good;

// Notworking
always_comb begin
  for (int i = 0; i < WDT_TIMEOUT_PERIOD_NUM_DWORDS; i++) begin
    timer1_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[i].timer1_timeout_period.value;
    timer2_timeout_period[i] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[i].timer2_timeout_period.value;
  end
end

// Working
for (genvar j = 0; j < WDT_TIMEOUT_PERIOD_NUM_DWORDS; j++) begin
  assign timer1_timeout_period_good[j] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER1_TIMEOUT_PERIOD[j].timer1_timeout_period.value;
  assign timer2_timeout_period_good[j] = soc_ifc_reg_hwif_out.CPTRA_WDT_TIMER2_TIMEOUT_PERIOD[j].timer2_timeout_period.value;
end

  always @(posedge clk_tb) begin
    if(!reset_n_tb) begin
      soc_ifc_reg_hwif_out <= '0; // all zeros
    end else begin
      soc_ifc_reg_hwif_out <= '1; // all ones
    end
  end

calebofearth · 2023-05-03T22:57:02Z

Ah, great! Didn't realize you had gotten past all the friction to actually run 5.010 - that's very helpful. Filing to the Verilator folks is a good idea then :)

The Verilator bug has to be some perfect convergence of issues - several other locations use similar always_comb + for-loop syntax with no problem. I couldn't narrow it down further either. Maybe the other branch cases in the driver FF cause the confusion.

Changing $display to $error looks good to me!
It might also make sense to generate an output log, and fail the test if the TESTCASE PASSED string is not present at all - I'm not sure when we'd ever see a test end without timing out or printing TESTCASE FAILED, but it seems like a possible failure mode to see at some point (infinite loop somewhere prevents timestep advancing?). Requiring TESTCASE PASSED would be more of a stringent checker.

howardtr · 2023-05-04T16:46:55Z

Ack - will generate an output log and check the output. Closing this out for now since this is at least passing within dev-goog

howardtr assigned calebofearth Apr 25, 2023

This was referenced Apr 26, 2023

CI test matrix verilator cache fail on new branches #69

Closed

Updates to enable regression testing with dev-msft and dev-integrate #75

Merged

howardtr mentioned this issue May 3, 2023

Moving tool installation to step prior to matrix generation #70

Merged

howardtr mentioned this issue May 3, 2023

Update Verilator to latest revision #79

Closed

howardtr closed this as completed May 4, 2023

howardtr mentioned this issue May 4, 2023

Adding check for "TESTCASE PASSED" string at the end of CI test #81

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

howardtr commented Apr 25, 2023

calebofearth commented Apr 28, 2023

calebofearth commented Apr 28, 2023

howardtr commented Apr 28, 2023

calebofearth commented Apr 28, 2023 •

edited

Loading

howardtr commented Apr 28, 2023 •

edited

Loading

calebofearth commented Apr 28, 2023

calebofearth commented May 2, 2023

calebofearth commented May 3, 2023

howardtr commented May 3, 2023

calebofearth commented May 3, 2023

howardtr commented May 3, 2023

calebofearth commented May 3, 2023

howardtr commented May 4, 2023

verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

verilator smoke test (sha512/sha256/iccm_lock/sha_accel) failures #71

Comments

howardtr commented Apr 25, 2023

calebofearth commented Apr 28, 2023

calebofearth commented Apr 28, 2023

howardtr commented Apr 28, 2023

calebofearth commented Apr 28, 2023 • edited Loading

howardtr commented Apr 28, 2023 • edited Loading

calebofearth commented Apr 28, 2023

calebofearth commented May 2, 2023

calebofearth commented May 3, 2023

howardtr commented May 3, 2023

calebofearth commented May 3, 2023

howardtr commented May 3, 2023

calebofearth commented May 3, 2023

howardtr commented May 4, 2023

calebofearth commented Apr 28, 2023 •

edited

Loading

howardtr commented Apr 28, 2023 •

edited

Loading