Current status of b4b on various machines for regression tests #1030
Comments
I am also going to run with the flag that initializes everything to zero and compare the result against a run without that flag. Any differences could point us to uninitialized variables, which are a frequent cause of non-b4b results in WW3. |
Linking in this issue, in case anyone finds any sneaky uninitialised variables: |
For Red Hat 8 with Intel Icelake using the Intel compiler and MPI from OneAPI 2022.1.2, we have differences like:
Full diff for ww3_tp2.21/work_b_metis:
|
@benoitp-cmc Thank you so much for running this and sharing!! Your ts3 tests are intriguing and the tp2.21 test is confirming the suspicion something is going on with that test. I've occasionally seen some log diffs in tp2.14 as well but they're very very rare for me and it's been a while. Should have reports for noaa machines for the first round of tests later today. |
Here are my results from our Cray HPC, GNU Fortran compiler v4.9.1:
For All the tests with
|
Okay, so for our computer hera, with Intel compiler intel/2022.1.2 and impi/2022.1.2:
********************* non-identical cases ****************************
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
|
I've also had issues with For the |
I'll hopefully be able to report output for |
Whoops, I misunderstood. Jessica is handling the testing for NOAA machines. |
Results from our new HPC (Cray) dev system; GNU Fortran v12.1.0
All the tests with
Differences for Differences for Differences for Interestingly,
|
@ukmo-ccbunney thanks for these updates! I had thought we had just fixed all the unstructured grid mod_def issues, but it seems like we have another one. Based on the last time I looked into that, it's likely an unused or uninitialized variable in the mod_def. I also ran a set of regtests with -init=zero and compared them against a run without this flag; the differences on hera with Intel are:
which is my normal set plus some tp2.21 test cases which have differences in: ww3_tp2.21/./work_b_metis : ww3_tp2.21/./work_mb : ww3_tp2.21/./work_b : The differences in the hs netcdf files seem very small. |
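To put a number on "very small", one option is to summarise the element-wise differences between the same field from the two runs. The sketch below is a minimal Python illustration, not part of the WW3 regtest tooling. The function name `diff_summary` is hypothetical, and it assumes the values (for example hs) have already been read from the two output files into plain sequences of floats.

```python
def diff_summary(a, b):
    """Summarise element-wise differences between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    n_diff = 0      # number of elements that are not exactly equal
    max_abs = 0.0   # largest absolute difference
    max_rel = 0.0   # largest difference relative to the larger magnitude
    for x, y in zip(a, b):
        if x == y:
            continue
        n_diff += 1  # note: a NaN on either side is counted as differing
        d = abs(x - y)
        max_abs = max(max_abs, d)
        scale = max(abs(x), abs(y))
        if scale > 0.0:
            max_rel = max(max_rel, d / scale)
    return {"n_diff": n_diff, "max_abs": max_abs, "max_rel": max_rel}


# Hypothetical values standing in for hs from two runs of the same test.
print(diff_summary([1.0, 2.0, 3.0], [1.0, 2.0, 3.5]))
```

A result with `n_diff` of zero corresponds to b4b for that field. A non-zero count with a tiny `max_rel` is still not b4b, but it suggests round-off-level noise rather than a gross error.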
@JessicaMeixner-NOAA A good test now would be to run with the When I did this, the |
Tests are running right now! Should have results for this in a few hours. Also running all of this on another machine and with --init=snan which does cause some crashes, so not sure if those will provide useful results or not. |
Well - that just lends more evidence that something somewhere is using an uninitialised variable!! The location of the crash might give a hint? We might be able to backtrack to the offending variable :) |
BTW, w.r.t. compile-time flags for initialising variables, I just noted this in the GNU Fortran manual:
which is saying that variables in derived types are not initialised with I thought it was worth mentioning in case your compiler has similar behaviour. |
@ukmo-ccbunney I was reading a fair amount about which compiler flag to use for this, and there's definitely some subtleties with it that I'm probably not fully appreciating. |
So - I have run ww3_tp2.21 with
This is good! We might be able to back track this NaN to the offending uninitialized variable! |
Ok.... I think the culprit is the VD (and possibly VS) arrays in w3srce. Lines 1446 to 1448 in 3eb8161
However, the loop is through The next time that VD is used (in ww3_tp2.21) is in the source term increment loop here: Lines 1562 to 1566 in 3eb8161
which is looping over the whole spectrum (IK and ITH). This is where we are hitting our NaN values (or uninitialised values when we don't compile with -init-real=snan)
BTW - I am testing for NaN values by checking that I've added an initialisation for VD and VS in w3srce and my test now completes without hitting any NaNs!! I'm just going to see whether this gives b4b now when compared against itself (and no -init-zero flags). Fingers crossed. |
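The failure pattern described in this comment can be sketched in a few lines. This is a Python illustration rather than the WW3 Fortran: the array sizes, the sub-range that the first loop covers, and the function names are all hypothetical, and NaN is used as a stand-in for what a signalling-NaN initialisation flag would place in unset memory. The `v != v` test is a common NaN idiom and is an assumption here, not necessarily the check used above.

```python
import math

NK, NTH = 4, 3        # hypothetical spectrum size (wavenumbers x directions)
NSPEC = NK * NTH


def fill_partial(vd, n_filled):
    """Assign only the first n_filled entries, like a loop over a sub-range."""
    for i in range(n_filled):
        vd[i] = -0.1 * (i + 1)
    return vd


def first_nan(values):
    """Return the index of the first NaN, or -1 if there is none."""
    for i, v in enumerate(values):
        if v != v:    # NaN is the only float that is not equal to itself
            return i
    return -1


# Buggy pattern: the array starts out unset (NaN stands in for garbage),
# and the first loop only covers part of the spectrum.
vd_bug = fill_partial([math.nan] * NSPEC, n_filled=NSPEC - NTH)

# Fixed pattern: initialise the whole array before the partial loop.
vd_fix = fill_partial([0.0] * NSPEC, n_filled=NSPEC - NTH)

# A later loop over the whole spectrum reads every element.
print(first_nan(vd_bug))   # index of the first element the partial loop never set
print(first_nan(vd_fix))   # -1: every element has a defined value
```

Without a NaN sentinel, the unset elements hold whatever happened to be in memory, which is why two otherwise identical runs can disagree.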
Good news - it looks like that fixed the B4B issue I was seeing in ww3_tp2.21. https://github.com/ukmo-waves/WW3/tree/bf/srce_uninit_vars I am going to run some more tests. |
@ukmo-ccbunney very exciting work! I've been working through @benoitp-cmc pr #1019. From that perspective I'll try a set of tests with your branch merged into Benoit's. I can let you know the outcome. |
Thanks @ukmo-ccbunney! I will run tests w/develop and with PR #1010 as well -- maybe that will help that branch too! |
@ukmo-ccbunney I wanted to update my earlier comment regarding @benoitp-cmc's PR. The tests I'd done originally did not pick up the differences in |
Results of regtests (compared against themselves) after VD/VS initialisation bug fix.
|
@ukmo-ccbunney I have run |
@ukmo-ccbunney apologies for the slow reply with the holiday here in the US. When I ran your bf branch and compared it to develop, I got differences in tp2.21. But when I ran the new PR #1010 with the bug fix included, the bf branch and that PR+fix matched, as expected!!! |
Excellent - that's welcome news! |
That's great and also the result I was hoping for! When time allows, I will try running some other regtests with |
@ukmo-ccbunney was there more testing you wanted to do with your https://github.com/ukmo-waves/WW3/tree/bf/srce_uninit_vars branch? If not, and you make a PR from it, we can work on that PR and then work to merge PR #1010. I will also try to run more regtests with -init-real=snan when I get a chance. |
I am just running the regtests against develop (rather than itself), then I will raise a PR. |
To get a snapshot of the tests that are not b4b on different machines, I'm requesting that @ukmo-ccbunney @thesser1 @mickaelaccensi (and anyone else who wants to volunteer to run and share) run the full set of regression tests on the develop branch twice and then report the results. If you run on multiple machines or compilers, feel free to share that as well.
Note that some tests are known not to be b4b and are not expected to match, and others have long-standing known issues. However, a few seem to have come up recently, so we want to understand what those are.
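For anyone comparing two runs by hand, the following is a minimal Python sketch of a bit-for-bit check between two output directories. It is not the comparison script shipped with the WW3 regtests; the function names and the directory layout are assumptions for illustration only.

```python
import hashlib
import sys
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def non_b4b_files(run_a, run_b):
    """List relative paths that differ, or that exist in only one run directory."""
    files_a = {p.relative_to(run_a).as_posix() for p in run_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(run_b).as_posix() for p in run_b.rglob("*") if p.is_file()}
    differing = list(files_a ^ files_b)          # present in only one run
    for rel in files_a & files_b:                # present in both: compare bytes
        if file_digest(run_a / rel) != file_digest(run_b / rel):
            differing.append(rel)
    return sorted(differing)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical paths): python compare_runs.py run1/ run2/
    for rel in non_b4b_files(Path(sys.argv[1]), Path(sys.argv[2])):
        print(rel)
```

A byte-level comparison is deliberately strict, so it will also flag files that legitimately change from run to run, such as logs containing timestamps. Those would need to be filtered out or compared separately.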