-
Notifications
You must be signed in to change notification settings - Fork 570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 crash when STAB3 air-sea temperature provided #869
Conversation
With newer Intel compilers, when air-sea temperature differences are provided for STAB3, a cold start crashes WAVEWATCH as the action density can be NaN. The NaN propagates into the computation of an index in the SIN4 3D lookup table.
@SanderHulst thank you for identifying this issue, and a fix. I've updated the Commit Queue with the PR, please note that with the holidays over the next two weeks processing will be slower, and a realistic timeline is the first week of January. |
This does not fix everything. Providing a Air-Sea temperature difference results in NaN variance density. Investigating further. |
Thank you for the update, @SanderHulst. |
This should be it @MatthewMasarik-NOAA |
Great, thank you @SanderHulst. We expect to finish processing this next week. Happy holidays! |
Hi @SanderHulst If so, I feel that the interpolation in the ww3_prnc program should be updated to handle missing land values (NaNs in this case) and not include them in the grid interpolation. I feel like the wave model should expect that it's input fields should have valid values defined on all wet points and not have to consider the possibility of NaNs in its input fields - it is the job of the pre-processor programs to ensure these fields are correct. That's my opinion - but am happy to be challenged on it! P.S. at the Met Office we don't use the ww3_prep or ww3_prnc programs - I have my own pre-processing program that handles out bespoke MO file formats. This program does what I have described above and ignores NaN vales on the land when doing its bi-linear interpolation. |
@ukmo-ccbunney thank you much for your insight on the problem, and affects for UKMO. We are continuing to consider this. @SanderHulst I've updated the commit queue with a projected date of next Wednesday, 01/11. |
Hi @ukmo-ccbunney, yes, that is correct.
I do see where you're coming from and I agree that a model may expect that it's input is valid. In this case I wonder what is wise though. Zero is not the same NaN as the latter indicates that it is unknown. ww3_prnc could indeed be modified so that it does a masked interpolation; valid values over water would be pushed out towards the coast. But then there is still the practical problem when:
My opinion is that the source term should come up with a sensible stress when AST is not known, not the (unwary) modeler. |
@SanderHulst I can see where you are coming from here and I agree that some "defensive" programming in the wave model is probably a sensible thing, especially if the AST field could end up with an undefined value where the wind speed is defined. Is it worth me running the regtests here (considering we don't have a test that covers this change)? |
Update: Some delay due to the AMS conf this week, more time is needed for review. Thanks for your patience. |
I have run the regtests, and they are mostly all ok. However, I am seeing differences in all the Example diff below (small difference in second frequency value)
|
Hi @SanderHulst, I posted once on the issue, but thought I'd follow up here to see if you could share the test case you mentioned in the #868 header? |
Hi @SanderHulst, we're waiting for access to your test case before moving on with this PR. Please let us know how we can access it. |
Good morning, I was offline for a while, stuff happened. I have a tarball of 1.6Gb. Would it be most convenient to send via wetransfer? Sander |
@SanderHulst Thanks for getting back to us when you got a chance. You can send the files to me (Jessica.Meixner noaa.gov) and @MatthewMasarik-NOAA ( Matthew.masarik noaa.gov ) assuming it'll send through email or coordinate via email another way to send that. |
Hi @SanderHulst. Yes, email won't do the trick for 1.6GB. Please email I or Jessica and we can coordinate offline the easiest way to transfer your test case. We'd like to be able to confirm your test case and if all looks well merge the PR. We're also interested in having a fix within ww3_prnc as @ukmo-ccbunney suggested, so you will see an issue created regarding that. Having both will provide an extra layer of protection for the unwary modeler :) |
Thanks for sending the updated files @SanderHulst ! We'll review this PR as soon as possible. Looks like all the info we need is available now though. I'll let you know if we end up missing something. |
Update: @SanderHulst and all, I apologize for the very long turn around time on this PR. I had some trouble trying to reproduce the error, and have done more thorough testing as well. Fortunately I think we are close to wrapping it up. I've run the regression tests for both |
@MatthewMasarik-NOAA I have a feeling that the differences I saw are likely not directly related to the changes in this PR but some other existing issue (possibly related to unitialised variables somewhere? I don't know) I've not rerun the regtests as |
So - I tried merging in I still get the same tiny differences in I am going to try on another compiler. |
P.S. FREQ2 is calculated in ww3_ounp like this:
so is entirely dependent on I am running with debug flags to see if there is any array overruns, or similar misbehaviour... |
Hi @ukmo-ccbunney, I appreciate you bringing the branch up-to-date and re-running the regtest. @SanderHulst, could you update your branch to current? It should be a straightfoward sync. |
Hmm. Very interesting. So some more hints it may not be related to this PR. I have tried running the regtests with both intel and gnu on our main RDHPC machine (hera) and did not see any differences, from what I could tell. Is there anything I can run on our end to help verify this? |
I synced the branch |
Awesome. Thanks, @SanderHulst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I've rerun the mww3_test_01/work_PR2_UNO_MPI
regtrest using the Cray compiler (as apposed to my usual GNU compiler) and I now don't get any differences.
The fact that we all get slightly different behaviour with respect to b4b reproducability depending on the compiler/hardware vexes me somewhat! Although it's not entirely surprising...
I'm happy to approve this review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I've rerun the mww3_test_01/work_PR2_UNO_MPI
regtrest using the Cray compiler (as apposed to my usual GNU compiler) and I now don't get any differences.
The fact that we all get slightly different behaviour with respect to b4b reproducability depending on the compiler/hardware vexes me somewhat! Although it's not entirely surprising...
I'm happy to approve this review.
@ukmo-ccbunney thank you for this update. I agree it's a bit vexing (thouh not entirely surprising!) to get different behavior for different compiler/hardware. For me using the gnu compiler the test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review PASS
Testing PASS
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (18 files differ)
mww3_test_03/./work_PR1_MPI_d2 (14 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (11 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (12 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (14 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (16 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_ufs1.3/./work_a (2 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
@SanderHulst thank you for this bug fix addressing air sea temperature difference. My apologies for the longer turn around time, I appreciate your patience. And thanks again @ukmo-ccbunney for extra duty in confirming results. |
* origin/develop: handle NaN air-sea temperatures from nearest land points (NOAA-EMC#869) Increase valid_max for f in ounp (NOAA-EMC#1014) Bugfix deallocation of invalid memory in ww3_prnc (NOAA-EMC#1016) Update to orion intel module path and two typo corrections. (NOAA-EMC#1011) Bugfix to out of bounds array write in w3profsmd_pdlib.f90 (NOAA-EMC#1013) Simple logic fix for time interpolation of boundary nodes at the end of W3XYPFSNIMP. (NOAA-EMC#1005) in w3iors use NSEA instead of NSEAL in serial write/read of VA (NOAA-EMC#954) Update documentation for UNST namelist (NOAA-EMC#986) In certain coupled configurations, the piece of code testing the coupling frequency to check if 'receive' coupling exchanges need to take place fail, resulting in an infinite loop causing the integration between time zero and the first time step to repeat indefinitely. This check needs to be rewritten, which fixes also issue NOAA-EMC#816 in a simpler way. (NOAA-EMC#999)
PR was updated on Dec the 29th after getting to the bottom of the issue
Pull Request Summary
With newer Intel compilers, when air-sea temperature differences are provided for STAB3, a WAVEWATCH will crash when one of the values is NaN.
Description
We are investigating whether a term for atmospheric stability improves our day-to-day forecast. We therefor compiled our WAVEWATCH-6.07 with the Intel OneAPI 2022.1.0 and Intel MPI 2021.6.0 in hybrid mode with the STAB3 switch. The model runs fine as long as ww3_prnc writes the wind field without air-sea temperature difference (WND), even on a cold start. However, as soon as ww3_prnc writes the wind field and the air-sea temperature difference (WNS), then ww3_multi aborts.
Older Intel compiler (<2016?) replaced NaN with 0.0. Later compiler versions do not (I agree). After some digging I found that my problem is caused by a NaN air-sea temperature difference. This parameter is NaN above land and that propagates to the nearest wet points after interpolation to the higher resolution WAVEWATCH grid. I modified the code so that it escapes NaN AS values.
The figure above shows the NaN as 0. air sea temperature difference.
One could argue whether NaN air sea temperature differences are to be dealt with in WAVEWATCH. I think that having NaN AS values over land is intuitive, so I'd say yes.
Additional information:
Issue(s) addressed
Commit Message
handle NaN air-sea temperatures from nearest land points
Check list
Testing
A single enclosed basin, the black sea, was set up. wind fields were generated with WND and WNS flags. The grid was cold started. WAVEWATCH completes the run with WND wind field, but crashes with the WNS wind field. After the fix, both complete.
No. I don't think there are regression tests for STAB3.
No