Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 crash when STAB3 air-sea temperature provided #869

Merged
merged 5 commits into from
Jun 8, 2023

Conversation

SanderHulst
Copy link
Contributor

@SanderHulst SanderHulst commented Dec 20, 2022

PR was updated on Dec the 29th after getting to the bottom of the issue

Pull Request Summary

With newer Intel compilers, when air-sea temperature differences are provided for STAB3, a WAVEWATCH will crash when one of the values is NaN.

Description

We are investigating whether a term for atmospheric stability improves our day-to-day forecast. We therefor compiled our WAVEWATCH-6.07 with the Intel OneAPI 2022.1.0 and Intel MPI 2021.6.0 in hybrid mode with the STAB3 switch. The model runs fine as long as ww3_prnc writes the wind field without air-sea temperature difference (WND), even on a cold start. However, as soon as ww3_prnc writes the wind field and the air-sea temperature difference (WNS), then ww3_multi aborts.

Older Intel compiler (<2016?) replaced NaN with 0.0. Later compiler versions do not (I agree). After some digging I found that my problem is caused by a NaN air-sea temperature difference. This parameter is NaN above land and that propagates to the nearest wet points after interpolation to the higher resolution WAVEWATCH grid. I modified the code so that it escapes NaN AS values.

image
The figure above shows the NaN as 0. air sea temperature difference.

One could argue whether NaN air sea temperature differences are to be dealt with in WAVEWATCH. I think that having NaN AS values over land is intuitive, so I'd say yes.

Additional information:

  • I have an isolated testcase where replacing one wind binary file with another triggers the issue
  • I needed to modify the ounf meta data to allow for negative air-sea temperature differences

Issue(s) addressed

Commit Message

handle NaN air-sea temperatures from nearest land points

Check list

Testing

  • How were these changes tested?
    A single enclosed basin, the black sea, was set up. wind fields were generated with WND and WNS flags. The grid was cold started. WAVEWATCH completes the run with WND wind field, but crashes with the WNS wind field. After the fix, both complete.
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?)
    No. I don't think there are regression tests for STAB3.
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)?
    No
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

With newer Intel compilers, when air-sea temperature differences are provided for STAB3, a cold start crashes WAVEWATCH as the action density can be NaN. The NaN propagates into the computation of an index in the SIN4 3D lookup table.
@MatthewMasarik-NOAA
Copy link
Collaborator

@SanderHulst thank you for identifying this issue, and a fix. I've updated the Commit Queue with the PR, please note that with the holidays over the next two weeks processing will be slower, and a realistic timeline is the first week of January.

@SanderHulst
Copy link
Contributor Author

This does not fix everything. Providing a Air-Sea temperature difference results in NaN variance density. Investigating further.

@MatthewMasarik-NOAA
Copy link
Collaborator

Thank you for the update, @SanderHulst.

@SanderHulst
Copy link
Contributor Author

This should be it @MatthewMasarik-NOAA

@MatthewMasarik-NOAA
Copy link
Collaborator

Great, thank you @SanderHulst. We expect to finish processing this next week. Happy holidays!

@ukmo-ccbunney
Copy link
Collaborator

ukmo-ccbunney commented Jan 6, 2023

Hi @SanderHulst
Good catch on this one.
However, I am of the opinion that ww3_prnc (or ww3_prep) programs are at fault here - not the WW3 model itself.
If I understand correctly, the ww3_prnc program is interpolating NaN values from the land on the input grid onto the coastal seapoint cells defined by the model? Is that correct?

If so, I feel that the interpolation in the ww3_prnc program should be updated to handle missing land values (NaNs in this case) and not include them in the grid interpolation.

I feel like the wave model should expect that it's input fields should have valid values defined on all wet points and not have to consider the possibility of NaNs in its input fields - it is the job of the pre-processor programs to ensure these fields are correct.

That's my opinion - but am happy to be challenged on it!

P.S. at the Met Office we don't use the ww3_prep or ww3_prnc programs - I have my own pre-processing program that handles out bespoke MO file formats. This program does what I have described above and ignores NaN vales on the land when doing its bi-linear interpolation.

@MatthewMasarik-NOAA
Copy link
Collaborator

@ukmo-ccbunney thank you much for your insight on the problem, and affects for UKMO. We are continuing to consider this. @SanderHulst I've updated the commit queue with a projected date of next Wednesday, 01/11.

@SanderHulst
Copy link
Contributor Author

SanderHulst commented Jan 9, 2023

Hi @SanderHulst Good catch on this one. However, I am of the opinion that ww3_prnc (or ww3_prep) programs are at fault here - not the WW3 model itself. If I understand correctly, the ww3_prnc program is interpolating NaN values from the land on the input grid onto the coastal seapoint cells defined by the model? Is that correct?

Hi @ukmo-ccbunney, yes, that is correct.

If so, I feel that the interpolation in the ww3_prnc program should be updated to handle missing land values (NaNs in this case) and not include them in the grid interpolation.

I feel like the wave model should expect that it's input fields should have valid values defined on all wet points and not have to consider the possibility of NaNs in its input fields - it is the job of the pre-processor programs to ensure these fields are correct.

I do see where you're coming from and I agree that a model may expect that it's input is valid. In this case I wonder what is wise though. Zero is not the same NaN as the latter indicates that it is unknown. ww3_prnc could indeed be modified so that it does a masked interpolation; valid values over water would be pushed out towards the coast. But then there is still the practical problem when:

  • input resolution matches model resolution
  • the atmospheric and hydrodynamic models do not agree on the landmask, leading to an invalid AST

My opinion is that the source term should come up with a sensible stress when AST is not known, not the (unwary) modeler.

@ukmo-ccbunney
Copy link
Collaborator

My opinion is that the source term should come up with a sensible stress when AST is not known, not the (unwary) modeler.

@SanderHulst I can see where you are coming from here and I agree that some "defensive" programming in the wave model is probably a sensible thing, especially if the AST field could end up with an undefined value where the wind speed is defined.

Is it worth me running the regtests here (considering we don't have a test that covers this change)?

@MatthewMasarik-NOAA
Copy link
Collaborator

Update: Some delay due to the AMS conf this week, more time is needed for review. Thanks for your patience.

@ukmo-ccbunney
Copy link
Collaborator

I have run the regtests, and they are mostly all ok.

However, I am seeing differences in all the *_spec.nc files. The difference is just a tiny difference between the frequency values. I don't see how this could be related to this PR, but would be interested if anyone else experiences it (I am using the GNU compiler). It is possible it is just an oddity of our system. I am going to rerun with a different compiler.

Example diff below (small difference in second frequency value)

***
/home/d02/frey/WW3/SanderHulst/regtests/output/mww3_test_01/work_PR2_UNO_MPI/ww3.196806_spec.nc_diff.txt
***
248c248
<  frequency2 = 0.0438585, 0.04824436, 0.05306879, 0.05837567, 0.06421325,
---
>  frequency2 = 0.0438585, 0.04824435, 0.05306879, 0.05837567, 0.06421325,

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @SanderHulst, I posted once on the issue, but thought I'd follow up here to see if you could share the test case you mentioned in the #868 header?

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @SanderHulst, we're waiting for access to your test case before moving on with this PR. Please let us know how we can access it.

@SanderHulst
Copy link
Contributor Author

Good morning,

I was offline for a while, stuff happened. I have a tarball of 1.6Gb. Would it be most convenient to send via wetransfer?

Sander

@JessicaMeixner-NOAA
Copy link
Collaborator

@SanderHulst Thanks for getting back to us when you got a chance. You can send the files to me (Jessica.Meixner noaa.gov) and @MatthewMasarik-NOAA ( Matthew.masarik noaa.gov ) assuming it'll send through email or coordinate via email another way to send that.

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @SanderHulst. Yes, email won't do the trick for 1.6GB. Please email I or Jessica and we can coordinate offline the easiest way to transfer your test case.

We'd like to be able to confirm your test case and if all looks well merge the PR. We're also interested in having a fix within ww3_prnc as @ukmo-ccbunney suggested, so you will see an issue created regarding that. Having both will provide an extra layer of protection for the unwary modeler :)

@JessicaMeixner-NOAA
Copy link
Collaborator

Thanks for sending the updated files @SanderHulst ! We'll review this PR as soon as possible. Looks like all the info we need is available now though. I'll let you know if we end up missing something.

@MatthewMasarik-NOAA
Copy link
Collaborator

Update: @SanderHulst and all, I apologize for the very long turn around time on this PR. I had some trouble trying to reproduce the error, and have done more thorough testing as well. Fortunately I think we are close to wrapping it up.

I've run the regression tests for both intel and gnu and didn't find any unknown non-b4b tests. I wanted to check in with @ukmo-ccbunney who had mentioned some really small differences in *_spec.nc. Do you have any updated impressions if the differences are likely do to this PR, or not related?

@ukmo-ccbunney
Copy link
Collaborator

I've run the regression tests for both intel and gnu and didn't find any unknown non-b4b tests. I wanted to check in with @ukmo-ccbunney who had mentioned some really small differences in *_spec.nc. Do you have any updated impressions if the differences are likely do to this PR, or not related?

@MatthewMasarik-NOAA I have a feeling that the differences I saw are likely not directly related to the changes in this PR but some other existing issue (possibly related to unitialised variables somewhere? I don't know)

I've not rerun the regtests as develop has moved on quite a bit since then and I don't believe this PR has been bought up-to-date with develop yet?

@ukmo-ccbunney
Copy link
Collaborator

So - I tried merging in develop to my local copy of @SanderHulst's branch to bring it up-to-date and then re-ran the mww3_test_01 regtest metioned above.

I still get the same tiny differences in frequency2 for the ww3.196806_spec.nc file.

I am going to try on another compiler.

@ukmo-ccbunney
Copy link
Collaborator

P.S. FREQ2 is calculated in ww3_ounp like this:

    FREQ(1:NK)=SIG(1:NK)*TPIINV
    FREQ1(1:NK)=FREQ(1:NK)-0.5*(FREQ(1:NK)-(FREQ(1:NK)/XFR))
    FREQ2(1:NK)=FREQ(1:NK)+0.5*(-FREQ(1:NK)+(FREQ(1:NK)*XFR))
    FREQ1(1)=SIG(1)*TPIINV
    FREQ2(NK)=SIG(NK)*TPIINV

so is entirely dependent on SIG, XFR and TPIINV (parameter), so it is unlikely that there is any inadvertent change to those variables otherwise it would likely show up elsewhere.

I am running with debug flags to see if there is any array overruns, or similar misbehaviour...

@MatthewMasarik-NOAA
Copy link
Collaborator

So - I tried merging in develop to my local copy of @SanderHulst's branch to bring it up-to-date and then re-ran the mww3_test_01 regtest metioned above.

I still get the same tiny differences in frequency2 for the ww3.196806_spec.nc file.

I am going to try on another compiler.

Hi @ukmo-ccbunney, I appreciate you bringing the branch up-to-date and re-running the regtest.

@SanderHulst, could you update your branch to current? It should be a straightfoward sync.

@MatthewMasarik-NOAA
Copy link
Collaborator

P.S. FREQ2 is calculated in ww3_ounp like this:

    FREQ(1:NK)=SIG(1:NK)*TPIINV
    FREQ1(1:NK)=FREQ(1:NK)-0.5*(FREQ(1:NK)-(FREQ(1:NK)/XFR))
    FREQ2(1:NK)=FREQ(1:NK)+0.5*(-FREQ(1:NK)+(FREQ(1:NK)*XFR))
    FREQ1(1)=SIG(1)*TPIINV
    FREQ2(NK)=SIG(NK)*TPIINV

so is entirely dependent on SIG, XFR and TPIINV (parameter), so it is unlikely that there is any inadvertent change to those variables otherwise it would likely show up elsewhere.

I am running with debug flags to see if there is any array overruns, or similar misbehaviour...

Hmm. Very interesting. So some more hints it may not be related to this PR. I have tried running the regtests with both intel and gnu on our main RDHPC machine (hera) and did not see any differences, from what I could tell. Is there anything I can run on our end to help verify this?

@SanderHulst
Copy link
Contributor Author

I synced the branch

@MatthewMasarik-NOAA
Copy link
Collaborator

I synced the branch

Awesome. Thanks, @SanderHulst

Copy link
Collaborator

@ukmo-ccbunney ukmo-ccbunney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I've rerun the mww3_test_01/work_PR2_UNO_MPI regtrest using the Cray compiler (as apposed to my usual GNU compiler) and I now don't get any differences.

The fact that we all get slightly different behaviour with respect to b4b reproducability depending on the compiler/hardware vexes me somewhat! Although it's not entirely surprising...

I'm happy to approve this review.

Copy link
Collaborator

@ukmo-ccbunney ukmo-ccbunney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I've rerun the mww3_test_01/work_PR2_UNO_MPI regtrest using the Cray compiler (as apposed to my usual GNU compiler) and I now don't get any differences.

The fact that we all get slightly different behaviour with respect to b4b reproducability depending on the compiler/hardware vexes me somewhat! Although it's not entirely surprising...

I'm happy to approve this review.

@MatthewMasarik-NOAA
Copy link
Collaborator

So, I've rerun the mww3_test_01/work_PR2_UNO_MPI regtrest using the Cray compiler (as apposed to my usual GNU compiler) and I now don't get any differences.

The fact that we all get slightly different behaviour with respect to b4b reproducability depending on the compiler/hardware vexes me somewhat! Although it's not entirely surprising...

I'm happy to approve this review.

@ukmo-ccbunney thank you for this update. I agree it's a bit vexing (thouh not entirely surprising!) to get different behavior for different compiler/hardware. For me using the gnu compiler the test ww3_ts1/./work_ST4_WRT showed up as a non-b4b, though not for intel as usual. Thank you much for your multiple efforts to help me verify this PR.

Copy link
Collaborator

@MatthewMasarik-NOAA MatthewMasarik-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review PASS

Testing PASS

**********************************************************************          
********************* non-identical cases ****************************          
**********************************************************************          
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)              
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)           
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)          
mww3_test_03/./work_PR2_UNO_MPI_d2                     (18 files differ)        
mww3_test_03/./work_PR1_MPI_d2                     (14 files differ)            
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (11 files differ)      
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)       
mww3_test_03/./work_PR3_UNO_MPI_d2                     (12 files differ)        
mww3_test_03/./work_PR2_UQ_MPI_d2                     (14 files differ)         
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)           
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)         
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)                 
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)                 
ww3_ufs1.3/./work_a                     (2 files differ)                        
                                                                                
**********************************************************************          
************************ identical cases *****************************          
**********************************************************************

matrixCompFull.txt
matrixDiff.txt
matrixCompSummary.txt

@MatthewMasarik-NOAA MatthewMasarik-NOAA merged commit 3eb8161 into NOAA-EMC:develop Jun 8, 2023
@MatthewMasarik-NOAA
Copy link
Collaborator

@SanderHulst thank you for this bug fix addressing air sea temperature difference. My apologies for the longer turn around time, I appreciate your patience. And thanks again @ukmo-ccbunney for extra duty in confirming results.

MatthewMasarik-NOAA added a commit to MatthewMasarik-NOAA/WW3 that referenced this pull request Jun 27, 2023
* origin/develop:
  handle NaN air-sea temperatures from nearest land points (NOAA-EMC#869)
  Increase valid_max for f in ounp (NOAA-EMC#1014)
  Bugfix deallocation of invalid memory in ww3_prnc (NOAA-EMC#1016)
  Update to orion intel module path and two typo corrections. (NOAA-EMC#1011)
  Bugfix to out of bounds array write in w3profsmd_pdlib.f90 (NOAA-EMC#1013)
  Simple logic fix for time interpolation of boundary nodes at the end of W3XYPFSNIMP. (NOAA-EMC#1005)
  in w3iors use NSEA instead of NSEAL in serial write/read of VA (NOAA-EMC#954)
  Update documentation for UNST namelist (NOAA-EMC#986)
  In certain coupled configurations, the piece of code testing the coupling frequency to check if 'receive' coupling exchanges need to take place fail, resulting in an infinite loop causing the integration between time zero and the first time step to repeat indefinitely. This check needs to be rewritten, which fixes also issue NOAA-EMC#816 in a simpler way.  (NOAA-EMC#999)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

STAB3 issue on cold start
4 participants