
bug fix to have save point weight file be different name #1357

Merged

Conversation

@JessicaMeixner-NOAA (Collaborator) commented Jan 26, 2025:

Pull Request Summary

A bug fix for #1350

Description

On some machines, for unstructured grid cases such as:
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p srun -n 24 ../model ww3_tp2.6
processor 1 was so much faster than the other processors that the NetCDF file writing out the point output existed for some processors but not all. This caused the model to hang. We did not see this on every machine.

To fix this issue, I have renamed the output file so that it has a different file name. On Hercules with Intel, this fixed the issue. Additional testing is needed to ensure this fixes everyone's issue.
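
For illustration, here is a minimal, hypothetical sketch of the failure mode described above. This is not the WW3 source; the file name, program structure, and surrounding logic are assumptions. The idea: if ranks branch on whether a weights file exists at the moment they check, a fast writer can make the check return true on some ranks and false on others, and if the two branches then reach different collective calls, the run hangs.

```fortran
! Hypothetical sketch of the race condition; not WW3 code, file name is made up.
program existence_branch_hang
  use mpi
  implicit none
  integer :: ierr, rank
  logical :: have_wgts

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Each rank checks for the save point weight file independently. If another
  ! rank has just (re)written a file with this same name, some ranks see it
  ! and some do not.
  inquire(file='point_weights.nc', exist=have_wgts)

  if (have_wgts) then
    ! Ranks that saw the file enter a collective call...
    call MPI_Barrier(MPI_COMM_WORLD, ierr)
  end if
  ! ...while ranks that did not see it skip the call, so the barrier never
  ! completes and the model hangs.

  call MPI_Finalize(ierr)
end program existence_branch_hang
```

The actual change in this PR is only the renaming: writing the freshly computed weights under a name distinct from the file being checked for keeps the existence test unambiguous. The sketch just illustrates why identical input and output names are hazardous.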

Issue(s) addressed

Fixes #1350 (ww3_tp2.6 regression test hanging)

Commit Message

bug fix to have save point weight file be different name

Check list

Testing

  • How were these changes tested? With the regression test matrix.
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Not all regression tests use this feature.
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)? Yes: Hera with Intel, Hercules with Intel, Hercules with GNU, Orion with Intel.
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)

Only expected changes.

  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

Hera comparison with develop from 20241213

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (14 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (17 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (17 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_a                     (0 files differ)
ww3_tp2.17/./work_c                     (0 files differ)
ww3_tp2.17/./work_b                     (0 files differ)
ww3_tp2.19/./work_1B_a                     (0 files differ)
ww3_tp2.19/./work_1A_a                     (0 files differ)
ww3_tp2.19/./work_1C_a                     (0 files differ)
ww3_tp2.21/./work_ma                     (0 files differ)
ww3_tp2.21/./work_b_metis                     (0 files differ)
ww3_tp2.21/./work_a                     (0 files differ)
ww3_tp2.21/./work_mb                     (0 files differ)
ww3_tp2.21/./work_b                     (0 files differ)
ww3_tp2.6/./work_ST0                     (0 files differ)
ww3_tp2.6/./work_ST4                     (0 files differ)
ww3_tp2.6/./work_pdlib                     (0 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_tp2.7/./work_ST0                     (0 files differ)
ww3_ts4/./work_ug_MPI                     (0 files differ)
ww3_ufs1.1/./work_unstr_b                     (0 files differ)
ww3_ufs1.1/./work_unstr_a                     (0 files differ)
ww3_ufs1.1/./work_unstr_c                     (0 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1 - Can you try this bugfix on your machine?

I should have more info and test results on my end by tomorrow.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1 - I was incorrect about this bug-fix. It worked once, but didn't after that. I'm closing this PR; I don't think it's worth trying. I'll keep you posted.

@thesser1 (Collaborator) commented Jan 27, 2025 via email.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1: I have now run this on three machines with Intel and all the regtests went through without hanging. I was getting hangs on some of these machines before. I forgot to hit submit on one machine with GNU, so those are running now along with the compare scripts.

I think this is ready to test. This is basically the same fix as yesterday, except I had renamed the wrong filename (the input instead of the output), so I just had a bug in my bugfix; all the comments/descriptions above still apply.

I think this is worth trying on your end now, but I'll continue to keep you posted on the output of the last GNU run and the comparison results if you'd like to wait for those before trying it yourself.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1 - I think this is ready for you to test if you don't mind.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

Okay - I'm still getting hangs with ww3_ufs1.1/work_unstr_a on Hercules with GNU, but no other machine/compiler combination is hanging for me anymore. So I think this might be an unrelated issue, but I'm not 100% sure.

@thesser1 - It would be a helpful data point to know how this branch goes on your end if you have time.

@thesser1 (Collaborator) commented:

Sorry for the delay @JessicaMeixner-NOAA. I will set up and test now.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

Thanks Ty! I'm not sure whether it'll work or not, but I'd definitely appreciate you taking the time to test - and if it doesn't work, any and all error information would be really helpful.

@thesser1 (Collaborator) commented Jan 29, 2025:

Not sure this helps you, but on one computer where tp2.6 was failing, it is now running with your fix. On the other computer where tp2.6 failed, the code is still failing with your fix. I did recheck that when I roll the commit back on that computer, the regtest runs on that system. Both systems are using the Intel compiler.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

> Not sure this helps you, but on one computer where tp2.6 was failing, it is now running with your fix. On the other computer where tp2.6 failed, the code is still failing with your fix. I did recheck that when I roll the commit back on that computer, the regtest runs on that system. Both systems are using the Intel compiler.

Any chance you have error messages with line numbers and other details? Thanks again for running things.

I'll continue to prioritize getting a fix for this.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1 - I have pushed some additional fixes. I haven't fully tested everything, but it looks like I'm getting past the previous errors. I should have my testing info by tomorrow morning, if not sooner. I'll post here as soon as I come across anything negative to report.

@thesser1 (Collaborator) commented Feb 5, 2025 via email.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1 I have not had any errors in any of my tests, including on Hercules with GNU where I saw some issues before. For the comparisons that have run, everything looks fine. The last of the compare scripts are running this morning and I'll post results this afternoon. I'm hopeful this PR will fix the issues you have been seeing.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

@thesser1 - @MatthewMasarik-NOAA is going to start reviewing this PR.

If you have had any issues, please let us know (along with any bug information). I'm cautiously hopeful this has fixed things, but I know different computers can have problems not seen on other machines.

@thesser1 (Collaborator) commented Feb 7, 2025:

@JessicaMeixner-NOAA and @MatthewMasarik-NOAA , I just completed testing this, and my local Duck unstructured cases and tp2.6 are working properly with the latest changes. Thanks. I should add that I did not run the full suite of regtests, just tp2.6.

@JessicaMeixner-NOAA (Collaborator, Author) commented:

Thanks @thesser1 ! That's great news. If you run into problems in the future - let us know.

@MatthewMasarik-NOAA (Collaborator) commented:

> @JessicaMeixner-NOAA and @MatthewMasarik-NOAA , I just completed testing this, and my local Duck unstructured cases and tp2.6 are working properly with the latest changes. Thanks. I should add that I did not run the full suite of regtests, just tp2.6.

Thank you, @thesser1! I have the tests running now on my end.

@MatthewMasarik-NOAA (Collaborator) commented:

For testing on Hera I got the same matrixCompSummary.txt output as @JessicaMeixner-NOAA. The matrix tests on Hercules are still running. I'm planning to get this merged today if those tests come back without anything unexpected. I still need to run the matrix.comp, so it could be later in the day depending on when that finishes, but it should go through today.

@MatthewMasarik-NOAA (Collaborator) left a review:

Code review: Pass

Testing: Pass

hera

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (11 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (11 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (14 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_a                     (0 files differ)
ww3_tp2.17/./work_c                     (0 files differ)
ww3_tp2.17/./work_b                     (0 files differ)
ww3_tp2.19/./work_1B_a                     (0 files differ)
ww3_tp2.19/./work_1A_a                     (0 files differ)
ww3_tp2.19/./work_1C_a                     (0 files differ)
ww3_tp2.21/./work_ma                     (0 files differ)
ww3_tp2.21/./work_b_metis                     (0 files differ)
ww3_tp2.21/./work_a                     (0 files differ)
ww3_tp2.21/./work_mb                     (0 files differ)
ww3_tp2.21/./work_b                     (0 files differ)
ww3_tp2.6/./work_ST0                     (0 files differ)
ww3_tp2.6/./work_ST4                     (0 files differ)
ww3_tp2.6/./work_pdlib                     (0 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_tp2.7/./work_ST0                     (0 files differ)
ww3_ts4/./work_ug_MPI                     (0 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)
 
**********************************************************************
************************ identical cases *****************************
**********************************************************************

hera.matrixCompSummary.txt
hera.matrixCompFull.txt
hera.matrixDiff.txt

hercules
The matrix output on hercules reflects the hanging behavior in the develop branch run.

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (14 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (17 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (14 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_a                     (0 files differ)
ww3_tp2.17/./work_b                     (0 files differ)
ww3_tp2.17/./work_c                     (0 files differ)
ww3_tp2.21/./work_ma                     (0 files differ)
ww3_tp2.21/./work_b_metis                     (0 files differ)
ww3_tp2.21/./work_a                     (0 files differ)
ww3_tp2.21/./work_b                     (0 files differ)
ww3_tp2.21/./work_mb                     (0 files differ)
ww3_tp2.6/./work_ST0                     (0 files differ)
ww3_tp2.6/./work_pdlib                     (1 files differ)
ww3_tp2.6/./work_ST4                     (0 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_tp2.7/./work_ST0                     (0 files differ)
ww3_ufs1.3/./work_a                     (2 files differ)
 
**********************************************************************
************************ identical cases *****************************
**********************************************************************

hercules.matrixCompSummary.txt
hercules.matrixCompFull.txt
hercules.matrixDiff.txt

Approved.

@MatthewMasarik-NOAA merged commit 2681f7b into NOAA-EMC:develop on Feb 8, 2025
3 of 6 checks passed
@MatthewMasarik-NOAA (Collaborator) commented:

@JessicaMeixner-NOAA thank you for finding this successful fix. @thesser1 we appreciate the report of this behavior and your help testing.

Linked issue: ww3_tp2.6 regression test hanging