-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor of save_MOM6_internal_state
#403
Refactor of save_MOM6_internal_state
#403
Conversation
Codecov Report
@@ Coverage Diff @@
## dev/gfdl #403 +/- ##
============================================
- Coverage 38.16% 38.15% -0.01%
============================================
Files 269 269
Lines 76773 76786 +13
Branches 14127 14131 +4
============================================
+ Hits 29297 29299 +2
- Misses 42195 42203 +8
- Partials 5281 5284 +3
... and 1 file with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Since this PR and #375 makes significant modifications to the drivers, I would like to get feedback from users of the other drivers. Full review and testing is not necessary, the main goal is to check that it won't create any disruptions to existing work or future plans. (ping: @alperaltuntas @klindsay28 @gustavo-marques @jiandewang, maybe @awallcraft @abozec) |
@marshallward some of coupled jobs failed at |
@jiandewang Big thanks for testing this out. I agree with your overall point about potential race conditions between the submodels. I thought that I could avoid it here since I am also confused about the error you saw. If Ignoring these issues, there is already a problem with every rank attempting to call There may be way to implement There is quite a bit going on here, and I rushed it out without proper testing. I will do some more work on my end and see what I can learn. |
@marshallward I think current code will get rc=0 if RESTART directory doesn't exist or there is nothing inside it. Otherwise rc won't be 0. See HERA for example |
283d100
to
649db36
Compare
I identified and fixed two issues:
The restart directory creation is now done in Not sure if these explain the issues reported by @jiandewang, but I would at the least expect some improvement. (Also, I suppose these two commits are no longer thematically related, but I'll keep them together.) |
I think most likely your new approach will work in coupled system as other submodules will not be able to start their 2nd time step until MOM finished its 1st time step. I will have a try later as our machine is down today.
|
Thank you @jiandewang for going above and beyond in testing here. I would say that this change moves the creation into initialization, prior to any model timestep. Also, I would hope that the presence of a |
@marshallward still not working. I carefully re-checked my previous runs and new runs, the behaviors are similar. My previous comments on the relative speed of other submodules may caused the failure is not correct. The difference between previous runs and new runs is that the new runs died earlier than previous ones which is understandable as you moved the call in the model earlier stage. Now the summary for all coupled runs are:
|
@jiandewang I can also replicate this. Seems like Thanks for testing this (and apologies for not testing it out myself!) |
I've found at least one of the problems. Our |
I've updated the directory existence check, and it now work for us on all of our compilers (GNU, Intel, Nvidia). |
I will test in again on EMC side |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just one minor cosmetic comment
src/core/MOM.F90
Outdated
type(time_type), intent(in) :: Time !< model time, used in this routine | ||
type(directories), intent(in) :: dirs !< structure with directory paths | ||
type(MOM_control_struct), intent(inout) :: CS !< MOM control structure | ||
type(MOM_restart_CS), pointer :: restart_CSp !< pointer to the restart control | ||
!! structure that will be used for MOM. | ||
!type(MOM_restart_CS), pointer :: restart_CSp !< pointer to the restart control |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep commented out or delete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks! I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant comment should be removed now.
@marshallward it works fine now on EMC side |
663d466
to
5a4ed63
Compare
Great news! Thanks again @jiandewang for helping to test this out. |
I believe this is ready now, pending final review and regression testing. |
Conflict resolved. |
5a4ed63
to
05bb803
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has been extensively discussed, with a most enthusiastic response. Jiande has kindly provided valuable feed back on the non-FMS coupler interfaces, and I think that there is widespread agreement that this is ready to go. This should be merged in once it reaches the top of the queue.
This patch merges the internal `save_restart` function with the new `save_MOM6_internal_state` function into a new general MOM restart function. It also makes an effort to eliminate `MOM_restart` as a driver dependency, narrowing the required MOM API for existing and future drivers. Also removes the `restart_CSp` argument from `MOM_wave_interface_init`, since it appeared to be used for nothing.
MOM simulations typically abort of the restart directory (usually RESTART) are absent. This patch adds POSIX support for mkdir() and creates the directory if it is missing.
Using inquire() to check for directory existence is not possible, since at least one compiler (Intel) does not consider directories to be files. The inquire call is replaced with a C interface to the POSIX stat() function. We do not fully emulate the behavior of stat, but we use its return value to determine existence of directories. This provides a more reliable method for identifying the existence of the directory. This should resolve many of the observed problems with RESTART creation in coupled runs.
05bb803
to
cc3f34d
Compare
Unfortunately, this PR appears to be failing the pipeline testing (e.g., at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/jobs/107397) with messages that do not clearly indicate what is wrong. The runs themselves seem to be fine, so this might be an issue with the automated testing not working smoothly with these code changes, rather than problems with the code changes per-se. We should move on to handle other PRs in the queue, rather than waiting to sort this one out first. The relevant error messages from the pipeline testing read:
|
Every PR has been suffering from these failures. They seem to be failures in |
CI is now passing: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/jobs/107399 |
This PR refactors the implementation of
save_MOM6_internal_state
by also including the MOM6save_restart
function, and renaming it tosave_MOM_restart
, which now acts as a generic placeholder for all MOM restart operations.The
restart_CS
is now wholly contained within the top-level MOM control structure (usuallyMOM_CS
) and all restart operations must be routed throughMOM_CS
.This PR also includes a commit which creates the restart directory (usually
RESTART
) if it is missing.