Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor of save_MOM6_internal_state #403

Merged
merged 3 commits into from
Jul 29, 2023

Conversation

marshallward
Copy link
Member

This PR refactors the implementation of save_MOM6_internal_state by also including the MOM6 save_restart function, and renaming it to save_MOM_restart, which now acts as a generic placeholder for all MOM restart operations.

The restart_CS is now wholly contained within the top-level MOM control structure (usually MOM_CS) and all restart operations must be routed through MOM_CS.

This PR also includes a commit which creates the restart directory (usually RESTART) if it is missing.

@codecov
Copy link

codecov bot commented Jul 12, 2023

Codecov Report

Merging #403 (f6c5eee) into dev/gfdl (55fc59a) will decrease coverage by 0.01%.
The diff coverage is 53.57%.

❗ Current head f6c5eee differs from pull request most recent head cc3f34d. Consider uploading reports for the commit cc3f34d to get more accurate results

@@             Coverage Diff              @@
##           dev/gfdl     #403      +/-   ##
============================================
- Coverage     38.16%   38.15%   -0.01%     
============================================
  Files           269      269              
  Lines         76773    76786      +13     
  Branches      14127    14131       +4     
============================================
+ Hits          29297    29299       +2     
- Misses        42195    42203       +8     
- Partials       5281     5284       +3     
Files Changed Coverage Δ
src/framework/MOM_get_input.F90 54.76% <0.00%> (-7.41%) ⬇️
src/framework/posix.F90 39.68% <44.44%> (+0.79%) ⬆️
config_src/drivers/solo_driver/MOM_driver.F90 51.04% <50.00%> (+0.21%) ⬆️
src/core/MOM.F90 51.03% <88.88%> (+0.06%) ⬆️
src/user/MOM_wave_interface.F90 1.24% <100.00%> (ø)

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@marshallward
Copy link
Member Author

Since this PR and #375 makes significant modifications to the drivers, I would like to get feedback from users of the other drivers.

Full review and testing is not necessary, the main goal is to check that it won't create any disruptions to existing work or future plans.

(ping: @alperaltuntas @klindsay28 @gustavo-marques @jiandewang, maybe @awallcraft @abozec)

@marshallward marshallward requested a review from adcroft July 13, 2023 18:10
@jiandewang
Copy link

jiandewang commented Jul 17, 2023

@marshallward some of coupled jobs failed at
https://github.com/marshallward/MOM6/blob/new_save_restart/src/core/MOM.F90#L3937-L3940
I think your logic will work for ocean only case but in coupled cases which component will reach its restart writing stage is unpredictable as it depends on the PE # being assigned. For cases when ocean is the fastest (i.e it reaches to the above line ahead of other components), it will work. However, if other component, for example, ice is faster than ocean, then there will be an "ice_restart_xxxx" exists inside RESTART directory already before ocean reaches the above line.
In UFS, we always create RESTART directory on script levle before job is being launched.
I guess FORTRAN doesn't have capability to do "mkdir -p"

@marshallward
Copy link
Member Author

@jiandewang Big thanks for testing this out. I agree with your overall point about potential race conditions between the submodels. I thought that I could avoid it here since save_restart is called at the end of the run.

I am also confused about the error you saw. If RESTART was created during preprocessing, then file_exists should have been true for every ocean rank, and mkdir should never have been called.

Ignoring these issues, there is already a problem with every rank attempting to call mkdir!

There may be way to implement mkdir -p like behavior, but it will require a bit more work (mainly access to the errno inside libc). That might resolve any potential submodel conflicts.

There is quite a bit going on here, and I rushed it out without proper testing. I will do some more work on my end and see what I can learn.

@jiandewang
Copy link

@marshallward I think current code will get rc=0 if RESTART directory doesn't exist or there is nothing inside it. Otherwise rc won't be 0. See HERA for example
(1) failed case: /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_3991/datm_cdeps_control_gefs_intel
if you do ls RESTART, you will see ice restart file exist but no MOM restart file
(2) OK case: /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_3991/cpld_debug_p8_gnu
if you do ls -l RESTART, you will see MOMres is the earliest

@marshallward
Copy link
Member Author

marshallward commented Jul 18, 2023

I identified and fixed two issues:

  • Checking for RESTART in save_restart was too late in many cases, since many of the coupled models use RESTART for other restart-related tasks before the MOM6 restart function is called.

  • Every rank was attempting to call mkdir, which was causing all sorts of race conditions.

The restart directory creation is now done in MOM_get_input rather than save_restart. This is not perfect, but it should be adequate from the perspective of MOM6.

Not sure if these explain the issues reported by @jiandewang, but I would at the least expect some improvement.

(Also, I suppose these two commits are no longer thematically related, but I'll keep them together.)

@jiandewang
Copy link

jiandewang commented Jul 18, 2023

I identified and fixed two issues:

  • Checking for RESTART in save_restart was too late in many cases, since many of the coupled models use RESTART for other restart-related tasks before the MOM6 restart function is called.
  • Every rank was attempting to call mkdir, which was causing all sorts of race conditions.

The restart directory creation is now done in MOM_get_input rather than save_restart. This is not perfect, but it should be adequate from the perspective of MOM6.

Not sure if these explain the issues reported by @jiandewang, but I would at the least expect some improvement.

I think most likely your new approach will work in coupled system as other submodules will not be able to start their 2nd time step until MOM finished its 1st time step. I will have a try later as our machine is down today.

(Also, I suppose these two commits are no longer thematically related, but I'll keep them together.)

@marshallward
Copy link
Member Author

Thank you @jiandewang for going above and beyond in testing here.

I would say that this change moves the creation into initialization, prior to any model timestep. Also, I would hope that the presence of a RESTART would mean that none of these changes are ever invoked.

@jiandewang
Copy link

@marshallward still not working. I carefully re-checked my previous runs and new runs, the behaviors are similar. My previous comments on the relative speed of other submodules may caused the failure is not correct. The difference between previous runs and new runs is that the new runs died earlier than previous ones which is understandable as you moved the call in the model earlier stage. Now the summary for all coupled runs are:

  1. all gnu jobs finished OK
  2. all intel jobs died at mkdir stage
  3. even if I don't pre-make RESTART dir, model still died (although it does generated RESTART directory)
    so the issue is ierr = mkdir(trim(dirs%restart_output_dir), int(o'700')) will return -1 regardless whether RESTART exists or not under intel

@marshallward
Copy link
Member Author

@jiandewang I can also replicate this. Seems like mkdir() works for GNU but not Intel. I'll need to look into why this isn't working. Fortunately the problem appears to be very direct.

Thanks for testing this (and apologies for not testing it out myself!)

@marshallward
Copy link
Member Author

I've found at least one of the problems. inquire() cannot be used to check if directories exist with Intel. (They claim that the standard is ambiguous and provide a non-standard dir= argument extension; I think it really depends on your interpretation of "file").

Our file_exists() is mostly just a wrapper to inquire(), so this would explain why it attempts to run mkdir even when RESTART is already present.

@marshallward
Copy link
Member Author

I've updated the directory existence check, and it now work for us on all of our compilers (GNU, Intel, Nvidia).

@jiandewang
Copy link

I've updated the directory existence check, and it now work for us on all of our compilers (GNU, Intel, Nvidia).

I will test in again on EMC side

Copy link

@raphaeldussin raphaeldussin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one minor cosmetic comment

src/core/MOM.F90 Outdated
type(time_type), intent(in) :: Time !< model time, used in this routine
type(directories), intent(in) :: dirs !< structure with directory paths
type(MOM_control_struct), intent(inout) :: CS !< MOM control structure
type(MOM_restart_CS), pointer :: restart_CSp !< pointer to the restart control
!! structure that will be used for MOM.
!type(MOM_restart_CS), pointer :: restart_CSp !< pointer to the restart control

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep commented out or delete?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! I'll remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant comment should be removed now.

@jiandewang
Copy link

@marshallward it works fine now on EMC side

@marshallward marshallward removed the request for review from adcroft July 25, 2023 14:48
@marshallward
Copy link
Member Author

@marshallward it works fine now on EMC side

Great news! Thanks again @jiandewang for helping to test this out.

@marshallward
Copy link
Member Author

I believe this is ready now, pending final review and regression testing.

@marshallward
Copy link
Member Author

marshallward commented Jul 25, 2023

I suspect this will create a conflict with #408, so I will handle that one first, then amend this one to resolve the conflicts.

Conflict resolved.

Copy link
Member

@Hallberg-NOAA Hallberg-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has been extensively discussed, with a most enthusiastic response. Jiande has kindly provided valuable feed back on the non-FMS coupler interfaces, and I think that there is widespread agreement that this is ready to go. This should be merged in once it reaches the top of the queue.

This patch merges the internal `save_restart` function with the new
`save_MOM6_internal_state` function into a new general MOM restart
function.

It also makes an effort to eliminate `MOM_restart` as a driver
dependency, narrowing the required MOM API for existing and future
drivers.

Also removes the `restart_CSp` argument from `MOM_wave_interface_init`,
since it appeared to be used for nothing.
MOM simulations typically abort of the restart directory (usually
RESTART) are absent.  This patch adds POSIX support for mkdir() and
creates the directory if it is missing.
Using inquire() to check for directory existence is not possible, since
at least one compiler (Intel) does not consider directories to be files.

The inquire call is replaced with a C interface to the POSIX stat()
function.  We do not fully emulate the behavior of stat, but we use its
return value to determine existence of directories.  This provides a
more reliable method for identifying the existence of the directory.

This should resolve many of the observed problems with RESTART creation
in coupled runs.
@Hallberg-NOAA
Copy link
Member

Unfortunately, this PR appears to be failing the pipeline testing (e.g., at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/jobs/107397) with messages that do not clearly indicate what is wrong. The runs themselves seem to be fine, so this might be an issue with the automated testing not working smoothly with these code changes, rather than problems with the code changes per-se. We should move on to handle other PRs in the queue, rather than waiting to sort this one out first.

The relevant error messages from the pipeline testing read:

$ make WORKSPACE=$WORKSPACE test.summary
The following tests report solution regressions:
lustre :
The following tests report diagnostic regressions:
lustre :
make: *** [Makefile:655: test.summary] Error 1

@marshallward
Copy link
Member Author

Every PR has been suffering from these failures. They seem to be failures in MPI_init, which has nothing to do with MOM6. I would recommend re-running it.

@marshallward
Copy link
Member Author

CI is now passing: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/jobs/107399

@marshallward marshallward merged commit 5efad9b into NOAA-GFDL:dev/gfdl Jul 29, 2023
@marshallward marshallward deleted the new_save_restart branch May 8, 2024 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants