Port SRW App to WCOSS2 #270
Conversation
```diff
@@ -30,7 +30,7 @@ protocol = git
 repo_url = https://github.com/NOAA-EMC/UPP
 # Specify either a branch name or a hash but not both.
 #branch = develop
-hash = 394917e
+hash = fbd41a5
 local_path = src/UPP
 required = True
```
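The hunk above pins UPP to a specific commit via an INI-style `hash = <sha>` line. A minimal sketch of how one might extract that pinned hash, e.g. to compare it against an actual checkout — the sample file below is a throwaway copy of the hunk, and the `git rev-parse` comparison is illustrative only, not part of the SRW tooling:

```shell
# Sketch: extract the pinned hash from an Externals.cfg-style section.
# A throwaway sample mirroring the hunk above is written first.
cat > /tmp/Externals_sample.cfg <<'EOF'
protocol = git
repo_url = https://github.com/NOAA-EMC/UPP
# Specify either a branch name or a hash but not both.
#branch = develop
hash = fbd41a5
local_path = src/UPP
required = True
EOF

# Split "key = value" lines; commented-out keys like "#branch" do not match.
pinned=$(awk -F' *= *' '$1 == "hash" {print $2}' /tmp/Externals_sample.cfg)
echo "pinned UPP hash: $pinned"   # prints: pinned UPP hash: fbd41a5

# To verify an existing checkout matches (illustrative, not run here):
#   test "$(git -C src/UPP rev-parse --short=7 HEAD)" = "$pinned"
```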
Have these hashes of UFS_UTILS and UPP been tested with the SRW App on other platforms than WCOSS2?
@JeffBeck-NOAA, this hash has been tested with PR #266 on other machines. Once this PR is approved, I'll change this hash to the original in this PR. This hash is just for testing.
Tested on wcoss2. Build was successful.
What is the status of these hashes on Cheyenne? @mkavulich, @mark-a-potts, do we need to update hpc-stack or otherwise install newer libraries for this PR and for PR #266?
etc/lmod-setup.sh
Outdated
```diff
@@ -35,6 +35,10 @@ elif [ "$L_MACHINE" = odin ]; then
   module --initial_load --no_redirect restore
   export MODULEPATH="/oldscratch/ywang/external/hpc-stack/modulefiles/mpi/intel/2020/cray-mpich/7.7.16:/oldscratch/ywang/external/hpc-stack/modulefiles/compiler/intel/2020:/oldscratch/ywang/external/hpc-stack/modulefiles/core:/oldscratch/ywang/external/hpc-stack/modulefiles/stack:/opt/cray/pe/perftools/21.02.0/modulefiles:/opt/cray/ari/modulefiles:/opt/cray/pe/craype-targets/default/modulefiles:/opt/cray/pe/modulefiles:/opt/cray/modulefiles:/opt/modulefiles"

+elif [ "$L_MACHINE" = wcoss2 ]; then
+  module purge
+  module load envvar/1.0
```
Can this go to the modulefiles instead? Unless it is loading Lmod and it is something that can be loaded in modulefiles, I think it should go there.
@danielabdi-noaa, when a user logs in to wcoss2, the module `envvar/1.0` is loaded automatically by default. The wcoss2 user wiki says the following in the "Lmod and User Environments" section: *Please note that when doing a module purge you need to reload envvar, `module load envvar/1.0`, in your environment (interactive or batch) to get the appropriate default modules. NCO has implemented the standard of doing a "module reset". This implements the module purge and module load of envvar/1.0 in one step.* So I think this command is better placed here. What do you think?
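The wiki passage quoted above boils down to the following equivalence. This is a WCOSS2-specific fragment: the `module` command only exists on the HPC itself, so it is illustrative rather than runnable elsewhere.

```shell
# On WCOSS2, a plain purge drops the site defaults, which must be reloaded:
module purge
module load envvar/1.0

# NCO's standard single-step form, equivalent to the two commands above:
module reset
```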
I think `module load envvar/1.0` can go at the top of `build_wcoss2_*`. Have you tested this change and found it to cause a problem? It looks like it is loading default modules, which should be no different from loading SRW's custom modulefiles. It is not changing the module system itself, as is the case for example on gaea/odin (going from Cray to Lmod) or on macos, where you have no Lmod when you start. Also, the way you have it now, if the user does a regular `module purge` on the command line, they have to follow it up with `module load envvar/1.0`, which would be avoided if the latter were in `build_wcoss2`. But for this to be the case, you first have to test that loading envvar in `build_wcoss2` works at all ...
Once again, machine specific things need to be isolated to the machine-specific configuration. I agree with @danielabdi-noaa. This belongs in the wcoss configuration module file. It doesn't make sense here, and pollutes the code with machine-specific things in places users and developers don't expect.
@danielabdi-noaa, okay, I'll test it and move it to `build_wcoss2_*` unless it causes any issues.
If you are putting it in the `build_wcoss*` modulefile, then you can do whatever you want there, as that is completely isolated behavior for that particular machine configuration. My objection is to the test for the machine name in a non-machine-specific config/module file.
Do you mean you don't like `module reset`, either? It is an official NCO standard on wcoss2.
There are really two problems. First, manipulation of modules really belongs in the machine-specific modulefiles. The other big problem is the `elif [ "$L_MACHINE" = wcoss2 ]`. That is bad design and should be fixed: the conditional tests in this file for which machine you are on need to be removed from the code/scripts and moved to machine-specific configurations. But it appears that you are just following the existing template here, so I can approve this as-is for now, because fixing this design problem is beyond the scope of your proposed changes.
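A hypothetical sketch of the refactor the reviewer is suggesting: instead of an `elif` chain keyed on the machine name, the setup script sources a per-machine file, so supporting a new platform means adding a file rather than editing shared script logic. The directory layout and the stub standing in for `module reset` are invented for this illustration.

```shell
# Hypothetical dispatch replacing the machine-name elif chain in lmod-setup.sh.
# Each platform keeps its module logic in its own file; a stub echo stands in
# for the real "module reset", which only exists on the HPC.
setup_dir=$(mktemp -d)
echo 'echo "stub: module reset (wcoss2)"' > "$setup_dir/wcoss2.sh"

L_MACHINE=wcoss2
machine_file="$setup_dir/$L_MACHINE.sh"
if [ -f "$machine_file" ]; then
    . "$machine_file"               # runs the platform's setup
else
    echo "Unsupported machine: $L_MACHINE" >&2
fi
# prints: stub: module reset (wcoss2)
```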
@christopherwharrop-noaa, now I see your point. You are saying that the if-statement itself in the `lmod-setup.sh` file causes the issues you mentioned, is that correct? I thought you were talking about the `module load envvar` command. I'll remove 'wcoss2' from the if-statement. Sorry for catching on late! :) :)
@chan-hoo I think it is OK both ways. It is desirable if sourcing `etc/lmod-setup.sh/csh` can be avoided — one less step. On most systems that is actually the case, and all you get if you do source it is the `module purge` in the `else` branch. In my opinion, those two etc files should be restricted to setting up Lmod, and loading default modules (envvar) can be considered part of setting up Lmod. If it were up to me, I would rather have the user load Lmod manually before building SRW :), but I know your stand on this. Thanks for addressing the requests.
etc/lmod-setup.csh
Outdated
```diff
+else if ( "$L_MACHINE" == wcoss2 ) then
+  module reset
+
```
@chan-hoo - This file is problematic from a design point of view because it is unscalable/unmaintainable/unportable/untestable. This stuff should be moved to some kind of individual machine-specific configuration file. But, since you are just following the existing template, I can't argue with this change right now. But this should be flagged as something that needs rethinking with respect to the overall design.
@christopherwharrop-noaa, thank you for the comment. If you still think it is better to separate the two module commands and move envvar to build_wcoss2_intel, please let me know. I'll discuss this with my EMC colleagues.
@gspetro-NOAA - is WCOSS2 being supported as a Level 1 system? It seems like WCOSS2 has been removed from the supported platforms, following the discussion in the SRW App meeting on May 24, 2022 (but I may be wrong).

@natalie-perlin, WCOSS1 (wcoss_dell_p3) will go away in June, so it will be removed from the SRW App soon (maybe in July). Instead, WCOSS2 (cactus/dogwood) will be on the list of tier-1 platforms.

For the release, I think we should leave WCOSS2 off. It is not yet operational, and I am not sure we have all the required conda packages installed there yet. If you like, you can add a "coming soon..." for it, though.

@mark-a-potts, yes, I agree. Neither WCOSS1 (wcoss_dell_p3) nor WCOSS2 will be available in this release version.
I was able to build the codes successfully on WCOSS2.
* Fix artifact creation for branches with / in name (#275): The slash in branch names such as `release/.*` and `feature/.*` is interpreted as a directory separator. This change replaces the "/" character with a "_".
* Include documentation for building/running SRW App on Mac (#240). Squashed commits: updated docs; added git submodule; fix formatting; added new submodule commits; fixed ref links; finished Intro; finish Components & Intro edits; edited Rocoto workflow section of Quickstart; added minor hpc submodule commits; Updates to Rocoto Workflow in Quick Start; add to HPC-stack intro; submodule updates; added submodule docs edits; hpc-stack updates & formatting fixes; hpc-stack intro edits; bibtex attempted fix; add hpc-stack module edits; update sphinxcontrib version; add and update .readthedocs.yaml file (including submodules); update conf.py; minor Intro edits; fixed typos in QS; QS updates; updates to InputOutput and QS; fix I/O doc typos; pull updates to hpc-stack docs; fix table wrapping; updates to QS for cloud; fix QS export statements; QS edits on bind, config; add bullet points to notes; running without rocoto; add HPC-Stack submodule w/docs; split QS into container/non-container approaches; added filepath changes for running in container on Orion, et al.; edits to overview and container QS; moved CodeReposAndDirs.rst info to the Introduction & deleted file; continued edits to SRWAppOverview; combine overview w/non-container docs; finish merging non-container guide & SRWOverview, rename/remove files, update FAQ; minor edits for Intro & QS; updates to BuildRun doc through 3.8.1; edits to Build/Run and Components; remove .gitignore; fix Ch 3 title, 4 supported platform levels note; fix typos, add term links; other minor fixes/suggestions implemented; updated Intro based on feedback; changed SRW to SRW App throughout; update comment to Intro citation; add user-defined vertical levels to future work; add instructions for srw_common module load; update Intro & BuildRunSRW based on Mark's feedback; minor intro updates; 1st and 2nd rounds of jwolff's edits; update QS intro; fix minor physics details; update citation and physics suite name; add compute node allocation info to QS; add authoritative hpc-stack docs to Intro; create MacOS install/build instructions; add MacOS Build/Run instructions; fix MacOS Build/Run details; add MacOS info directly to Build/Run SRW chapter; minor edits; update Include-HPCInstall with mac installation docs; add note re: Terminal.app & bash shell in MacOS section; remove MacInstall file-contents added to BuildRunSRW; update hpc-stack submodule to include mac installation info; add MacOS config & run details; mention need to verify software library version #'s; update hpc-stack-mod; align MacDetails section with PR #238 info; remove gsed & alter related commands; update hpc-stack submodule; typos; switch from env to module load; Update BuildRunSRW.rst; update hpc-stack module docs & MacOS config.sh; update machine file instructions; updates to BuildRun chapter; fix typo. (Co-authored-by: gspetro <gillian.s.petro@gmail.com>; Will Mayfield <59745143+willmayfield@users.noreply.github.com>)
* Update documentation for CSV file containing WE2E test info (#278). Squashed commits: edits to documentation to match latest in code; minor changes to documentation.
* Port SRW App to WCOSS2 (#270). Squashed commits: port to wcoss2; update scripts; ensure platform name from variance; remove wcoss2 from lmod-setup.
* Fix --ccpp option in devbuild.sh (#280). Squashed commits: fix ccpp issue; update script.
* Print usage if machine name is not passed to `lmod-setup.sh/csh` (#262). Squashed commits: print usage() message if machine name is not passed; bug fix with macos modulefiles.
* Back to original hashes (#287)
* Updated the Introduction, build for MacOS (#281). Squashed commits: updated the Introduction, build for MacOS; Update Introduction.rst (some comments removed); Update Introduction.rst (updated data storage requirements for the SRW).
* Update documentation for python plotting scripts (#289): updates the RST documentation files to accompany changes in PR ufs-community/regional_workflow#783 in `regional_workflow`. Depends on that PR; partially resolves issue ufs-community/regional_workflow#781 (still need to introduce tests in `regional_workflow` that run the plotting scripts). Contributors: @mark-a-potts.
* Updating for testing with updated ufs weather model
* added version to wflow for jet
* switched to epic repo for testing
* Updates to modules for cheyenne
* updated module path
* changed hash to branch for testing
* updated to use ufs-community version of WM
* Pointing to PR-branch of regional_workflow for testing
* Updating module files for cheyenne for testing
* Added png explicitly under gaea.
* Update build_cheyenne_intel: corrected loading of system modules (load intel/2022.1, mpt/2.25), updated package versions if newer modules are built.
* Updated pointer to model for testing
* fixed repos
* updated to point at my regional_workflow branch
* updated regional_workflow to release/public-v2

Co-authored-by: Jesse McFarland <jesse@mcfarland.sh>; Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>; gspetro <gillian.s.petro@gmail.com>; Will Mayfield <59745143+willmayfield@users.noreply.github.com>; gsketefian <31046882+gsketefian@users.noreply.github.com>; Chan-Hoo.Jeon-NOAA <60152248+chan-hoo@users.noreply.github.com>; danielabdi-noaa <52012304+danielabdi-noaa@users.noreply.github.com>; Natalie Perlin <68030316+natalie-perlin@users.noreply.github.com>; Mark Potts <mpotts@redlineperf.com>
DESCRIPTION OF CHANGES:
TESTS CONDUCTED:
Tested hashes:
* UFS_UTILS: b6efa86
* ufs weather model: 96dffa1
* UPP: fbd41a5
Build steps:
ISSUE:
Fixes issue mentioned in #249
CONTRIBUTORS:
@BenjaminBlake-NOAA