
Port SRW App to WCOSS2 #270

Merged: 5 commits into ufs-community:develop on May 31, 2022

Conversation

chan-hoo (Collaborator)

DESCRIPTION OF CHANGES:

TESTS CONDUCTED:

  • Tested hashes:
    UFS_UTILS: b6efa86
    ufs-weather-model: 96dffa1
    UPP: fbd41a5

  • Build steps:

  1. git clone -b feature/wcoss2 https://github.com/chan-hoo/ufs-srweather-app
  2. cd ufs-srweather-app
  3. ./manage_externals/checkout_externals
  4. ./devbuild.sh --platform=wcoss2
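The four build steps above can be collected into a single script. This is only an illustrative sketch: the branch, repository URL, and platform name are taken verbatim from the steps above, while the `srw_build_wcoss2` wrapper and the `DRY_RUN` convention are hypothetical additions so the sequence can be inspected before anything is cloned.

```shell
# Dry-run wrapper around the build steps; set DRY_RUN=0 to execute for real.
srw_build_wcoss2() {
    run() {
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "+ $*"   # dry run: print the command instead of running it
        else
            "$@"          # real run: execute the command in this shell
        fi
    }
    run git clone -b feature/wcoss2 https://github.com/chan-hoo/ufs-srweather-app
    run cd ufs-srweather-app
    run ./manage_externals/checkout_externals
    run ./devbuild.sh --platform=wcoss2
}
```

With `DRY_RUN` left at its default of 1, calling `srw_build_wcoss2` just prints the four commands, one per line, each prefixed with `+ `.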

ISSUE:

Fixes issue mentioned in #249

CONTRIBUTORS:

@BenjaminBlake-NOAA

@@ -30,7 +30,7 @@ protocol = git
 repo_url = https://github.com/NOAA-EMC/UPP
 # Specify either a branch name or a hash but not both.
 #branch = develop
-hash = 394917e
+hash = fbd41a5
 local_path = src/UPP
 required = True
Collaborator

Have these hashes of UFS_UTILS and UPP been tested with the SRW App on other platforms than WCOSS2?

Collaborator Author (chan-hoo)

@JeffBeck-NOAA, this hash has been tested with PR #266 on other machines. Once this PR is approved, I'll change this hash back to the original. It is just for testing.

Collaborator

Tested on wcoss2. Build was successful.

Collaborator

What is the status of these hashes on Cheyenne? @mkavulich, @mark-a-potts, do we need to update hpc-stack or otherwise install newer libraries for this PR and for PR #266?

@@ -35,6 +35,10 @@ elif [ "$L_MACHINE" = odin ]; then
 module --initial_load --no_redirect restore
 export MODULEPATH="/oldscratch/ywang/external/hpc-stack/modulefiles/mpi/intel/2020/cray-mpich/7.7.16:/oldscratch/ywang/external/hpc-stack/modulefiles/compiler/intel/2020:/oldscratch/ywang/external/hpc-stack/modulefiles/core:/oldscratch/ywang/external/hpc-stack/modulefiles/stack:/opt/cray/pe/perftools/21.02.0/modulefiles:/opt/cray/ari/modulefiles:/opt/cray/pe/craype-targets/default/modulefiles:/opt/cray/pe/modulefiles:/opt/cray/modulefiles:/opt/modulefiles"

+elif [ "$L_MACHINE" = wcoss2 ]; then
+module purge
+module load envvar/1.0
Collaborator

Can this go in the modulefiles instead? Unless it is loading Lmod itself, this is something that can be loaded in a modulefile, and I think it should go there.

chan-hoo (Collaborator Author), May 25, 2022

@danielabdi-noaa, when a user logs in to wcoss2, the module envvar/1.0 is loaded automatically by default. The wcoss2 user wiki says the following in the “Lmod and User Environments” section: “Please note that when doing a module purge you need to reload envvar, module load envvar/1.0, in your environment (interactive or batch) to get the appropriate default modules.” NCO has implemented the standard of doing a “module reset”, which performs the module purge and the module load of envvar/1.0 in one step. So I think this command is better placed here. What do you think?
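To make the equivalence described above concrete, here is a runnable sketch. The `module` function below is a stub standing in for Lmod so the example works off WCOSS2; only the `envvar/1.0` module name and the purge/reset behavior come from the comment above, and on a real WCOSS2 login the actual `module` command would be used.

```shell
# A stub `module` function stands in for Lmod so this sketch runs anywhere;
# on WCOSS2 the real `module` command would be invoked instead.
module() { printf 'module %s\n' "$*"; }

# Two-step form from the wcoss2 user wiki: after a purge, envvar/1.0 must be
# reloaded to restore the appropriate default environment.
wcoss2_purge_and_reload() {
    module purge
    module load envvar/1.0
}

# One-step NCO-standard form that combines both operations.
wcoss2_reset() {
    module reset
}

wcoss2_reset   # prints: module reset
```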

Collaborator

I think module load envvar/1.0 can go at the top of build_wcoss2_*. Have you tested this change and found it to cause a problem? It looks like it is loading default modules, which should be no different from loading SRW's custom modulefiles. It is not changing the module app, as is the case for example on gaea/odin (going from Cray to Lmod) or on macOS (where you have no Lmod when you start). Also, the way you have it now, if the user does a regular module purge on the command line, he has to follow it up with module load envvar/1.0; that would be avoided if you put the latter in build_wcoss2. But for this to work, you first have to test that loading envvar in build_wcoss2 works in the first place ...

Collaborator

Once again, machine-specific things need to be isolated to the machine-specific configuration. I agree with @danielabdi-noaa. This belongs in the wcoss2 configuration modulefile. It doesn't make sense here, and it pollutes the code with machine-specific things in places users and developers don't expect.

Collaborator Author (chan-hoo)

@danielabdi-noaa, okay, I'll test it and move it to build_wcoss2_* unless it causes any issues.

Collaborator

If you are putting it in the build_wcoss* modulefile, then you can do whatever you want there, as that is completely isolated behavior for that particular machine configuration. My objection is to the test for the machine name in a non-machine-specific config/module file.

Collaborator Author (chan-hoo)

Do you mean you don't like module reset, either? It is an official NCO standard on wcoss2.

Collaborator

There are really two problems. First, manipulation of modules belongs in the machine-specific modulefiles. The other big problem is the elif [ "$L_MACHINE" = wcoss2 ]. That is bad design and should be fixed: the conditional tests in this file for which machine you are on need to be moved from the code/scripts into machine-specific configurations. But it appears that you are just following the existing template here, so I can approve this as is for now, because fixing this design problem is beyond the scope of your proposed changes.
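One way to remove the machine-name conditionals objected to above is to dispatch to per-machine setup files. This is only an illustrative sketch: the `etc/machines` layout, the file names, and the `setup_machine` helper are hypothetical, not part of the repository.

```shell
# Hypothetical refactor: lmod-setup.sh would source one machine-specific
# file instead of testing $L_MACHINE in a long if/elif chain.
MACHINE_DIR="${MACHINE_DIR:-etc/machines}"

setup_machine() {
    machine=$1
    cfg="$MACHINE_DIR/$machine.sh"
    if [ -f "$cfg" ]; then
        # All module manipulation (purge, envvar, reset, ...) lives in the
        # sourced per-machine file, not in this shared script.
        . "$cfg"
    else
        echo "Unsupported machine: $machine" >&2
        return 1
    fi
}
```

With this layout, supporting a new platform means adding one file (e.g. a hypothetical etc/machines/wcoss2.sh containing module reset) rather than editing the shared script.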

Collaborator Author (chan-hoo)

@christopherwharrop-noaa, now I see what your point is: the if-statement itself in the 'lmod-setup.sh' file causes the issues you mentioned. Is this correct? I thought you were talking about the command 'module load envvar'. I'll remove 'wcoss2' from the if-statement. Sorry for my late grasp! :)

Collaborator

@chan-hoo, I think it is OK both ways. It is desirable if sourcing etc/lmod-setup.sh/csh can be avoided; one less step. On most systems that is actually the case, and all you get if you do source it is the module purge in the else branch. In my opinion, those two etc files should be restricted to setting up Lmod, and loading default modules (envvar) can be considered part of setting up Lmod. If it were up to me, I would rather have the user load Lmod manually before building SRW :), but I know your stand on this. Thanks for addressing the requests.

Comment on lines 37 to 39
else if ( "$L_MACHINE" == wcoss2 ) then
module reset

Collaborator

@chan-hoo - This file is problematic from a design point of view because it is unscalable, unmaintainable, unportable, and untestable. This logic should be moved into some kind of individual machine-specific configuration file. But since you are just following the existing template, I can't argue with this change right now. It should, however, be flagged as something that needs rethinking with respect to the overall design.

Collaborator Author (chan-hoo)

@christopherwharrop-noaa, thank you for the comment. If you still think it is better to separate the two module commands and move envvar to build_wcoss2_intel, please let me know. I'll discuss this with my EMC colleagues.

natalie-perlin (Collaborator)

@gspetro-NOAA - is WCOSS2 being supported as a Level 1 system? It seems like WCOSS2 has been removed from the supported platforms, following the discussion at the SRW App meeting on May 24, 2022 (but I may be wrong).

chan-hoo (Collaborator Author)

@natalie-perlin, WCOSS1 (wcoss_dell_p3) will go away in June, so it will be removed from the SRW App soon (maybe in July). Instead, WCOSS2 (cactus/dogwood) will be on the list of tier-1 platforms.

mark-a-potts (Collaborator)

> @natalie-perlin, WCOSS1 (wcoss_dell_p3) will go away in June. So it will be removed from SRW App soon (maybe in July). Instead, WCOSS2 (cactus/dogwood) will be on the list of the tier-1 platforms.

For the release, I think we should leave WCOSS2 off. It is not yet operational, and I am not sure we have all the required conda packages installed there yet. If you like, you can add a “coming soon...” note for it, though.

chan-hoo (Collaborator Author)

@mark-a-potts, yes, I agree. Neither WCOSS1 (wcoss_dell_p3) nor WCOSS2 will be available in this release version.

BenjaminBlake-NOAA (Collaborator) left a comment

I was able to build the codes successfully on WCOSS2.

@chan-hoo chan-hoo merged commit 9d159ca into ufs-community:develop May 31, 2022
@chan-hoo chan-hoo deleted the feature/wcoss2 branch June 1, 2022 11:45
mark-a-potts added a commit that referenced this pull request Jun 12, 2022
* Fix artifact creation for branches with / in name (#275)

The slash in branch names such as release/.* and feature/.* is
interpreted as a directory separator. This change replaces the "/"
character with a "_".

* Include documentation for building/running SRW App on Mac (#240)

* updated docs

* added git submodule

* fix formatting

* added new submodule commits

* fixed ref links

* finished Intro

* finish Components & Intro edits

* edited Rocoto workflow section of Quickstart

* added minor hpc submodule commits

* Updates to Rocoto Workflow in Quick Start

* add to HPC-stack intro

* submodule updates

* added submodule docs edits

* hpc-stack updates & formatting fixes

* hpc-stack intro edits

* bibtex attempted fix

* add hpc-stack module edits

* update sphinxcontrib version

* add .readthedocs.yaml file

* update .readthedocs.yaml file

* update .readthedocs.yaml file

* update conf.py

* updates .readthedocs.yaml with submodules

* updates .readthedocs.yaml with submodules

* submodule updates

* submodule updates

* minor Intro edits

* minor Intro edits

* minor Intro edits

* submodule updates

* fixed typos in QS

* QS updates

* QS updates

* QS updates

* updates to InputOutput and QS

* fix I/O doc typos

* pull updates to hpc-stack docs

* pull updates to hpc-stack docs

* fix table wrapping

* updates to QS for cloud

* fix QS export statements

* fix QS export statements

* QS edits on bind, config

* add bullet points to notes

* running without rocoto

* add HPC-Stack submodule w/docs

* split QS into container/non-container approaches

* added filepath changes for running in container on Orion, et al.

* edits to overview and container QS

* moved CodeReposAndDirs.rst info to the Introduction & deleted file

* continued edits to SRWAppOverview

* combine overview w/non-container docs

* finish merging non-container guide & SRWOverview, rename/remove files, update FAQ

* minor edits for Intro & QS

* updates to BuildRun doc through 3.8.1

* edits to Build/Run and Components

* remove .gitignore

* fix Ch 3 title, 4 supported platform levels note

* fix typos, add term links

* other minor fixes/suggestions implemented

* updated Intro based on feedback; changed SRW to SRW App throughout

* update comment to Intro citation

* add user-defined vertical levels to future work

* Add instructions for srw_common module load

* fix typo

* update Intro & BuildRunSRW based on Mark's feedback

* minor intro updates

* 1st round of jwolff's edits

* 2nd round of jwolff updates

* update QS intro

* fix minor physics details

* update citation and physics suite name

* add compute node allocation info to QS

* add authoritative hpc-stack docs to Intro

* create MacOS install/build instructions

* add MacOS Build/Run instructions

* fix MacOS Build/Run details

* add MacOS info directly to Build/Run SRW chapter

* minor details

* minor edits

* update Include-HPCInstall with mac installation docs

* add note re: Terminal.app & bash shell in MacOS section

* remove MacInstall file-contents added to BuildRunSRW

* update hpc-stack submodule to include mac installation info

* add MacOS config details

* add MacOS config & run details

* minor MacOS note

* mention need to verify software library version #'s

* update hpc-stack-mod

* align MacDetails section with PR #238 info

* remove gsed & alter related commands

* update hpc-stack submodule

* typos

* switch from env to module load

Co-authored-by: Will Mayfield <59745143+willmayfield@users.noreply.github.com>

* Update BuildRunSRW.rst

* update hpc-stack module docs & MacOS config.sh

* update machine file instructions

* updates to BuildRun chapter

* fix typo

Co-authored-by: gspetro <gillian.s.petro@gmail.com>
Co-authored-by: Will Mayfield <59745143+willmayfield@users.noreply.github.com>

* Update documentation for CSV file containing WE2E test info (#278)

* Edits to documentation to match latest in code.

* Edits to documentation to match latest in code.

* Minor changes to documentation.

* Port SRW App to WCOSS2 (#270)

* port to wcoss2

* update scripts

* ensure platform name from variance

* update scripts

* remove wcoss2 from lmod-setup

* Fix --ccpp option in devbuild.sh (#280)

* fix ccpp issue

* update script

* Print usage if machine name is not passed to `lmod-setup.sh/csh`. (#262)

* Print usage() message if machine name is not passed.

* Bug fix with macos modulefiles.

* Back to original hashes (#287)

* Updated the Introduction, build for MacOS (#281)

* Updated the Introduction, build for MacOS

* Update Introduction.rst

Some comments removed

* Update Introduction.rst

Updated data storage requirements for the SRW

* Update documentation for python plotting scripts (#289)

## DESCRIPTION OF CHANGES: 
This PR updates the RST documentation files to accompany changes in PR #[783](ufs-community/regional_workflow#783) in `regional_workflow`.

## DEPENDENCIES:
PR #[783](ufs-community/regional_workflow#783) in `regional_workflow`.

## ISSUE (optional): 
Partially resolves issue #[781](ufs-community/regional_workflow#781) (still need to introduce tests in `regional_workflow` that run the plotting scripts).

## CONTRIBUTORS: 
@mark-a-potts

* Updating for testing with updated ufs weather model

* added version to wflow for jet

* switched to epic repo for testing

* Updates to modules for cheyenne

* updated module path

* changed hash to branch for testing

* updated to use ufs-community version of WM

* Pointing to PR-branch of regional_workflow for testing

* Updating module files for cheyenne for testing

* Added png explicitly under gaea.

* Update build_cheyenne_intel

Corrected loading system modules (load intel/2022.1, mpt/2.25), updated package versions if newer modules are built.

* Updated pointer to model for testing

* fixed repos

* updated to point at my regional_workflow branch

* updated regional_workflow to release/public-v2

Co-authored-by: Jesse McFarland <jesse@mcfarland.sh>
Co-authored-by: Gillian Petro <96886803+gspetro-NOAA@users.noreply.github.com>
Co-authored-by: gspetro <gillian.s.petro@gmail.com>
Co-authored-by: Will Mayfield <59745143+willmayfield@users.noreply.github.com>
Co-authored-by: gsketefian <31046882+gsketefian@users.noreply.github.com>
Co-authored-by: Chan-Hoo.Jeon-NOAA <60152248+chan-hoo@users.noreply.github.com>
Co-authored-by: danielabdi-noaa <52012304+danielabdi-noaa@users.noreply.github.com>
Co-authored-by: Natalie Perlin <68030316+natalie-perlin@users.noreply.github.com>
Co-authored-by: Mark Potts <mpotts@redlineperf.com>