
Identify who maintains stack where and establish a process for updating the stacks #83

Closed
aerorahul opened this issue Nov 18, 2020 · 29 comments
Assignees
Labels
enhancement New feature or request

Comments

@aerorahul

Clearly identify who officially maintains a stack on each machine.
There can also be a designated back-up.

Establish a process for updating a stack and the versioning that goes with it:

  • Is the update to a single package with no downstream dependencies in the stack, e.g. ESMF?
  • Is the update to a single package with downstream dependencies, e.g. HDF5 or netCDF?
  • Is the update to the stack due to a change in the compiler + MPI combination?

And more.
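For the second case above (a package with downstream dependents), the rebuild order can be derived from the dependency graph. A minimal sketch using `tsort` — the package pairs below are illustrative, not the stack's actual dependency graph:

```shell
# Sketch: derive a rebuild order after updating a package that has downstream
# dependents. Each input line is "dependency dependent"; tsort emits a
# topological order. The pairs here are illustrative only.
rebuild_order() {
  tsort <<'EOF'
hdf5 netcdf-c
netcdf-c netcdf-fortran
netcdf-fortran esmf
EOF
}
rebuild_order
```

Everything listed after the updated package would then need to be rebuilt against it.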

@edwardhartnett edwardhartnett self-assigned this Nov 19, 2020
@edwardhartnett

Doesn't @GeorgeVandenberghe-NOAA usually do this?

@GeorgeVandenberghe-NOAA

@edwardhartnett

OK, great opportunity to identify some back-ups to @GeorgeVandenberghe-NOAA !

@edwardhartnett edwardhartnett added the enhancement New feature or request label Nov 19, 2020
@edwardhartnett

Can we start with an exhaustive list of machines we are responsible for installing hpc-stack on? @GeorgeVandenberghe-NOAA, which machines would you install on?

@kgerheiser

Hang and I usually install hpc-stack.

Orion, Hera, Jet, WCOSS-Dell

@edwardhartnett

Just those 4 machines then? Where do you install it? That is, under what root directory?

@kgerheiser

Hang and I have kind of been doing it ad hoc. I think I installed it on Hera and Jet, and he did WCOSS and Orion.

I think he and I should split up which machines we're responsible for and formally document that.

@GeorgeVandenberghe-NOAA

@edwardhartnett

The README would be a good place to document this. We already list the authors and code manager; we could add an "Installers" section.

@kgerheiser

It can be built on systems without Lmod, but then you don't get modulefiles.
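Without Lmod, a user would have to point the environment at the install by hand instead of running `module load`. A hedged sketch — the install prefix below is hypothetical, not an actual hpc-stack location:

```shell
# Without modulefiles, wire up the environment manually.
# PREFIX is hypothetical; substitute the real hpc-stack install prefix.
PREFIX=/opt/hpc-stack/intel-18.0.5/impi-2018.4
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:${LD_LIBRARY_PATH:-}"
export CMAKE_PREFIX_PATH="$PREFIX:${CMAKE_PREFIX_PATH:-}"
```

This is exactly the bookkeeping the generated modulefiles exist to automate.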

@climbfuji

climbfuji commented Nov 20, 2020 via email

@kgerheiser

@climbfuji Hang or I can do Jet. We've been maintaining a build of hpc-stack there.

@climbfuji

@kgerheiser this would be great, and a necessary first step to make jet a tier-1 platform. @arunchawla-NOAA created an issue for this work here: ufs-community/ufs-weather-model#271 - once you install the stack on jet, can you please let the ufs-weather-model code managers (@junwang-noaa, @DusanJovic-NOAA, myself) know so that we can update the modulefile?

Going forward, we should continue the discussion and work towards making jet a tier-1 platform in the ufs-weather-model issue 271.

@GeorgeVandenberghe-NOAA

@climbfuji

Yes, hpc-stack does not use the nightmare flag -xHOST, which makes this possible.

The fact that jet has different node types and hardware is one reason why we need to make it a tier-1 platform - we need to make sure that our codes function in such an environment.

The ufs-weather-model currently works around the default AVX2 flags by compiling the model with multiple SIMD instruction sets on jet:

    elseif(SIMDMULTIARCH)
        set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -axSSE4.2,AVX,CORE-AVX2,CORE-AVX512 -qno-opt-dynamic-align")
        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -axSSE4.2,AVX,CORE-AVX2,CORE-AVX512 -qno-opt-dynamic-align")

While this provides flexibility, it makes compiling a lot slower. We may consider other options such as only specifying -axSSE4.2,CORE-AVX2 or turning off SIMD instructions entirely on jet. TBD.
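As an aside on that trade-off, the choice of `-ax` target ultimately depends on which SIMD features a node's CPU reports. A small illustrative helper (not part of ufs-weather-model's build system) that maps a `/proc/cpuinfo`-style flags string to the newest matching target:

```shell
# Illustrative only: pick the newest Intel -ax target named in a CPU flags
# string, as reported in /proc/cpuinfo. Not from the model's build system.
pick_simd_target() {
  case " $1 " in
    *" avx512f "*) echo CORE-AVX512 ;;
    *" avx2 "*)    echo CORE-AVX2 ;;
    *" avx "*)     echo AVX ;;
    *)             echo SSE4.2 ;;
  esac
}
pick_simd_target "fpu sse sse4_2 avx avx2"
```

On a heterogeneous machine like jet, different partitions would yield different answers, which is the crux of the multi-target flags above.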

The rt.sh scripts currently compile and run on xjet, but there is no reason to keep doing this. We could run some tests on xjet, some on kjet, some on whatever-jet. TBD.

@GeorgeVandenberghe-NOAA

@edwardhartnett

edwardhartnett commented Nov 21, 2020

Here's a summary from the comments above:

| Machine | Programmer |
| --- | --- |
| Orion | Kyle |
| Hera | Hang |
| Jet | Kyle |
| WCOSS-Dell | Hang |
| cheyenne | Dom |
| gaea | Dom |
| WCOSS-Cray | Hang |

Is that all of them?

I would suggest that we tag the release; then everyone installs it and reports back with either success or problems.

If there are problems, we hold the release, resolve the problems, and move the tag to the fixed release.

Once there are no problems and we are all happy with the release, we announce it, and move on to planning of the 1.2.0 release.
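The "tag, verify, move the tag if needed" flow described above can be sketched with plain git, demonstrated here in a throwaway repository (tag name and commits are hypothetical, not the actual hpc-stack history):

```shell
# Sketch of tagging a release candidate and moving the tag to a fixed commit,
# in a disposable repo. Tag and commit messages are hypothetical.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "release candidate"
git -C "$repo" tag v1.1.0            # mark the release
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "fix found during install testing"
git -C "$repo" tag -f v1.1.0         # move the tag to the fixed commit
```

In a shared repository the moved tag would also need `git push --force origin v1.1.0`, which is only safe before the release is announced — moving a published tag confuses anyone who already fetched it.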

@edwardhartnett

Elsewhere @mark-a-potts mentions a system called "acorn". Is that a NOAA system? Mark, do you want to try out our 1.1.0 release before we announce it? Or do you want to try building the develop branch?

@Hang-Lei-NOAA

Acorn is the name of the WCOSS2 machine.

@aerorahul

Let's leave WCOSS2 (Acorn) out of this release.

@edwardhartnett

OK I've added an issue for acorn and assigned it to the next release (1.2.0).

@arunchawla-NOAA

Maybe we should create a milestone for 1.2.0 and identify issues to address for it? We need to add the METplus libraries before we roll it out on WCOSS2. Has the MET team created an issue for that?

@edwardhartnett

@arunchawla-NOAA to add an issue to the next release, use the "Project" pull-down on the right side of the issue screen.

At each weekly meeting we will examine the issue list for the next release, and also place any new issues into a release. For release planning for the 1.2.0 release, see: https://github.com/NOAA-EMC/hpc-stack/projects/2

(New issues can also be added from this screen, or selected from the issue list and added to the release with the Add Cards button on upper right.)

There is as yet no issue for the METplus libraries, and I will add that now.

@edwardhartnett

edwardhartnett commented Nov 21, 2020

(@arunchawla-NOAA for release planning of the upcoming 1.1.0 release see https://github.com/NOAA-EMC/hpc-stack/projects/1).

@junwang-noaa

junwang-noaa commented Nov 21, 2020 via email

@kgerheiser

Hang will take care of WCOSS Cray.

@DusanJovic-NOAA

Please reinstall hpc-stack on WCOSS2. It's broken after they renamed /lfs/h2 to /lfs/h1.

$ module show hpc/1.0.0-beta1 
------------------------------------------------------------------------------------------------------ 
  /lfs/h1/emc/nceplibs/noscrub/hpc-stack/test/noaa/modulefiles/stack/hpc/1.0.0-beta1.lua: 
------------------------------------------------------------------------------------------------------ 
help([[]]) 
conflict("hpc") 
setenv("HPC_OPT","/lfs/h2/emc/nceplibs/noscrub/hpc-stack/test/noaa") 
prepend_path("MODULEPATH","/lfs/h2/emc/nceplibs/noscrub/hpc-stack/test/noaa/modulefiles/core") 
setenv("LMOD_EXACT_MATCH","no") 
setenv("LMOD_EXTENDED_DEFAULT","yes") 
whatis("Name: hpc") 
whatis("Version: 1.0.0-beta1") 
whatis("Category: Base") 
whatis("Description: Initialize HPC software stack")

MODULEPATH still points to /lfs/h2.
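Until the stack is reinstalled, rewriting the stale prefix in the generated modulefiles would be a stop-gap. A sketch, demonstrated on a scratch copy rather than the real install (editing the live modulefiles should be left to the maintainers; `sed -i` as written assumes GNU sed):

```shell
# Demonstrated on a scratch copy: rewrite the stale /lfs/h2 prefix to /lfs/h1
# in a generated modulefile. Reinstalling the stack is the proper fix.
work=$(mktemp -d)
cat > "$work/hpc.lua" <<'EOF'
setenv("HPC_OPT","/lfs/h2/emc/nceplibs/noscrub/hpc-stack/test/noaa")
prepend_path("MODULEPATH","/lfs/h2/emc/nceplibs/noscrub/hpc-stack/test/noaa/modulefiles/core")
EOF
sed -i 's|/lfs/h2/|/lfs/h1/|g' "$work/hpc.lua"
grep -c '/lfs/h1/' "$work/hpc.lua"
```

The same substitution over the stack's modulefile tree would unbreak `module show` until a clean reinstall lands.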

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 25, 2020 via email
