Enable cycling support for Gaea C6 (#3323)
Add support for the Gaea clusters by enabling cycled experiments
on C6 and addressing a number of issues on both clusters (C5 and C6).

Support for HPSS on C6 is also included, but it should NOT be
used at this time. The ES cluster, where HPSS connections are made,
has a significant issue with the F6 (C6's filesystem) mount that makes
the system run extremely slowly during filesystem-intensive operations
(such as htar). There is a plan to enable this feature more
broadly in the near future.

Included changes:
- Fixed memory variable unsetting for Gaea C5/6 in config.resources.GAEAC{5,6}
- Refactored the system-level parameter detection used when determining task
resources in the setup scripts, making it easier to define multiple
partitions, queues, and clusters.
- Added a `DTN` partition, queue, and cluster definition.
- Added missing tasks and renamed misnamed tasks in tasks.py, and added a
check that the input task is valid.

Refs #2905
Refs #3108
Refs #3133 
Refs #3261
Refs #3269 
Refs #3324
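
The last item in the change list mentions a check that the input task is valid. The tasks.py change itself is not shown on this page, so the following is only a minimal sketch of what such a check might look like. The registry `VALID_TASKS`, its contents, and the function name `validate_task` are assumptions made for illustration, not the repository's actual code.

```python
# Hypothetical task registry. The step names are borrowed from the diff
# ("prep", "prepbufr", "fcst", "efcs"); the real list lives in tasks.py.
VALID_TASKS = frozenset({"prep", "prepbufr", "fcst", "efcs"})


def validate_task(task_name: str, valid_tasks=VALID_TASKS) -> str:
    """Return task_name unchanged if it is known, otherwise raise KeyError."""
    if task_name not in valid_tasks:
        known = ", ".join(sorted(valid_tasks))
        raise KeyError(f"'{task_name}' is not a valid task; known tasks: {known}")
    return task_name
```

Failing early with the list of known names makes a misnamed task visible at setup time rather than when a job is submitted.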
DavidHuber-NOAA authored Mar 3, 2025
1 parent bbd5cca commit b152424
Showing 44 changed files with 412 additions and 155 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci_unit_tests.yaml
Original file line number Diff line number Diff line change
@@ -47,7 +47,7 @@ jobs:
run: |
sudo mkdir -p /scratch1/NCEPDEV
cd $GITHUB_WORKSPACE/sorc
git submodule update --init
git submodule update --init -j 9
./link_workflow.sh
cd $GITHUB_WORKSPACE/ci/scripts/tests
ln -s ../wxflow
3 changes: 2 additions & 1 deletion ci/cases/pr/C48_S2SW_extended.yaml
@@ -15,7 +15,8 @@ arguments:

skip_ci_on_hosts:
- hera
- gaea
- gaeac5
- gaeac6
- orion
- hercules
- wcoss2 # TODO run on WCOSS2 once the gfs_waveawipsbulls job is fixed
3 changes: 2 additions & 1 deletion ci/cases/pr/C48mx500_3DVarAOWCDA.yaml
@@ -19,5 +19,6 @@ arguments:

skip_ci_on_hosts:
- wcoss2
- gaea
- gaeac6
- gaeac5
- orion
3 changes: 2 additions & 1 deletion ci/cases/pr/C48mx500_hybAOWCDA.yaml
@@ -20,5 +20,6 @@ arguments:

skip_ci_on_hosts:
- wcoss2
- gaea
- gaeac5
- gaeac6
- orion
3 changes: 2 additions & 1 deletion ci/cases/pr/C96C48_hybatmaerosnowDA.yaml
@@ -20,5 +20,6 @@ arguments:
skip_ci_on_hosts:
- wcoss2
- orion
- gaea
- gaeac5
- gaeac6
- hercules
3 changes: 2 additions & 1 deletion ci/cases/pr/C96C48_ufs_hybatmDA.yaml
@@ -18,7 +18,8 @@ arguments:
yaml: {{ HOMEgfs }}/ci/cases/yamls/ufs_hybatmDA_defaults.ci.yaml

skip_ci_on_hosts:
- gaea
- gaeac5
- gaeac6
- orion
- hercules
- wcoss2
3 changes: 2 additions & 1 deletion ci/cases/pr/C96_atm3DVar_extended.yaml
@@ -18,6 +18,7 @@ arguments:

skip_ci_on_hosts:
- hera
- gaea
- gaeac5
- gaeac6
- orion
- hercules
3 changes: 3 additions & 0 deletions ci/cases/pr/C96mx100_S2S.yaml
@@ -17,3 +17,6 @@ arguments:
icsdir: {{ 'ICSDIR_ROOT' | getenv }}/C96mx100/20240610
yaml: {{ HOMEgfs }}/ci/cases/yamls/sfs_defaults.yaml

skip_ci_on_hosts:
- gaeac6
- gaeac5
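
The CI case files above replace the single `gaea` entry with separate `gaeac5` and `gaeac6` entries. As a rough sketch of how a `skip_ci_on_hosts` list could be consulted, assuming exact, case-insensitive host-name matching (how the CI scripts actually match hosts is not shown in this diff, and `should_skip_case` is a hypothetical helper):

```python
from typing import Any, Mapping


def should_skip_case(case_config: Mapping[str, Any], host: str) -> bool:
    """Return True when host is listed under skip_ci_on_hosts (exact match)."""
    skip_hosts = case_config.get("skip_ci_on_hosts") or []
    return host.lower() in {str(name).lower() for name in skip_hosts}


# Mirrors the list added to C96mx100_S2S.yaml in this commit.
c96mx100_s2s = {"skip_ci_on_hosts": ["gaeac6", "gaeac5"]}

print(should_skip_case(c96mx100_s2s, "gaeac6"))  # True
print(should_skip_case(c96mx100_s2s, "hera"))    # False
print(should_skip_case(c96mx100_s2s, "gaea"))    # False
```

Under exact matching, a bare `gaea` entry would match neither cluster, which is consistent with each case file now naming `gaeac5` and `gaeac6` explicitly.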
2 changes: 1 addition & 1 deletion ci/platforms/config.gaeac6
@@ -1,7 +1,7 @@
#!/usr/bin/bash

export GFS_CI_ROOT=/gpfs/f6/drsa-precip3/scratch/${USER}/GFS_CI_ROOT
export ICSDIR_ROOT=/gpfs/f6/bil-fire8/world-shared/global/glopara/data/ICSDIR
export ICSDIR_ROOT=/gpfs/f6/drsa-precip3/world-shared/role.glopara/data/ICSDIR
export HPC_ACCOUNT=drsa-precip3
export max_concurrent_cases=5
export max_concurrent_pr=4
2 changes: 1 addition & 1 deletion ci/scripts/utils/launch_java_agent.sh
@@ -1,4 +1,4 @@
#!/bin/env bash
#!/usr/bin/env bash

set -e

8 changes: 4 additions & 4 deletions docs/source/hpc.rst
@@ -147,17 +147,17 @@ The Global Workflow provides capabilities for deterministic and ensemble forecas
-
- X
* - Gaea C6
- 3
- 1
- X
- X
- X
- X
- X
-
-
-
-
- X
-
-
- X
-
- X
* - AWS (PW)
2 changes: 1 addition & 1 deletion docs/source/setup.rst
@@ -259,4 +259,4 @@ Example:
Step 4: Confirm files from setup scripts
****************************************

You will now have a rocoto xml file in your EXPDIR ($PSLOT.xml) and a crontab file generated for your use. Rocoto uses CRON as the scheduler. If you do not have a crontab file you may not have had the rocoto module loaded. To fix this load a rocoto module and then rerun ``setup_xml.py`` script again. Follow directions for setting up the rocoto cron on the platform the experiment is going to run on.
You will now have a rocoto xml file in your EXPDIR ($PSLOT.xml) and a crontab file generated for your use. Rocoto uses CRON or SCRON as the scheduler. If you do not have a crontab file, you may not have had the rocoto module loaded. To fix this, load a rocoto module and then rerun the ``setup_xml.py`` script. Follow the directions for setting up the rocoto cron on the platform the experiment is going to run on.
29 changes: 22 additions & 7 deletions docs/source/start.rst
@@ -18,31 +18,46 @@ The first jobs of your run should now be queued or already running (depending on

You'll now have a "logs" folder in both your ``ROTDIR`` and ``EXPDIR``. The EXPDIR log folder contains workflow log files (e.g. rocoto command results) and the ``ROTDIR`` log folder will contain logs for each job (previously known as dayfiles).

^^^^^^^^^^^^^^^^^^^^^^^^^^^
Set up your experiment cron
^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Set up your experiment cron or scron
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most systems allow users to write to their crontabs. However, some systems, like Gaea, require the use of scron. The setup is very similar; the only differences are the command (crontab or scrontab) and the entry.


.. note::
Orion and Hercules currently only support cron on Orion-login-1 and Hercules-login-1, respectively. Cron support for other login nodes is coming in the future.

::

crontab -e
(crontab|scrontab) -e

or

::

crontab $PSLOT.crontab
(crontab|scrontab) $PSLOT.crontab

.. warning::

The ``crontab $PSLOT.crontab`` command will overwrite existing crontab file on your login node. If running multiple crons recommend editing crontab file with ``crontab -e`` command.
The ``(crontab|scrontab) $PSLOT.crontab`` command will overwrite the existing crontab/scrontab file on your login node. If you are running multiple crons, it is recommended to edit the crontab/scrontab file with the ``(crontab|scrontab) -e`` command instead.

Check your crontab settings::

crontab -l
(crontab|scrontab) -l

Crontab uses following format::

*/5 * * * * /path/to/rocotorun -w /path/to/workflow/definition/file -d /path/to/workflow/database/file

Scrontab instead launches a script and requires SCRON directives to launch an sbatch job with the following format::

#SCRON --partition=<cron partition>
#SCRON --account=<your account>
#SCRON --mail-user=<your email (optional)>
#SCRON --dependency=singleton
#SCRON --job-name=${PSLOT}_cron
#SCRON --output=/path/to/EXPDIR/logs/scron.log
#SCRON --time=00:10:00

*/5 * * * * /path/to/rocoto/launch/script
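
To make the scrontab layout above concrete, here is a small sketch that renders such an entry from a few inputs. The function `render_scrontab`, its parameters, and the example values (partition `cron`, account `myacct`) are assumptions for illustration only; this is not how `setup_xml.py` generates the file.

```python
from typing import Optional


def render_scrontab(
    pslot: str,
    partition: str,
    account: str,
    expdir: str,
    launch_script: str,
    minutes: int = 5,
    email: Optional[str] = None,
) -> str:
    """Build an scrontab entry in the format documented above."""
    directives = [f"--partition={partition}", f"--account={account}"]
    if email:
        # The mail-user directive is optional, so emit it only when given.
        directives.append(f"--mail-user={email}")
    directives += [
        "--dependency=singleton",
        f"--job-name={pslot}_cron",
        f"--output={expdir}/logs/scron.log",
        "--time=00:10:00",
    ]
    lines = [f"#SCRON {directive}" for directive in directives]
    lines.append("")
    lines.append(f"*/{minutes} * * * * {launch_script}")
    return "\n".join(lines) + "\n"


print(
    render_scrontab(
        pslot="test",
        partition="cron",  # hypothetical partition name
        account="myacct",  # hypothetical account
        expdir="/path/to/EXPDIR",
        launch_script="/path/to/rocoto/launch/script",
    )
)
```

The `--dependency=singleton` directive, combined with a job name unique to the experiment, is what keeps a new cron invocation from starting while a previous one is still running.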
2 changes: 1 addition & 1 deletion env/GAEAC6.env
@@ -38,7 +38,7 @@ case ${step} in
"prep" | "prepbufr")

export POE="NO"
export BACK=${BACK:-"YES"}
export BACK="NO"
export sys_tp="GAEAC6"
export launcher_PREP="srun"
;;
3 changes: 3 additions & 0 deletions modulefiles/module_base.gaeac5.lua
@@ -44,5 +44,8 @@ load(pathJoin("prepobs", (os.getenv("prepobs_run_ver") or "None")))
prepend_path("MODULEPATH", pathJoin("/gpfs/f5/ufs-ard/world-shared/global/glopara/data/git/Fit2Obs/v" .. (os.getenv("fit2obs_ver") or "None"), "modulefiles"))
load(pathJoin("fit2obs", (os.getenv("fit2obs_ver") or "None")))

local hsi_mod_path=(os.getenv("hsi_mod_path") or "None")
append_path("MODULEPATH", hsi_mod_path)
load(pathJoin("hsi", (os.getenv("hsi_ver") or "None")))

whatis("Description: GFS run setup environment")
12 changes: 10 additions & 2 deletions modulefiles/module_base.gaeac6.lua
@@ -36,12 +36,20 @@ load(pathJoin("metplus", (os.getenv("metplus_ver") or "None")))
load(pathJoin("py-xarray", (os.getenv("py_xarray_ver") or "None")))

setenv("WGRIB2","wgrib2")

-- Stop gap fix for wgrib with spack-stack 1.6.0
-- TODO Remove this when spack-stack issue #1097 is resolved
setenv("WGRIB","wgrib")
setenv("UTILROOT",(os.getenv("prod_util_ROOT") or "None"))

prepend_path("MODULEPATH", pathJoin("/gpfs/f6/bil-fire8/world-shared/global/glopara/git/prepobs/v" .. (os.getenv("prepobs_run_ver") or "None"), "modulefiles"))
prepend_path("MODULEPATH", pathJoin("/gpfs/f6/drsa-precip3/world-shared/role.glopara/git/prepobs/v" .. (os.getenv("prepobs_run_ver") or "None"), "modulefiles"))
load(pathJoin("prepobs", (os.getenv("prepobs_run_ver") or "None")))

prepend_path("MODULEPATH", pathJoin("/gpfs/f6/bil-fire8/world-shared/global/glopara/git/Fit2Obs/v" .. (os.getenv("fit2obs_ver") or "None"), "modulefiles"))
prepend_path("MODULEPATH", pathJoin("/gpfs/f6/drsa-precip3/world-shared/role.glopara/git/Fit2Obs/v" .. (os.getenv("fit2obs_ver") or "None"), "modulefiles"))
load(pathJoin("fit2obs", (os.getenv("fit2obs_ver") or "None")))

local hsi_mod_path=(os.getenv("hsi_mod_path") or "None")
append_path("MODULEPATH", hsi_mod_path)
load(pathJoin("hsi", (os.getenv("hsi_ver") or "None")))

whatis("Description: GFS run setup environment")
4 changes: 2 additions & 2 deletions modulefiles/module_gwsetup.gaeac6.lua
@@ -2,8 +2,8 @@ help([[
Load environment to run GFS workflow setup scripts on Gaea C6
]])

prepend_path("MODULEPATH", "/ncrc/proj/epic/rocoto/modulefiles")
load(pathJoin("rocoto"))
prepend_path("MODULEPATH", "/autofs/ncrc-svm1_proj/hurr1/hafs/shared/modulefiles")
load(pathJoin("rocoto", "1.3.7_fix"))

prepend_path("MODULEPATH", "/ncrc/proj/epic/spack-stack/c6/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core")

5 changes: 5 additions & 0 deletions parm/config/gefs/config.base
@@ -15,9 +15,14 @@ export RUN_ENVIR="emc"
export ACCOUNT="@ACCOUNT@"
export QUEUE="@QUEUE@"
export QUEUE_SERVICE="@QUEUE_SERVICE@"
export QUEUE_DTN="@QUEUE_DTN@"
export PARTITION_BATCH="@PARTITION_BATCH@"
export PARTITION_SERVICE="@PARTITION_SERVICE@"
export PARTITION_DTN="@PARTITION_DTN@"
export RESERVATION="@RESERVATION@"
export CLUSTERS="@CLUSTERS@"
export CLUSTERS_SERVICE="@CLUSTERS_SERVICE@"
export CLUSTERS_DTN="@CLUSTERS_DTN@"

# Project to use in mass store:
export HPSS_PROJECT="@HPSS_PROJECT@"
2 changes: 2 additions & 0 deletions parm/config/gefs/config.resources
@@ -30,6 +30,8 @@ case ${machine} in
exit 3
esac
;;
"GAEAC5") max_tasks_per_node=128;;
"GAEAC6") max_tasks_per_node=192;;
"S4")
case ${PARTITION_BATCH} in
"s4") max_tasks_per_node=32;;
1 change: 1 addition & 0 deletions parm/config/gefs/config.resources.GAEAC5
1 change: 1 addition & 0 deletions parm/config/gefs/config.resources.GAEAC6
4 changes: 4 additions & 0 deletions parm/config/gfs/config.base
@@ -15,10 +15,14 @@ export RUN_ENVIR="emc"
export ACCOUNT="@ACCOUNT@"
export QUEUE="@QUEUE@"
export QUEUE_SERVICE="@QUEUE_SERVICE@"
export QUEUE_DTN="@QUEUE_DTN@"
export PARTITION_BATCH="@PARTITION_BATCH@"
export PARTITION_SERVICE="@PARTITION_SERVICE@"
export PARTITION_DTN="@PARTITION_DTN@"
export RESERVATION="@RESERVATION@"
export CLUSTERS="@CLUSTERS@"
export CLUSTERS_SERVICE="@CLUSTERS_SERVICE@"
export CLUSTERS_DTN="@CLUSTERS_DTN@"

# Project to use in mass store:
export HPSS_PROJECT="@HPSS_PROJECT@"
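
The new `@QUEUE_DTN@`, `@PARTITION_DTN@`, and `@CLUSTERS_DTN@` entries are template placeholders that the setup scripts fill with host-specific values. The sketch below shows one plausible way to do that kind of substitution. The helper `fill_placeholders` and the example value `dtn` are hypothetical, and the real setup scripts may handle missing keys differently.

```python
import re


def fill_placeholders(template: str, values: dict) -> str:
    """Replace @KEY@ tokens with values[KEY]; leave unknown tokens untouched."""

    def _substitute(match: re.Match) -> str:
        key = match.group(1)
        # Fall back to the original token so unresolved keys stay visible.
        return str(values.get(key, match.group(0)))

    return re.sub(r"@([A-Z][A-Z0-9_]*)@", _substitute, template)


template = 'export QUEUE_DTN="@QUEUE_DTN@"\nexport CLUSTERS_DTN="@CLUSTERS_DTN@"\n'
# Hypothetical host definition that sets a DTN queue but no DTN cluster.
print(fill_placeholders(template, {"QUEUE_DTN": "dtn"}))
```

Leaving an unresolved token in place, rather than silently writing an empty string, is a design choice of this sketch that makes a missing host definition easy to spot in the generated config.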
19 changes: 19 additions & 0 deletions parm/config/gfs/config.resources.GAEAC6
@@ -4,3 +4,22 @@

unset memory
unset "memory_${RUN}"

case ${step} in
"fcst" | "efcs")
case "${CASE}" in
"C768")
export tasks_per_node=144
;;
"C1152")
#TODO set tasks_per_node after investigating a safe threshold
;;
*)
# Nothing to do for other resolutions
true
;;
esac
;;
*)
;;
esac
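
The `case` block above lowers `tasks_per_node` to 144 for C768 forecast steps on Gaea C6, below the 192 `max_tasks_per_node` set for `GAEAC6` in config.resources. The sketch below restates that selection in Python and shows the effect on node count. The function names and the task count of 1152 are assumptions chosen only to make the arithmetic concrete.

```python
from typing import Optional

GAEAC6_MAX_TASKS_PER_NODE = 192  # from config.resources in this commit


def tasks_per_node_override(step: str, case: str) -> Optional[int]:
    """Mirror the case block: only C768 forecast steps get an override."""
    if step in ("fcst", "efcs") and case == "C768":
        return 144
    return None


def nodes_needed(ntasks: int, tasks_per_node: int) -> int:
    """Ceiling division: nodes required to host ntasks."""
    return -(-ntasks // tasks_per_node)


ntasks = 1152  # hypothetical task count, not taken from the workflow
override = tasks_per_node_override("fcst", "C768")
print(nodes_needed(ntasks, override))                   # 8
print(nodes_needed(ntasks, GAEAC6_MAX_TASKS_PER_NODE))  # 6
```

Running fewer tasks per node spreads the same job over more nodes, which leaves more memory available to each task.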
10 changes: 8 additions & 2 deletions sorc/build_compute.sh
@@ -62,6 +62,8 @@ fi

# shellcheck disable=SC2155,SC2312
HOMEgfs=$(cd "$(dirname "$(readlink -f -n "${BASH_SOURCE[0]}" )" )/.." && pwd -P)

mkdir -p "${HOMEgfs}/sorc/logs" || exit 1
cd "${HOMEgfs}/sorc" || exit 1

# Delete the rocoto XML and database if they exist
@@ -76,7 +78,9 @@ set +e
"${HOMEgfs}/workflow/build_compute.py" --yaml "${HOMEgfs}/workflow/build_opts.yaml" --systems "${systems}"
rc=$?
if (( rc != 0 )); then
echo "FATAL ERROR: ${BASH_SOURCE[0]} failed to create 'build.xml' with error code ${rc}"
msg="FATAL ERROR: ${BASH_SOURCE[0]} failed to create 'build.xml' with error code ${rc}"
echo "${msg}"
echo "${msg}" > logs/error.logs
exit 1
fi

@@ -101,8 +105,10 @@ while [[ "${finished}" == "false" ]]; do
elif [[ "${state}" == "RUNNING" ]]; then
finished=false
else
echo "FATAL ERROR: ${BASH_SOURCE[0]} rocoto failed with state '${state}'"
msg="FATAL ERROR: ${BASH_SOURCE[0]} rocoto failed with state '${state}'"
echo "${msg}"
rm -f logs/error.logs
echo "${msg}" > logs/error.logs
# Determine which build(s) failed
stat_out="$(rocotostat -w "${build_xml}" -d "${build_db}")"
echo "${stat_out}" > rocotostat.out
2 changes: 1 addition & 1 deletion sorc/link_workflow.sh
@@ -77,7 +77,7 @@ case "${machine}" in
"jet") FIX_DIR="/lfs5/HFIP/hfv3gfs/glopara/FIX/fix" ;;
"s4") FIX_DIR="/data/prod/glopara/fix" ;;
"gaeac5") FIX_DIR="/gpfs/f5/ufs-ard/world-shared/global/glopara/data/fix" ;;
"gaeac6") FIX_DIR="/gpfs/f6/bil-fire8/world-shared/global/glopara/fix" ;;
"gaeac6") FIX_DIR="/gpfs/f6/drsa-precip3/world-shared/role.glopara/fix" ;;
"noaacloud") FIX_DIR="/contrib/global-workflow-shared-data/fix" ;;
*)
echo "FATAL: Unknown target machine ${machine}, couldn't set FIX_DIR"
4 changes: 2 additions & 2 deletions versions/build.gaeac6.ver
@@ -1,5 +1,5 @@
export stack_intel_ver=2023.2.0
export stack_cray_mpich_ver=8.1.29
export spack_env=gsi-addon-dev
export spack_env=gsi-addon
source "${HOMEgfs:-}/versions/spack.ver"
export spack_mod_path="/ncrc/proj/epic/spack-stack/spack-stack-${spack_stack_ver}/envs/${spack_env}/install/modulefiles/Core"
export spack_mod_path="/ncrc/proj/epic/spack-stack/c6/spack-stack-${spack_stack_ver}/envs/${spack_env}/install/modulefiles/Core"
3 changes: 3 additions & 0 deletions versions/run.gaeac5.ver
@@ -4,5 +4,8 @@ export spack_env=gsi-addon-dev

export perl_ver=5.38.2

export hsi_mod_path="/usw/hpss/modulefiles"
export hsi_ver=9.3

source "${HOMEgfs:-}/versions/spack.ver"
export spack_mod_path="/ncrc/proj/epic/spack-stack/spack-stack-${spack_stack_ver}/envs/${spack_env}/install/modulefiles/Core"
8 changes: 8 additions & 0 deletions versions/run.gaeac6.ver
@@ -4,5 +4,13 @@ export spack_env=gsi-addon

export perl_ver=5.38.2

export hsi_mod_path="/usw/hpss/modulefiles"
export hsi_ver=9.3

source "${HOMEgfs:-}/versions/spack.ver"

# Gaea uses a newer version of Fit2Obs
export fit2obs_ver=1.1.5
# Gaea uses a newer version of the ensemble tracker as well
export ens_tracker_ver=v1.2.0
export spack_mod_path="/ncrc/proj/epic/spack-stack/c6/spack-stack-${spack_stack_ver}/envs/${spack_env}/install/modulefiles/Core"
6 changes: 3 additions & 3 deletions workflow/build_compute.py
@@ -103,13 +103,13 @@ def get_host_specs(host: Dict) -> Dict:
native = '-l place=vscatter'
elif host.info.SCHEDULER in ['slurm']:
native = '--export=NONE'
if host.info.PARTITION_BATCH not in [""]:
if host.info.get("PARTITION_BATCH", "") != "":
partition = host.info.PARTITION_BATCH

if host.info.RESERVATION not in [""]:
if host.info.get("RESERVATION", "") != "":
native += f' --reservation={host.info.RESERVATION}'

if host.info.CLUSTERS not in [""]:
if host.info.get("CLUSTERS", "") != "":
native += f' --clusters={host.info.CLUSTERS}'

specs = AttrDict()
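
The hunk above swaps direct attribute access for `.get(..., "")` when testing optional host settings. The sketch below illustrates the difference using a minimal stand-in class. `AttrDictSketch` is not wxflow's `AttrDict`, and the cluster name `c6` is only an example value, so treat the failure mode shown here as a property of the stand-in rather than a confirmed description of the library.

```python
class AttrDictSketch(dict):
    """Minimal attribute-access dict used only for this illustration."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


def slurm_native(info: AttrDictSketch) -> str:
    """Build the native Slurm flags, tolerating settings that are absent."""
    native = "--export=NONE"
    if info.get("RESERVATION", "") != "":
        native += f" --reservation={info.RESERVATION}"
    if info.get("CLUSTERS", "") != "":
        native += f" --clusters={info.CLUSTERS}"
    return native


# Host definition with a cluster but no RESERVATION key at all.
info = AttrDictSketch(SCHEDULER="slurm", CLUSTERS="c6")
print(slurm_native(info))  # --export=NONE --clusters=c6

# The older pattern fails on the missing key in this stand-in.
try:
    info.RESERVATION not in [""]
except AttributeError as err:
    print(f"AttributeError: {err}")
```

With `.get`, a host file can omit `RESERVATION`, `CLUSTERS`, or `PARTITION_BATCH` entirely instead of having to define each one as an empty string.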