Update g-w to cycle GDASApp #1067

Closed
RussTreadon-NOAA opened this issue Oct 11, 2022 · 11 comments · Fixed by #1091
Labels: maintenance (Regular updates and maintenance work)

Comments

@RussTreadon-NOAA

Expected behavior
g-w develop jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP should run to completion.

Current behavior
JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP fails with the following errors:

/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 82: BASH_SOURCE: unbound variable
/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 82: BASH_SOURCE: unbound variable
/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 82: BASH_SOURCE: unbound variable
+++ [82]: env
+++ [82]: grep -o '[^=]*CONDA[^=]*'
+++ [82]: grep -v 'CONDA_ENVS_PATH\|CONDA_PKGS_DIRS\|CONDARC'
/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 82: BASH_SOURCE: unbound variable
++ [82]: unset CONDA_SHLVL CONDA_EXE _CE_CONDA CONDA_PYTHON_EXE
/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 82: BASH_SOURCE: unbound variable
++ [82]: unset prefix
/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 82: BASH_SOURCE: unbound variable
++ [82]: conda deactivate

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.


/work/noaa/da/Russ.Treadon/git/global_workflow/develop/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP: line 1: BASH_SOURCE: unbound variable
+ [1]: postamble JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP 1665512197 1
+ preamble.sh[68]: set +x
End JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP at 18:16:49 with error code 1 (time elapsed: 00:00:12)

Machines affected
Orion. The failure was observed while running a C96L127 parallel using GDASApp-based (UFS-based) DA.

To Reproduce

  1. Install g-w develop at e8ef5fc.
  2. Set up EXPDIR with export DO_JEDIVAR="YES" in config.base.
  3. Populate ROTDIR for UFS-based DA.
  4. Submit gdasatmanalprep.

A log file with the reported error is /work/noaa/stmp/rtreadon/comrot/prgdasens4/logs/2021122100/gdasatmanalprep.log

Context
Closed issue #1015 reported another case of $BASH_SOURCE errors, but that failure differs from the one reported in this issue.

Detailed Description
The section of JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP in which the failure occurs is

# NOTE BELOW IS A HACK FOR TESTING
# PLEASE FIX THIS LATER
# ASK @aerorahul
# HOW TO HANDLE DIFFERENT COMPILERS/ETC. FOR MODEL VS DA
# PROD_UTIL, ETC. DO NOT EXIST FOR JEDI MODULE VERSIONS
module purge
module use $HOMEgfs/sorc/gdas.cd/modulefiles
module load GDAS/orion
export PYTHONPATH=$HOMEgfs/sorc/gdas.cd/ush/:$PYTHONPATH

According to gdasatmanalprep.log the failure occurs on the module purge line.

Possible Implementation
As noted in the comments in JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP, this module purge, use, and load sequence should not be in the j-job. This section of the job needs to be refactored. What does the g-w team recommend?
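
One possible direction is to move this block into a standalone ush script that the calling (rocoto) job sources before invoking the j-job. A minimal sketch follows; the script name load_jedi_modules.sh and the MACHINE handling are assumptions based on the discussion in this issue, not a committed implementation.

#! /usr/bin/env bash
# Hypothetical ush/load_jedi_modules.sh (sketch only).
# Intended to be sourced by the rocoto job wrapper, not by the j-job,
# so the j-job no longer purges/loads modules itself.

HOMEgfs="${HOMEgfs:?HOMEgfs must be set before sourcing this script}"
MACHINE="${MACHINE:-orion}"   # assumed to be exported by the calling job

module purge
module use "${HOMEgfs}/sorc/gdas.cd/modulefiles"
module load "GDAS/${MACHINE}"

export PYTHONPATH="${HOMEgfs}/sorc/gdas.cd/ush/:${PYTHONPATH:-}"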

RussTreadon-NOAA added the bug (Something isn't working) label on Oct 11, 2022
@RussTreadon-NOAA

Tagging @CoryMartin-NOAA for awareness.

@WalterKolczynski-NOAA commented Oct 11, 2022

Are you loading conda in your .bashrc? For some reason, conda doesn't reset correctly even after a module purge, and it doesn't play nicely with the preamble's trace ($PS4) setting, which is what produces all of the BASH_SOURCE: unbound variable messages.

The ultimate error is something different, but it is also due to conda issues.
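
For the record, a minimal sketch of that interaction (the conda path is a placeholder, and the PS4 below only mimics the spirit of preamble.sh):

#! /usr/bin/env bash
# Sketch only: reproduces the class of error described above.
set -u
export PS4='+ ${BASH_SOURCE}[${LINENO}]: '
set -x

# With set -u in effect, any PS4 expansion that references an unset
# BASH_SOURCE fails with "BASH_SOURCE: unbound variable". Per the log
# above, conda's shell hooks hit such contexts during deactivation.
source /path/to/miniconda3/etc/profile.d/conda.sh   # placeholder path
conda deactivate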

@RussTreadon-NOAA

Thank you, @WalterKolczynski-NOAA, for this information. I do not load conda in my .bashrc.

@aerorahul

@RussTreadon-NOAA and @aerorahul had a side discussion on this topic, and it was agreed that, as a temporary measure, an equivalent load_jedi_modules.sh will be created.
The full discussion included introducing job-specific modulefiles in the near term for new jobs.

@RussTreadon-NOAA

The load_jedi_modules.sh approach will not currently work.

config.base contains variables defined by modules loaded by module_base.$machine.lua. These variables are not defined by the spack-stack modules. For example, config.base contains

export NCDUMP="$NETCDF/bin/ncdump"

module_base.$machine.lua loads netcdf/4.7.2. This module defines NETCDF. The modules loaded by GDAS/$machine.lua do not define NETCDF. Thus, when config.base is sourced by JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP, the job aborts because NETCDF is unbound (not defined).
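
As an illustration of the failure mode (not a proposed fix for config.base): under set -u, referencing the unset NETCDF aborts the sourcing shell, whereas a parameter-expansion default would let the line survive.

#! /usr/bin/env bash
set -u

# Aborts with "NETCDF: unbound variable" when no module has defined NETCDF:
# export NCDUMP="$NETCDF/bin/ncdump"

# Defensive alternative: default to empty and validate explicitly.
export NCDUMP="${NETCDF:-}/bin/ncdump"
if [[ -z "${NETCDF:-}" ]]; then
  echo "WARNING: NETCDF is not set; NCDUMP is invalid" >&2
fi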

UTILROOT is another example. config.base contains

export DBNROOT=${DBNROOT:-${UTILROOT}/fakedbn}

module_base.$machine.lua loads prod_util/1.2.2. This module defines UTILROOT. GDAS/$machine.lua does not load prod_util; no prod_util module is available in the current spack-stack. Thus, JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP again aborts with an unbound variable, UTILROOT.

One final example is CRTM_FIX. module_base.$machine.lua defines CRTM_FIX. GDAS/$machine.lua does not define CRTM_FIX nor do any loaded modules define CRTM_FIX. Thus, JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP aborts with an unbound variable, CRTM_FIX.

For the load_jedi_modules.sh approach to work we will need to update GDAS/$machine.lua, add missing modules to spack-stack, or find other ways to define unbound variables. A GDASApp issue will be opened to document the required changes.

Tagging @CoryMartin-NOAA and @guillaumevernieres for awareness.

@aerorahul

It seems spack-stack modules need variables to be added to support existing applications.

@RussTreadon-NOAA

Agreed.

A similar comment applies to some variables in hpc-stack. /apps/contrib/NCEP/libs/hpc-stack/modulefiles/compiler/intel/2018.4/crtm/2.3.0.lua does not define CRTM_FIX; CRTM_FIX is instead defined in the g-w module_base.$machine. It would be preferable to define CRTM_FIX in the crtm module itself.

In addition to adding variables to spack-stack modules, we also need to add some production modules to spack-stack. For example, there is no prod_util module in the spack-stack currently installed on Orion.

@WalterKolczynski-NOAA

As long as all these variables are defined in the WCOSS2 versions of these modules, the versions on other machines, whether in hpc-stack or spack-stack, really need to define them too. This is an NCEP-LIBS issue.

@RussTreadon-NOAA

Changes pertaining to this issue will be committed to g-w branch feature/updates_for_GDASApp. This branch originated from develop at fd771cb.

@RussTreadon-NOAA

Changes committed at 1a1004f

New file ush/load_ufsda_modules.sh loads the modules used by GDASApp jobs in g-w. These modules correspond to a more recent GDASApp hash, so the GDASApp hash in sorc/checkout.sh has been updated to the current head of GDASApp develop (0332c17). The GSI hash has also been updated to the current head of its develop (48d8676). Note: "current" refers to the date of this comment, 10/17/2022.

As noted earlier in this issue, GDASApp j-jobs were loading the modulefiles required by GDASApp executables. Hash 1a1004f removes these loads from the GDASApp j-jobs; GDASApp modules are now loaded in the GDASApp rocoto jobs, consistent with GSI-based DA jobs.
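
The resulting call pattern in a rocoto job presumably looks like the sketch below; the wrapper name and status handling are illustrative, modeled on existing g-w rocoto wrappers rather than quoted from 1a1004f.

#! /usr/bin/env bash
# Hypothetical jobs/rocoto/atmanalprep.sh (sketch only).

source "${HOMEgfs}/ush/preamble.sh"

# Load GDASApp modules in the rocoto wrapper, mirroring GSI-based DA jobs,
# instead of purging/loading modules inside the j-job.
source "${HOMEgfs}/ush/load_ufsda_modules.sh"
status=$?
(( status != 0 )) && exit "${status}"

export job="gdasatmanalprep"

"${HOMEgfs}/jobs/JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP"
status=$?
exit "${status}"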

Attempts to run GDASApp jobs on Orion failed with an error message stating that python module solo was not found. For example, here's the error message from gdasatmanalprep.log.

Traceback (most recent call last):
  File "/work/noaa/da/Russ.Treadon/git/global_workflow/develop/scripts/exgdas_global_atmos_analysis_prep.py", line 37, in <module>
    import ufsda
  File "/work/noaa/da/Russ.Treadon/git/global_workflow/develop/sorc/gdas.cd/ush/ufsda/__init__.py", line 2, in <module>
    from .ufs_yaml import gen_yaml, parse_config
  File "/work/noaa/da/Russ.Treadon/git/global_workflow/develop/sorc/gdas.cd/ush/ufsda/ufs_yaml.py", line 3, in <module>
    from solo.yaml_file import YAMLFile
ModuleNotFoundError: No module named 'solo.yaml_file'
+ JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP[1]: postamble JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP 1666023659 1
+ preamble.sh[68]: set +x
End JGDAS_GLOBAL_ATMOS_ANALYSIS_PREP at 16:21:03 with error code 1 (time elapsed: 00:00:04)
+ atmanalprep.sh[1]: postamble atmanalprep.sh 1666023606 1
+ preamble.sh[68]: set +x
End atmanalprep.sh at 16:21:03 with error code 1 (time elapsed: 00:00:57)

Adding pip list to ush/load_ufsda_modules.sh confirmed that solo was not loaded even though it is part of the python environment loaded by GDAS/orion.lua. A closer examination of the failed job log file found the following error at the top of the log file.

+ . /work/noaa/da/Russ.Treadon/git/global_workflow/develop/ush/load_ufsda_modules.sh
++ [[ NO == \N\O ]]
++ echo 'Loading modules quietly...'
Loading modules quietly...
++ set +x
/work2/noaa/da/python/opt/core/miniconda3/4.6.14/etc/profile.d/conda.sh: line 55: PS1: unbound variable

While PS1 is defined in my .bashrc, this setting is not present in the job run environment, as confirmed by an echo statement added to atmanalprep.sh. As a test, I added my .bashrc setting for PS1 to atmanalprep.sh. With this change, atmanalprep.sh failed in a different location:

PS1 is \[\e[31m\]\h\[\e[m\]:\[\e[32m\]\w\[\e[m\]\$  before preamble.sh
Begin atmanalprep.sh at Mon Oct 17 16:27:26 UTC 2022
+ atmanalprep.sh[13]: . /work/noaa/da/Russ.Treadon/git/global_workflow/develop/ush/load_ufsda_modules.sh
++ load_ufsda_modules.sh[4]: [[ NO == \N\O ]]
++ load_ufsda_modules.sh[5]: echo 'Loading modules quietly...'
Loading modules quietly...
++ load_ufsda_modules.sh[6]: set +x
/work2/noaa/da/python/opt/core/miniconda3/4.6.14/envs/gdasapp/etc/conda/deactivate.d/proj4-deactivate.sh: line 5: _CONDA_SET_PROJ_LIB: unbound variable
End atmanalprep.sh at 16:27:35 with error code 1 (time elapsed: 00:00:09)

Both failures suggest an issue (or feature) in the Orion miniconda3/4.6.14 installation. Via trial and error it was found that turning off strict error checking prior to sourcing preamble.sh and then turning it back on immediately after preamble.sh, as shown below,

#! /usr/bin/env bash

export STRICT="NO"
source "$HOMEgfs/ush/preamble.sh"
export STRICT="YES"

resulted in atmanalprep.sh running to completion on Orion.
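
This works on the assumption that preamble.sh consults STRICT before enabling strict mode, along these lines (a sketch, not a quote from preamble.sh):

# Assumed logic inside preamble.sh (illustrative only):
if [[ "${STRICT:-YES}" == "YES" ]]; then
  set -eu   # abort on errors and on unset variables
fi
set -x      # command tracing remains on either way

With STRICT="NO" in the rocoto wrapper, the module/conda setup runs without set -eu; exporting STRICT="YES" afterward means the j-job, which sources preamble.sh again, runs in strict mode.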

This failure does not occur when executing GDASApp jobs on Hera. So while 1a1004f adds the STRICT no and yes lines to GDASApp rocoto jobs, this change is only a patch until the miniconda3/4.6.14 behavior on Orion is better understood.

@RussTreadon-NOAA

Note: The g-w changes committed to feature/updates_for_GDASApp at 1a1004f require the use of GDASApp branch feature/updates_for_gw. See GDASApp issue #154 for details.

RussTreadon-NOAA changed the title from "BASH_SOURCE: unbound variable fails with unbound_variable." to "Update g-w to cycle GDASApp" on Oct 20, 2022
RussTreadon-NOAA added the task and maintenance (Regular updates and maintenance work) labels and removed the bug (Something isn't working) and task labels on Oct 21, 2022