Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Conda runtime to our CI integration tests #223

Merged
merged 3 commits into from
Oct 19, 2022
Merged

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Oct 6, 2022

Splits up the big initial command block to setup runtimes into single-command steps so we can condition runtime setup on host OS support (instead of trying setup on all OSes and ignoring errors).

Related issue(s)

Based on #222, because nextstrain setup docker is handy there.

Testing

  • CI passes

@tsibley
Copy link
Member Author

tsibley commented Oct 6, 2022

Hmm. All the CI jobs for macOS and Linux seem to have stalled out during the new nextstrain setup conda step. Windows is unaffected because that step isn't run on it. Curiously, the last output visible in the GitHub Actions logs is Linking openmpi (if it comes before nextclade) or Linking nextclade (if not). Both of those Conda packages have custom post-link scripts (e.g. installed as ~/.nextstrain/runtimes/conda/env/bin/.openmpi-post-link.sh) which are run at that point, so this may indicate something weird is going on with them there? But maybe also we're just missing output, because there's output from nextstrain setup that is missing from the logs too (we're only seeing the output of micromamba create).

@tsibley
Copy link
Member Author

tsibley commented Oct 7, 2022

I've been debugging this more this morning.

Re-running the GitHub Actions jobs hasn't worked. Issue is reliably reproducible every time.

I tried reproducing locally first by setting CI=true and redirecting stdin off my tty:

CI=true nextstrain setup conda </dev/null

This unfortunately worked fine, although revealed that the slightly different micromamba output is from those, which I confirmed by looking at the source code. (I still don't know why there's seemingly missing output from nextstrain setup though. We don't adjust this output based on CI or stdin being a tty.)

Thinking about the issue a bit more, I thought maybe the issue was that in CI we're running nextstrain from inside an activated Conda environment and something about micromamba create in that context is having issues. So I tried by first replicating a similar environment locally:

mamba create -p /tmp/stall python=3.8 augur auspice snakemake
conda activate /tmp/stall
python3 -m pip install ~/nextstrain/cli/dist/nextstrain_cli-5.0.0.dev0+git-py3-none-any.whl

and then running:

CI=true NEXTSTRAIN_HOME=/tmp/stall/nextstrain nextstrain setup conda </dev/null

The NEXTSTRAIN_HOME is to avoid it looking in my home dir for existing stuff. And I'm not sure CI and </dev/null are needed to reproduce, but I used them.

\o/ This ended up stalling out during "Linking openmpi…"!

Process tree:

$ pstree -alp 6042
nextstrain,6042 /tmp/stall/bin/nextstrain setup conda
  └─micromamba,6048 --root-prefix /tmp/stall/nextstrain/runtimes/conda/micromamba --no-rc --no-env --yes create --prefix /tmp/stall/nextstrain/runtimes/conda/env --override-channels --strict-channel-priority --channel conda-forge --channel bioconda --channel defaults augur auspice nextalign nextclade nextstrain-cli bash epiweeks git pangolearn pangolin snakemake
      ├─bash,7507 /tmp/mambafSONHzWTsLX
      │   └─micromamba,7510 shell --shell bash activate /tmp/stall/nextstrain/runtimes/conda/env
      │       └─{micromamba},7511
      └─{micromamba},6049

top showed PID 7510 pegging a CPU, so I straced it to see what it was doing:

$ sudo strace -fp 7510
strace: Process 7510 attached with 2 threads
strace: [ Process PID=7510 runs in x32 mode. ]
[pid  7511] rt_sigtimedwait([INT], 

It's waiting (in a thread 7511) for a SIGINT. Not sure why strace can't see/report the full syscall signature which includes the timeout value, but it appears to wait indefinitely in practice (GitHub Actions jobs stalled for over an hour here).

The content of the shell wrapper script in the process tree is:

export MAMBA_EXE='/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba'
eval "$($MAMBA_EXE 'shell' 'hook' '-s' 'bash' '-p' '/tmp/stall/nextstrain/runtimes/conda/micromamba')"
micromamba activate "/tmp/stall/nextstrain/runtimes/conda/env"

. /tmp/stall/nextstrain/runtimes/conda/env/bin/.openmpi-post-link.sh

Observations:

  1. No idea why it's waiting for a SIGINT here. Would be good to take a quick look into the source.
  2. I'm surprised to see that the call to micromamba shell --shell bash activate …. Why is this happening? Is this from the eval in the wrapper script? Or part of micromamba activate? Try to reproduce issue more narrowly with just that inside the outer activated env.
  3. The issue is triggered by packages with post-link steps, but not a fault of the post-link scripts themselves, as I first thought, as they are never even reached.
  4. I can suspend (SIGSTOP) the nextstrain setup conda process to unpeg the CPU and then resume to re-peg.

Will dig more after lunch.

@tsibley
Copy link
Member Author

tsibley commented Oct 7, 2022

I think the thread waiting on SIGINT is part of libmamba's normal thread handling and not indicative of anything amiss. The thread that's blocked in rt_sigtimedwait() on receiving a SIGINT is also (of course) not the thread that's pegging the CPU since it's blocked in a syscall. So what is the main thread (PID 7510) doing? It's still unclear.

The micromamba shell --shell bash activate <path> call happens like this.

First, the wrapper script (/tmp/mambaXXXXX…) eval's micromamba's shell hook with:

eval "$($MAMBA_EXE 'shell' 'hook' '-s' 'bash' '-p' '/tmp/stall/nextstrain/runtimes/conda/micromamba')"

This shell hook is emitted and evaled just fine under all conditions I've tried. It's not the proximate issue.

Next, the wrapper activates the prefix path with:

micromamba activate "/tmp/stall/nextstrain/runtimes/conda/env"

Here, micromamba is the shell function (wrapping the binary executable) installed by the shell hook.

Shell hook source
# Copyright (C) 2012 Anaconda, Inc
# SPDX-License-Identifier: BSD-3-Clause

__add_sys_prefix_to_path() {
    # In dev-mode MAMBA_EXE is python.exe and on Windows
    # it is in a different relative location to condabin.
    if [ -z "${MAMBA_ROOT_PREFIX}" ]; then
        return 0
    fi;

    if [ -n "${WINDIR+x}" ]; then
        PATH="${MAMBA_ROOT_PREFIX}/bin:${PATH}"
        PATH="${MAMBA_ROOT_PREFIX}/Scripts:${PATH}"
        PATH="${MAMBA_ROOT_PREFIX}/Library/bin:${PATH}"
        PATH="${MAMBA_ROOT_PREFIX}/Library/usr/bin:${PATH}"
        PATH="${MAMBA_ROOT_PREFIX}/Library/mingw-w64/bin:${PATH}"
        PATH="${MAMBA_ROOT_PREFIX}:${PATH}"
    else
        PATH="${MAMBA_ROOT_PREFIX}/bin:${PATH}"
    fi
    \export PATH
}


__conda_hashr() {
    if [ -n "${ZSH_VERSION:+x}" ]; then
        \rehash
    elif [ -n "${POSH_VERSION:+x}" ]; then
        :  # pass
    else
        \hash -r
    fi
}

__mamba_activate() {
    \local cmd="$1"
    shift
    \local ask_conda
    CONDA_INTERNAL_OLDPATH="${PATH}"
    __add_sys_prefix_to_path
    ask_conda="$(PS1="$PS1" "/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba" shell --shell bash "$cmd" "$@")" || \return $?
    rc=$?
    PATH="${CONDA_INTERNAL_OLDPATH}"
    \eval "$ask_conda"
    if [ $rc != 0 ]; then
        \export PATH
    fi
    __conda_hashr
}

__mamba_reactivate() {
    \local ask_conda
    CONDA_INTERNAL_OLDPATH="${PATH}"
    __add_sys_prefix_to_path
    ask_conda="$(PS1="$PS1" "/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba" shell --shell bash reactivate)" || \return $?
    PATH="${CONDA_INTERNAL_OLDPATH}"
    \eval "$ask_conda"
    __conda_hashr
}

micromamba() {
    if [ "$#" -lt 1 ]; then
        "/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba"
    else
        \local cmd="$1"
        shift
        case "$cmd" in
            activate|deactivate)
                __mamba_activate "$cmd" "$@"
                ;;
            install|update|upgrade|remove|uninstall)
                CONDA_INTERNAL_OLDPATH="${PATH}"
                __add_sys_prefix_to_path
                "/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba" "$cmd" "$@"
                \local t1=$?
                PATH="${CONDA_INTERNAL_OLDPATH}"
                if [ $t1 = 0 ]; then
                    __mamba_reactivate
                else
                    return $t1
                fi
                ;;
            *)
                CONDA_INTERNAL_OLDPATH="${PATH}"
                __add_sys_prefix_to_path
                "/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba" "$cmd" "$@"
                \local t1=$?
                PATH="${CONDA_INTERNAL_OLDPATH}"
                return $t1
                ;;
        esac
    fi
}

if [ -z "${CONDA_SHLVL+x}" ]; then
    \export CONDA_SHLVL=0
    # In dev-mode MAMBA_EXE is python.exe and on Windows
    # it is in a different relative location to condabin.
    if [ -n "${_CE_CONDA+x}" ] && [ -n "${WINDIR+x}" ]; then
        PATH="${MAMBA_ROOT_PREFIX}/condabin:${PATH}"
    else
        PATH="${MAMBA_ROOT_PREFIX}/condabin:${PATH}"
    fi
    \export PATH

    # We're not allowing PS1 to be unbound. It must at least be set.
    # However, we're not exporting it, which can cause problems when starting a second shell
    # via a first shell (i.e. starting zsh from bash).
    if [ -z "${PS1+x}" ]; then
        PS1=
    fi
fi

if [ -n "${ZSH_VERSION:+x}" ]; then
  if ! command -v compinit > /dev/null; then
    autoload -U +X compinit && if [[ "${ZSH_DISABLE_COMPFIX-}" = true ]]; then
      compinit -u
    else
      compinit
    fi
  fi
  autoload -U +X bashcompinit && bashcompinit

  _umamba_zsh_completions()
  {
    COMPREPLY=($($MAMBA_EXE completer "${(@s: :)${(@s: :)COMP_LINE}:1}"))
  }

  complete -o default -F _umamba_zsh_completions micromamba
fi
if [ -n "${BASH_VERSION:+x}" ]; then
  _umamba_bash_completions()
  {
    COMPREPLY=($($MAMBA_EXE completer "${COMP_WORDS[@]:1}"))
  }
  complete -o default -F _umamba_bash_completions micromamba
fi

That shell function ends up calling:

/tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba shell -s bash activate /tmp/stall/nextstrain/runtimes/conda/env

in order to capture output and eval it in the context of the current shell. This is the bulk of what micromamba activate does. And it's this micromamba shell invocation that spins, never returning.

Crucially, two things must be true for me to reproduce the spinning:

  1. HOME must be set to something that's not my original, actual home dir. This is something our Conda runner does.
  2. CONDA_PROMPT_MODIFIER must be set to the empty string, which is the case in my exported environment inside a separate, already-activated Conda env. If unset or set to a non-empty string, the process doesn't spin!

Wild.

I tested unsetting all other Conda-related env vars I could think of/find and none else seemed to matter. Here's what's in the already-activated Conda env:

CONDA_SHLVL=1
CONDA_EXE=/home/tom/.conda/bin/conda
CONDA_PREFIX=/tmp/stall
CONDA_PYTHON_EXE=/home/tom/.conda/bin/python
_CE_CONDA=
CONDA_PROMPT_MODIFIER=
CONDA_DEFAULT_ENV=/tmp/stall

Confirming the issue is related to the prompt-modification code in libmamba, I can also set MAMBA_CHANGEPS1=false (which overrides the default config param change_ps1) and the issue will no longer occur.

The possible code paths this implicates in libmamba/micromamba are a bit sprawling and complicated, as it's part of the intricate activation/deactivation/reactivation/env stacking/unstacking logic.

I want a debugger attached and a backtrace, but I don't know how easy that will be.

@tsibley
Copy link
Member Author

tsibley commented Oct 7, 2022

$ (HOME=/tmp/stall/nextstrain/runtimes/conda; gdb --args /tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba shell -s bash activate /tmp/stall/nextstrain/runtimes/conda/env)
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba...(no debugging symbols found)...done.
(gdb) show args
Argument list to give program being debugged when it is started is "shell -s bash activate /tmp/stall/nextstrain/runtimes/conda/env".
(gdb) run
Starting program: /tmp/stall/nextstrain/runtimes/conda/micromamba/bin/micromamba shell -s bash activate /tmp/stall/nextstrain/runtimes/conda/env
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6b20700 (LWP 18500)]
^\
Thread 1 "micromamba" received signal SIGQUIT, Quit.
0x0000555555f98fce in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find(char const*, unsigned long, unsigned long) const ()
(gdb) bt
#0  0x0000555555f98fce in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find(char const*, unsigned long, unsigned long) const ()
#1  0x00005555559468d7 in mamba::replace_all(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#2  0x0000555555a0dd1f in mamba::PosixActivator::update_prompt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#3  0x0000555555a152aa in mamba::Activator::build_activate(fs::u8path const&) ()
#4  0x0000555555a15d93 in mamba::Activator::activate[abi:cxx11](fs::u8path const&, bool) ()
#5  0x0000555555a02055 in mamba::shell(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#6  0x000055555583843d in ?? ()
#7  0x0000555555880f18 in ?? ()
#8  0x00005555557af963 in main ()
(gdb) 

@tsibley
Copy link
Member Author

tsibley commented Oct 7, 2022

Backtrace lands right in the code I'd been looking at related to prompt modification, but a few function calls deeper than I'd gotten. Nice.

I need to spend some more time with the debugger, but I'm pretty sure I found the infinite loop.

I believe that this replace_all() call where the search string is the empty string (from CONDA_PROMPT_MODIFIER) and the replacement string is the empty string:

https://github.com/mamba-org/mamba/blob/cb2342a00953580abe676c8de80d3322a252066f/libmamba/src/core/activation.cpp#L781-L785

…leads to this implementation with basic_string::find() and basic_string::replace() from the stdlib:

https://github.com/mamba-org/mamba/blob/cb2342a00953580abe676c8de80d3322a252066f/libmamba/src/core/util.cpp#L322-L330

and that ends up in an infinite loop never advancing pos and replacing the first empty string with itself over and over and over again.

@tsibley
Copy link
Member Author

tsibley commented Oct 7, 2022

Yep, infinite loop confirmed with a tiny C++ program exercising those same methods. Compile it with g++ bug.c and then run ./a.out.

data = «test» (size = 4)
search = «» (size = 0)
replace = «» (size = 0)
----
pos = 0
pos == npos? nope
----
replace(0, 0, "")
data = «test» (size = 4)
----
pos = 0
pos == npos? nope

tsibley added a commit to tsibley/mamba that referenced this pull request Oct 7, 2022
The empty string is found at the start of the string, pos = 0, and
replaced over and over again.  If the replacement string is also empty,
the size of the data string will not change.  If the replacement string
is not empty, it will be repeatedly prepended to the data string.

While most callers are careful to ensure the search string is not empty,
PosixActivator::update_prompt() lost this property in "Micromamba:
Substitute env vars in .condarc (mamba-org#1423)" (9e5f8fd).  Rather than restore
the check in the caller, fix the root of the issue in replace_all()
itself.

I ran into this bug as a Micromamba user in a particular set of
circumstances where the CONDA_PROMPT_MODIFIER env var was the empty
string.¹  It manifested to me the user as a `micromamba create`
invocation hanging at the linking phase for any package with a post-link
script (such as openmpi).  The generated post-link script wrapper
invoked, indirectly, `micromamba shell --shell bash activate <path>`
which was hanging during update_prompt().

¹ nextstrain/cli#223 (comment)
@tsibley
Copy link
Member Author

tsibley commented Oct 8, 2022

Submitted a PR to fix libmamba. A little closer over there to understanding why HOME matters—MAMBA_NO_RC=true has same effect—but still not sure.

In any case, my fix for our runtime here allowed CI to get past nextstrain setup conda! \o/

…and now test-dist jobs are failing during the zika-tutorial build with --conda. orz

Progress!

@tsibley tsibley force-pushed the trs/ci/conda-runtime branch from ed66080 to d07fde6 Compare October 12, 2022 18:27
@tsibley
Copy link
Member Author

tsibley commented Oct 12, 2022

I adjusted last week's fix to increase isolation.

…and now test-dist jobs are failing during the zika-tutorial build with --conda. orz

These are now resolved. Recapping Slack this morning

So these were running into this Snakemake bug because the bug was only fixed in Snakemake ≥7.15.2 but we were getting 6.8.0 because that's the upper limit required by pangolin.

The Pangolin Conda recipe also requires exactly tabulate 0.8.10 which should in theory avoid that bug on Snakemake 6.8.0, but during env resolution somehow tabulate 0.9.0 is getting installed instead.

Ah! The pin to tabulate 0.8.10 was only added yesterday. So it works now. The issue was the pin was missing from pangolin's recipe last week.

@tsibley tsibley requested a review from a team October 12, 2022 20:46
Base automatically changed from trs/docker/setup-download-image to master October 12, 2022 21:01
@tsibley tsibley force-pushed the trs/ci/conda-runtime branch from d07fde6 to 151de50 Compare October 12, 2022 21:09
@tsibley
Copy link
Member Author

tsibley commented Oct 12, 2022

Repushed to rebase onto latest master given merge of #222 and then added a changelog commit.

Splits up the big initial command block to setup runtimes into
single-command steps so we can condition runtime setup on host OS
support (instead of trying setup on all OSes and ignoring errors).
…ivated environments

Filter out all CONDA_* and MAMBA_* env vars and set the root prefix and
no rc option for descendant processes of our micromamba invocation.

The CONDA_* vars are set during Conda environment activation and can
impact the operation and behaviour of Micromamba.  We don't need or want
that impact, as we don't want to make use of any of the activation
features of Conda/Mamba/Micromamba.
@tsibley tsibley force-pushed the trs/ci/conda-runtime branch from 151de50 to f69c8f8 Compare October 12, 2022 21:57
@tsibley
Copy link
Member Author

tsibley commented Oct 12, 2022

Rebased onto latest master again for merge of #224.

@tsibley
Copy link
Member Author

tsibley commented Oct 19, 2022

Going to merge this pre-review as it's important to be testing the Conda runtime in CI, esp. for other work.

@tsibley tsibley merged commit f5637f8 into master Oct 19, 2022
@tsibley tsibley deleted the trs/ci/conda-runtime branch October 19, 2022 21:21
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looked ugly! Glad you got it resolved. Post-merge approval 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants