[v1.21.x] Intel/CI: Cherrypick CI updates #10754

Open
wants to merge 11 commits into
base: v1.21.x

Juee14Desai
Contributor

Update the oneccl CPU stage and split it so that it runs on different partitions.

zachdworkin
zachdworkin previously approved these changes Jan 31, 2025
Juee14Desai and others added 4 commits February 6, 2025 12:24
Instead of running a single stage, split it into separate stages
so that different providers run on different partitions

Signed-off-by: Juee Himalbhai Desai <juee.himalbhai.desai@intel.com>
Signed-off-by: Juee Himalbhai Desai <juee.himalbhai.desai@intel.com>
Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Juee Himalbhai Desai <juee.himalbhai.desai@intel.com>
zachdworkin
zachdworkin previously approved these changes Feb 6, 2025
Juee14Desai and others added 2 commits February 7, 2025 10:41
    The '-' in the xfer-method option in the runmultinode.sh script causes bash
    to treat xfer and method as separate tokens instead of a single variable
    named xfer-method. This supplies invalid inputs to the fi_multinode test.
    Renaming the bash variable from xfer-method to xfer_method fixes this issue.

Signed-off-by: Juee Himalbhai Desai <juee.himalbhai.desai@intel.com>
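
To illustrate why the rename is needed: shell variable names may contain only letters, digits, and underscores, so `xfer-method` is not a valid name, and an expansion such as `${xfer-method}` is read by the shell as the variable `xfer` with the default value `method`. The following is a minimal sketch of that naming rule in Python; the helper names are hypothetical and are not part of runmultinode.sh.

```python
import re

# POSIX shell variable names: letters, digits, and underscores,
# and the first character must not be a digit.
_SHELL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_shell_name(name: str) -> bool:
    """Return True if `name` can be used as a shell variable name."""
    return _SHELL_NAME.fullmatch(name) is not None


def to_shell_name(option: str) -> str:
    """Map a long option such as 'xfer-method' to a usable variable name.

    Hypothetical helper: it only replaces '-' with '_'.
    """
    return option.replace("-", "_")
```

With these helpers, `is_valid_shell_name("xfer-method")` is false, while the renamed `xfer_method` passes the check.
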
There is currently no way to pass environment variables to the multinode
scripts. Add an option for this to runmultinode.sh, similar to the one in
runfabtests.sh

Signed-off-by: Nikhil Nanal <nikhil.nanal@intel.com>
Juee14Desai and others added 2 commits February 14, 2025 05:18
Signed-off-by: Juee Himalbhai Desai <juee.himalbhai.desai@intel.com>
Signed-off-by: Nikhil Nanal <nikhil.nanal@intel.com>
zachdworkin
zachdworkin previously approved these changes Feb 14, 2025
@zachdworkin
Contributor

Can you update the name to be [v1.21.x] Intel/CI: Cherrypick CI updates and then elaborate what they are in the description?

@Juee14Desai Juee14Desai changed the title from "v1.21.x Intel/CI: Update oneccl CPU stage" to "[v1.21.x] Intel/CI: Cherrypick CI updates" on Feb 20, 2025
amirshehataornl and others added 2 commits February 21, 2025 09:06
Made a few updates to the multinode test:

1. Accept a -x flag to turn off setting the service/node/flags
   - this is needed to work with CXI
2. Accept a -u flag to set a process manager: pmi or pmix
3. Modify the code to get the rank from the appropriate environment
   variable if a process manager is specified.
4. Add a runmultinode.py script which enables users to run the test
   using a backing process manager. The Python script takes a YAML
   configuration file which defines the environment and test. An example
   YAML configuration file:

multinode:
    environment:
        FI_MR_CACHE_MAX_SIZE: -1
        FI_MR_CACHE_MAX_COUNT: 524288
        FI_SHM_USE_XPMEM: 1
        FI_LOG_LEVEL: info
    bind-to: core
    map-by-count: 1
    map-by: l3cache
    pattern: full_mesh
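
As a rough sketch of how the `environment` section of such a configuration could be applied, the snippet below merges those keys into the environment passed to the launched test. It is an illustration only, not the code in runmultinode.py: the `build_env` helper is hypothetical, and the dictionary stands in for the result of loading the YAML file with a YAML parser.

```python
import os

# Stand-in for the parsed YAML configuration shown above.
config = {
    "multinode": {
        "environment": {
            "FI_MR_CACHE_MAX_SIZE": -1,
            "FI_MR_CACHE_MAX_COUNT": 524288,
            "FI_SHM_USE_XPMEM": 1,
            "FI_LOG_LEVEL": "info",
        },
        "bind-to": "core",
        "map-by-count": 1,
        "map-by": "l3cache",
        "pattern": "full_mesh",
    }
}


def build_env(cfg, base=None):
    """Return a copy of the base environment with the config values added.

    Hypothetical helper. Values are converted to strings because process
    environments only hold strings (YAML parses -1 and 524288 as integers).
    """
    env = dict(os.environ if base is None else base)
    for key, value in cfg["multinode"].get("environment", {}).items():
        env[key] = str(value)
    return env
```

The resulting dictionary could then be handed to something like `subprocess.run(cmd, env=build_env(config))`, so the settings apply to the test without changing the caller's own environment.
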

Script Usage:
usage: runmultinode.py [-h] [--dry-run] [--ci CI] [-C CAPABILITY]
                       [-i ITERATIONS] [-l {internal,srun,mpirun}]
                       [-p PROVIDER] [-np NUM_PROCS] [-c CONFIG]
                       [-t PROCS_PER_NODE]

libfabric multinode test with slurm

optional arguments:
  -h, --help            show this help message and exit
  --dry-run             Perform a dry run without making any changes.
  --ci CI               Commands to prepend to test call. Only used with the
                        internal launcher option
  -C CAPABILITY, --capability CAPABILITY
                        libfabric capability
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations
  -l {internal,srun,mpirun}, --launcher {internal,srun,mpirun}
                        launcher to use for running job. If nothing is
                        specified, test manages processes internally.
                        Available options: internal, srun and mpirun

Required arguments:
  -p PROVIDER, --provider PROVIDER
                        libfabric provider
  -np NUM_PROCS, --num-procs NUM_PROCS
                        Map process by node, l3cache, etc
  -c CONFIG, --config CONFIG
                        Test configuration

Required if using srun:
  -t PROCS_PER_NODE, --procs-per-node PROCS_PER_NODE
                        Number of procs per node

Running the script:

runmultinode.py -p cxi -i 1 --procs-per-node 8 --num-procs 8 -l srun -c mn.yaml

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
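
The rank lookup described in item 3 of the commit message above can be sketched as follows. This is a Python illustration under stated assumptions, not the test's actual implementation: it assumes the `pmi` and `pmix` process managers export the rank as `PMI_RANK` and `PMIX_RANK` respectively, and the function name `rank_from_env` is hypothetical.

```python
import os

# Assumed mapping from the process manager name (the -u argument)
# to the environment variable that carries this process's rank.
RANK_ENV = {"pmi": "PMI_RANK", "pmix": "PMIX_RANK"}


def rank_from_env(process_manager, environ=None):
    """Return this process's rank as reported by the process manager."""
    environ = os.environ if environ is None else environ
    try:
        var = RANK_ENV[process_manager]
    except KeyError:
        raise ValueError(f"unknown process manager: {process_manager!r}") from None
    value = environ.get(var)
    if value is None:
        raise RuntimeError(f"{var} is not set for process manager {process_manager}")
    return int(value)
```

Failing loudly when the variable is missing avoids silently treating every process as rank 0 when the job was not started by the expected process manager.
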
The ubertest implementation currently requires FI_RMA_EVENT when using RMA with counters.
This causes tcp to return ENODATA for these combinations, which makes runfabtests
fail.
ubertest should be updated to not require it, but remove these tests for now.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Update arguments to avoid conflicts with other tests.

Signed-off-by: Alex McKinley <alex.mckinley@intel.com>
6 participants