
Run failure with less than 192 cores in latest access-om3 #156

Closed
minghangli-uni opened this issue May 3, 2024 · 5 comments

@minghangli-uni
Contributor

minghangli-uni commented May 3, 2024

With the latest build at /g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/access-om3-d6813d6b9e1df560ac3f6ba6a605daab9cfd9569_main-5pjh7z2/bin/access-om3-MOM6-CICE6, an error message appears when running the 0.25deg configuration with fewer than 192 cores, but it doesn't seem particularly useful:

Obtained 10 stack frames.
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(print_trace+0x29) [0x1529a5752b19]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(pio_err+0xab) [0x1529a5752acb]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(PIOc_Init_Intracomm+0x497) [0x1529a57551c7]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(PIOc_Init_Intracomm_from_F90+0x34) [0x1529a5754cc4]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpiof.so(piolib_mod_mp_init_intracom_+0x70) [0x1529a59afdc0]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x4afc203]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x43648e]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x20a73bf]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x20a7338]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x20a79a9]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(print_trace+0x29) [0x14599f02db19]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(pio_err+0xab) [0x14599f02dacb]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(PIOc_Init_Intracomm+0x497) [0x14599f0301c7]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpioc.so(PIOc_Init_Intracomm_from_F90+0x34) [0x14599f02fcc4]
/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/parallelio-2.6.2-3x6zvih/lib/libpiof.so(piolib_mod_mp_init_intracom_+0x70) [0x14599f28adc0]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x4afc203]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x43648e]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x20a73bf]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x20a7338]
/scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/access-om3-MOM6-CICE6() [0x20a79a9]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 18 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gadi-cpu-clx-1844.gadi.nci.org.au:18073] PMIX ERROR: UNREACHABLE in file /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
[gadi-cpu-clx-1844.gadi.nci.org.au:18073] PMIX ERROR: UNREACHABLE in file /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198

I am checking this issue to determine whether it's caused by differences in components or build methods.

@minghangli-uni minghangli-uni self-assigned this May 3, 2024
@anton-seaice
Contributor

Thanks - can you share the nuopc.runconfig?

cd: /scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-6e4a05f5/: No such file or directory :(

@minghangli-uni
Contributor Author

Sorry, I removed that directory. Can you please check this one instead: /scratch/tm70/ml0072/access-om3/work/COSIMA_MOM6-CICE6-candelete-0ee9c9a9?

@anton-seaice
Contributor

Thanks.

In this section:

ICE_modelio::
     diro = ./log
     logfile = ice.log
     pio_async_interface = .false.
     pio_netcdf_format = nothing
     pio_numiotasks = 5
     pio_rearranger = 1
     pio_root = 1
     pio_stride = 48
     pio_typename = netcdf4p

These settings mean that PIO is trying to request a PE that doesn't exist. I'll try to explain!

The number of ice PIO tasks is 5, and the first IO core is PE 1 (pio_root). (Note that master_task is PE 0 in this case!)

But then pio_stride = 48, which means it will request the second PIO core as PE 49, the third as PE 97, the fourth as PE 145, and the fifth as PE 193.

Obviously PE 193 doesn't exist in this run, so you get an error. Either reduce pio_numiotasks or reduce pio_stride (see the sketch below).

(It's annoying that you don't get a clear error; I think that comes down to how the dependencies are written (and to it being C called from Fortran) rather than how the models are written.)
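
To make the arithmetic concrete, here is a minimal standalone sketch (illustrative only, not the actual PIO source) of where those settings place the IO tasks:

     program pio_task_layout
       ! Illustrative sketch: reproduces the IO-task placement implied by
       ! pio_root = 1, pio_numiotasks = 5, pio_stride = 48 from the config above.
       implicit none
       integer, parameter :: pio_root = 1, pio_numiotasks = 5, pio_stride = 48
       integer :: i
       do i = 0, pio_numiotasks - 1
          print *, 'IO task', i, 'is placed on PE', pio_root + i*pio_stride
       end do
       ! Prints PEs 1, 49, 97, 145, 193; the last one does not exist in this
       ! run, which is what makes PIOc_Init_Intracomm abort.
     end program pio_task_layout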

@anton-seaice
Contributor

I guess we should update the CICE nuopc driver to check that these settings look reasonable during model initialisation, along the lines of the sketch below.
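
A hypothetical sketch of such a guard; the routine and argument names (check_pio_layout, npes) are assumptions for illustration, not the actual driver variables:

     ! Hypothetical guard the cap could call during init; names are
     ! assumptions, not actual CICE code.
     subroutine check_pio_layout(pio_root, pio_numiotasks, pio_stride, npes)
       implicit none
       integer, intent(in) :: pio_root, pio_numiotasks, pio_stride, npes
       integer :: last_io_pe
       ! PE of the last requested IO task under the strided layout
       last_io_pe = pio_root + (pio_numiotasks - 1) * pio_stride
       if (last_io_pe > npes - 1) then
          write(*,'(a,i0,a,i0)') 'ERROR: last PIO task would be PE ', &
               last_io_pe, ' but the component only has PEs 0..', npes - 1
          stop 1
       end if
     end subroutine check_pio_layout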

@minghangli-uni
Contributor Author

Thank you @anton-seaice. It is fixed now. I didn't realise this option had been changed, as the error did not directly point to it :(

At this stage, I suggest setting pio_numiotasks = 1 and pio_root = 0 for the current 0.25deg config. This should help prevent other users from encountering this issue. I will close this issue and open a new one for the PIO settings.
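
For reference, that would change the ICE_modelio block quoted above to (other entries unchanged):

ICE_modelio::
     pio_numiotasks = 1
     pio_root = 0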
