Error running nnunet conda env #380

Open
Dhananjhay opened this issue Feb 25, 2025 · 8 comments
@Dhananjhay
Collaborator

I'm running wet-run tests on lowresMRI data pulled from OSF, and it keeps failing at rule run_inference.

The command I'm running is:

hippunfold lowresMRI/ test-lowresMRI participant --participant-label 01 --modality T1w --cores all --use-conda --output_density 0p5mm 2mm unfoldiso 

Error output on the terminal:

host: AFI-CBS-H-5
Your conda installation is not configured to use strict channel priorities. This is however important for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
[Mon Feb 24 13:38:55 2025]
Error in rule run_inference:
    jobid: 0
    input: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz, /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
    output: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
    log: logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt (check log file(s) for error details)
    conda-env: /local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_
    shell:
        mkdir -p tempmodel tempimg templbl && cp work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz tempimg/temp_0000.nii.gz && tar -xf /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar -C tempmodel && export RESULTS_FOLDER=tempmodel && export nnUNet_n_proc_DA=12 && nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best -tr nnUNetTrainerV2 --disable_tta &> logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt && cp templbl/temp.nii.gz work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.

And this is what the log file says:

nnUNet_raw_data_base is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up.
using model stored in  tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1
This model expects 1 input modalities for each image
Found 1 unique case ids, here are some examples: ['temp']
If they don't look right, make sure to double check your filenames. They must end with _0000.nii.gz etc
number of cases: 1
number of cases that still need to be predicted: 1
emptying cuda cache
loading parameters for folds, None
folds is None so we will automatically look for output folders (not using 'all'!)
found the following folds:  ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4']
using the following model files:  ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4/model_best.model']
Traceback (most recent call last):
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/bin/nnUNet_predict", line 10, in <module>
    sys.exit(main())
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/inference/predict_simple.py", line 219, in main
    predict_from_folder(model_folder_name, input_folder, output_folder, folds, save_npz, num_threads_preprocessing,
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/inference/predict.py", line 658, in predict_from_folder
    return predict_cases(model, list_of_lists[part_id::num_parts], output_files[part_id::num_parts], folds,
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/inference/predict.py", line 184, in predict_cases
    trainer, params = load_model_and_checkpoint_files(model, folds, mixed_precision=mixed_precision,
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/training/model_restore.py", line 147, in load_model_and_checkpoint_files
    all_params = [torch.load(i, map_location=torch.device('cpu')) for i in all_best_model_files]
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/training/model_restore.py", line 147, in <listcomp>
    all_params = [torch.load(i, map_location=torch.device('cpu')) for i in all_best_model_files]
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/torch/serialization.py", line 1470, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray.scalar was not an allowed global by default. Please use `torch.serialization.add_safe_globals([scalar])` or the `torch.serialization.safe_globals([scalar])` context manager to allowlist this global if you trust this class/function.

I'm skeptical that this has anything to do with the nnunet conda env, because there haven't been any changes to the config file and we have previously run successful wet-run tests with the --use-conda flag.
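
For reference, the two remedies the traceback describes would look roughly like this (untested sketch; the checkpoint path is illustrative, and either option should only be used if we trust the model tarball):

    import numpy
    import torch

    ckpt = "fold_0/model_best.model"  # illustrative path into the extracted tarball

    # Option 1: opt out of the PyTorch 2.6 safe loader entirely.
    params = torch.load(ckpt, map_location=torch.device("cpu"), weights_only=False)

    # Option 2: keep weights_only=True but allowlist the numpy global
    # named in the error message.
    with torch.serialization.safe_globals([numpy.core.multiarray.scalar]):
        params = torch.load(ckpt, map_location=torch.device("cpu"))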

@akhanf
Member

akhanf commented Feb 25, 2025

I'm also not sure what's going on, but with #381 we should once again be able to use the container dependencies without conda. Trying that out now to see how it goes...

@akhanf
Member

akhanf commented Feb 25, 2025

One thing to try is to make a conda env based on the exact nnunet version we are using in the container. It is here:

https://github.com/yinglilu/nnUNet/tree/inference_on_cpu_v1.6.6
https://pypi.org/project/nnunet-inference-on-cpu-and-gpu/

If we can get that working, then we at least have a starting point for seeing what changes break things...
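
If you do build such an env, something like this would confirm what actually resolved (the distribution name is taken from the PyPI link above; adjust it if the env installs plain nnunet instead):

    from importlib.metadata import version

    # Print the resolved versions of the packages most likely to matter here.
    for pkg in ("nnunet-inference-on-cpu-and-gpu", "torch", "numpy"):
        print(pkg, version(pkg))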

@Dhananjhay
Collaborator Author

Unfortunately that also didn't work; I ended up with the same error. I've put v1.6.6 on khanlab if you want to check it out.

@akhanf
Member

akhanf commented Feb 26, 2025

Ah, based on the error message it looks like it's using the latest version of pytorch, which is probably what's causing problems.

If you pin the major packages (torch, numpy) based on https://github.com/khanlab/hippunfold_deps/blob/main/pyproject.toml it should hopefully fix the issue.

@akhanf
Member

akhanf commented Feb 26, 2025

Note: you probably don't need an archaic version of torch. Since 2.6 was released last month, I'm thinking that's what broke it, so maybe <2.6 will work.
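
(A quick way to double-check that the bound actually took effect inside the env, assuming the packaging module is available:)

    from packaging.version import Version  # packaging assumed installed
    import torch

    # Confirm torch resolved below the 2.6 release suspected of breaking things.
    assert Version(torch.__version__) < Version("2.6"), torch.__version__
    print("torch", torch.__version__)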

@Dhananjhay
Collaborator Author

Right, it was complaining about using pytorch 2.6+... I'll try that, thanks @akhanf!

@Dhananjhay
Collaborator Author

Unfortunately, the error still persists even after pinning major packages and using an older version of pytorch. This is a snippet from the meta.yaml file:

requirements:
  host:
    - python
    - pip
  run:
    - python
    - pytorch <2.6.0
    - tqdm
    - dicom2nifti
    - scikit-image >=0.14
    - medpy
    - scipy==1.7.1
    - batchgenerators==0.21
    - numpy==1.21.2
    - scikit-learn
    - simpleitk==2.0.2
    - pandas >=1.2.0,<=1.3.0
    - requests
    - nibabel >=3.2.1
    - tifffile
    - matplotlib==3.4.2

Will keep you updated!
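
In case it helps with debugging, the failing step can be reproduced outside snakemake with something like this (the checkpoint path is illustrative, pointing at one fold extracted from the model tarball; the call mirrors the torch.load in nnunet's model_restore.py from the traceback):

    import torch

    # Illustrative path to one fold's checkpoint from the extracted tarball.
    ckpt = ("tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/"
            "nnUNetTrainerV2__nnUNetPlansv2.1/fold_0/model_best.model")

    # Same load that load_model_and_checkpoint_files performs at inference time.
    params = torch.load(ckpt, map_location=torch.device("cpu"))
    print(type(params))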

@akhanf
Member

akhanf commented Feb 27, 2025 via email
