Error running nnunet conda env #380

Open
Dhananjhay opened this issue Feb 25, 2025 · 8 comments
@Dhananjhay
Collaborator

I'm running wet-run tests on lowresMRI data pulled from OSF, and it keeps failing at rule run_inference.

The command I'm running is:

hippunfold lowresMRI/ test-lowresMRI participant --participant-label 01 --modality T1w --cores all --use-conda --output_density 0p5mm 2mm unfoldiso 

Error output on the terminal:

host: AFI-CBS-H-5
Your conda installation is not configured to use strict channel priorities. This is however important for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
[Mon Feb 24 13:38:55 2025]
Error in rule run_inference:
    jobid: 0
    input: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz, /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
    output: work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
    log: logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt (check log file(s) for error details)
    conda-env: /local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_
    shell:
        mkdir -p tempmodel tempimg templbl && cp work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-preproc_T1w.nii.gz tempimg/temp_0000.nii.gz && tar -xf /localscratch/.cache/hippunfold/model/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar -C tempmodel && export RESULTS_FOLDER=tempmodel && export nnUNet_n_proc_DA=12 && nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best -tr nnUNetTrainerV2 --disable_tta &> logs/sub-01/sub-01_hemi-L_space-corobl_nnunet.txt && cp templbl/temp.nii.gz work/sub-01/anat/sub-01_hemi-L_space-corobl_desc-nnunet_dseg.nii.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.

And this is what the log file says:

nnUNet_raw_data_base is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up.
using model stored in  tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1
This model expects 1 input modalities for each image
Found 1 unique case ids, here are some examples: ['temp']
If they don't look right, make sure to double check your filenames. They must end with _0000.nii.gz etc
number of cases: 1
number of cases that still need to be predicted: 1
emptying cuda cache
loading parameters for folds, None
folds is None so we will automatically look for output folders (not using 'all'!)
found the following folds:  ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4']
using the following model files:  ['tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_0/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_1/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_2/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_3/model_best.model', 'tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/nnUNetTrainerV2__nnUNetPlansv2.1/fold_4/model_best.model']
Traceback (most recent call last):
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/bin/nnUNet_predict", line 10, in <module>
    sys.exit(main())
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/inference/predict_simple.py", line 219, in main
    predict_from_folder(model_folder_name, input_folder, output_folder, folds, save_npz, num_threads_preprocessing,
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/inference/predict.py", line 658, in predict_from_folder
    return predict_cases(model, list_of_lists[part_id::num_parts], output_files[part_id::num_parts], folds,
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/inference/predict.py", line 184, in predict_cases
    trainer, params = load_model_and_checkpoint_files(model, folds, mixed_precision=mixed_precision,
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/training/model_restore.py", line 147, in load_model_and_checkpoint_files
    all_params = [torch.load(i, map_location=torch.device('cpu')) for i in all_best_model_files]
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/nnunet/training/model_restore.py", line 147, in <listcomp>
    all_params = [torch.load(i, map_location=torch.device('cpu')) for i in all_best_model_files]
  File "/local/scratch/test-lowresMRI/.snakemake/conda/484b42a4a5ef5fb1ce0f34e315e3ecae_/lib/python3.9/site-packages/torch/serialization.py", line 1470, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray.scalar was not an allowed global by default. Please use `torch.serialization.add_safe_globals([scalar])` or the `torch.serialization.safe_globals([scalar])` context manager to allowlist this global if you trust this class/function.

I'm skeptical that this has anything to do with the nnunet conda env, because there haven't been any changes to the config file and we have previously run successful wet-run tests with the --use-conda flag.
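
For reference, the two remedies the traceback describes would look roughly like this (untested sketch; the checkpoint path is illustrative, and either option should only be used if we trust the model tarball):

    import numpy
    import torch

    ckpt = "fold_0/model_best.model"  # illustrative path into the extracted tarball

    # Option 1: opt out of the PyTorch 2.6 safe loader entirely.
    params = torch.load(ckpt, map_location=torch.device("cpu"), weights_only=False)

    # Option 2: keep weights_only=True but allowlist the numpy global
    # named in the error message.
    with torch.serialization.safe_globals([numpy.core.multiarray.scalar]):
        params = torch.load(ckpt, map_location=torch.device("cpu"))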

@akhanf
Member

akhanf commented Feb 25, 2025

I'm also not sure what's going on, but with #381 we should once again be able to use the container dependencies without conda. Trying that out now to see how it goes...

@akhanf
Member

akhanf commented Feb 25, 2025

One thing to try is to make a conda env based on the exact nnunet version we are using in the container. It is here:

https://github.com/yinglilu/nnUNet/tree/inference_on_cpu_v1.6.6
https://pypi.org/project/nnunet-inference-on-cpu-and-gpu/

If we can get that working, then we at least have a starting point for seeing what changes break things...
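
If you do build such an env, something like this would confirm what actually resolved (the distribution name is taken from the PyPI link above; adjust it if the env installs plain nnunet instead):

    from importlib.metadata import version

    # Print the resolved versions of the packages most likely to matter here.
    for pkg in ("nnunet-inference-on-cpu-and-gpu", "torch", "numpy"):
        print(pkg, version(pkg))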

@Dhananjhay
Collaborator Author

Unfortunately that also didn't work; I ended up with the same error. I've put v1.6.6 on khanlab if you want to check it out.

@akhanf
Member

akhanf commented Feb 26, 2025

Ah, based on the error message it looks like it's using the latest version of pytorch, which is probably what's causing problems.

If you pin the major packages (torch, numpy) based on https://github.com/khanlab/hippunfold_deps/blob/main/pyproject.toml it should hopefully fix the issue.

@akhanf
Member

akhanf commented Feb 26, 2025

Note: you probably don't need an archaic version of torch. Since 2.6 was released last month, I'm thinking that's what broke it, so maybe <2.6 will work.
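
(A quick way to double-check that the bound actually took effect inside the env, assuming the packaging module is available:)

    from packaging.version import Version  # packaging assumed installed
    import torch

    # Confirm torch resolved below the 2.6 release suspected of breaking things.
    assert Version(torch.__version__) < Version("2.6"), torch.__version__
    print("torch", torch.__version__)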

@Dhananjhay
Collaborator Author

Right, it was complaining about using pytorch 2.6+... I'll try that, thanks @akhanf!

@Dhananjhay
Collaborator Author

Unfortunately, the error still persists even after pinning major packages and using an older version of pytorch. This is a snippet from the meta.yaml file:

requirements:
  host:
    - python
    - pip
  run:
    - python
    - pytorch <2.6.0
    - tqdm
    - dicom2nifti
    - scikit-image >=0.14
    - medpy
    - scipy==1.7.1
    - batchgenerators==0.21
    - numpy==1.21.2
    - scikit-learn
    - simpleitk==2.0.2
    - pandas >=1.2.0,<=1.3.0
    - requests
    - nibabel >=3.2.1
    - tifffile
    - matplotlib==3.4.2

Will keep you updated!
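
In case it helps with debugging, the failing step can be reproduced outside snakemake with something like this (the checkpoint path is illustrative, pointing at one fold extracted from the model tarball; the call mirrors the torch.load in nnunet's model_restore.py from the traceback):

    import torch

    # Illustrative path to one fold's checkpoint from the extracted tarball.
    ckpt = ("tempmodel/nnUNet/3d_fullres/Task101_hcp1200_T1w/"
            "nnUNetTrainerV2__nnUNetPlansv2.1/fold_0/model_best.model")

    # Same load that load_model_and_checkpoint_files performs at inference time.
    params = torch.load(ckpt, map_location=torch.device("cpu"))
    print(type(params))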

@akhanf
Member

akhanf commented Feb 27, 2025 via email
