updated reader_factory to correct extra folders #2448
The main train.py does not instantiate the ImageFolder dataset correctly when --train-split, --val-split, and --num-classes are specified. This appears to happen only when the data-dir contains folders other than data/train and data/val; in my case the extra folder was data/test.
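For context, the downstream symptom of the mis-built dataset is a class index outside the model's 200 outputs reaching the loss. On CUDA this only surfaces as the opaque device-side assert shown in run 1 below; the same condition reproduces on CPU with a much clearer error. Illustrative snippet only, not code from the repo:

import torch

# 200-class output, matching --num-classes 200 in the runs below
logprobs = torch.randn(4, 200).log_softmax(dim=-1)
# 204 is the kind of label a reader confused by an extra folder could emit;
# it is out of range for a 200-class head
target = torch.tensor([3, 57, 199, 204])

# Mirrors the gather in timm/loss/cross_entropy.py: on CPU this fails
# immediately with an index-out-of-bounds error; on CUDA it becomes the
# ScatterGatherKernel device-side assert seen in run 1.
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))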
---- run 1 (original)
CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200
Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Traceback (most recent call last):
File "/pai/train.py", line 1235, in
main()
File "/pai/train.py", line 888, in main
train_metrics = train_one_epoch(
^^^^^^^^^^^^^^^^
File "/pai/train.py", line 1083, in train_one_epoch
loss = _forward()
^^^^^^^^^^
File "/pai/train.py", line 1051, in _forward
loss = loss_fn(output, target)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pai/timm/loss/cross_entropy.py", line 22, in forward
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [70,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [84,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [23,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [100,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [104,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [116,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [119,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
---- making edit
nano timm/data/readers/reader_factory.py
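The edit aside, the label range the reader produces can be checked without a GPU by building the dataset the same way train.py does and scanning the targets it yields. Hypothetical check, not part of this PR, assuming timm's create_dataset accepts the same root/split arguments train.py passes; it loads every image, so it is slow but unambiguous:

from timm.data import create_dataset

num_classes = 200
# same --data-dir and --train-split as the commands above
ds = create_dataset('', root='data', split='data/train/')

labels = set()
for _, target in ds:  # each item is (image, class index)
    labels.add(int(target))
print(f'distinct labels: {len(labels)}, max label: {max(labels)} (expect < {num_classes})')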
---- run 2
CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200
Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Train: 0 [ 0/390 ( 0%)] Loss: 5.36 (5.36) Time: 1.552s, 164.95/s (1.552s, 164.95/s) LR: 1.000e-05 Data: 0.517 (0.517)