OOM in RNN-T training #521
Some people had CUDA OOM errors with CUDA 11.1 in #247. I am not sure which CUDA version you are using. Can you try torch 1.10 + CUDA 10.2?
Thank you very much for your reply. I'll try it.
Are you using your own dataset? Does reducing max duration help?
Yes, I use my own dataset.
What does this mean? Does it mean the batch size is 78476, or does it throw the OOM at the 78476th batch?
It throws the OOM at the 78476th batch.
Could you post more log messages? There must be a …
I see.
Could you copy the code from librispeech to wenetspeech and post the log after the change? If possible, would you mind making a PR with your changes about adding …
Are you sure that you have used …
I notice that your number of tokens is 8007. In my training for wenetspeech, the number of tokens is 5537. Comparing the two numbers, 8007 seems too large; I think it may require more compute memory. If possible, I suggest you try with 5537 tokens. You can get the …
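For a rough sense of why vocabulary size matters, here is a back-of-the-envelope sketch of the full (unpruned) joiner logits tensor, whose size grows linearly with the number of tokens. The batch, frame, and label counts below are illustrative assumptions, not values from this issue; a pruned loss shrinks the label dimension, but the linear dependence on vocabulary size remains.

```python
# Rough estimate of the memory needed for an unpruned RNN-T joiner logits tensor.
# All sizes below are illustrative assumptions, not values taken from this issue.

def joiner_logits_gib(batch: int, frames: int, labels: int, vocab: int,
                      bytes_per_elem: int = 4) -> float:
    """Memory (GiB) of a (batch, frames, labels, vocab) float32 logits tensor."""
    return batch * frames * labels * vocab * bytes_per_elem / 1024 ** 3

# Hypothetical batch: 32 utterances, 400 encoder frames, 60 labels per utterance.
for vocab in (5537, 8007):
    print(vocab, f"{joiner_logits_gib(32, 400, 60, vocab):.1f} GiB")
```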
I think he uses his own dataset, so your tokens might not be suitable for him. I'd rather he dump the problematic batches, so we can debug the code and see whether there are any bugs in our loss.
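A minimal sketch of dumping a problematic batch when the OOM fires (not the project's actual code; `compute_loss` is a placeholder for whatever the recipe calls):

```python
import torch

def train_step(model, batch, batch_idx: int):
    """Run one training step; on CUDA OOM, save the batch for offline inspection."""
    try:
        loss = compute_loss(model, batch)  # placeholder for the recipe's loss call
        loss.backward()
    except RuntimeError as e:
        if "out of memory" in str(e).lower():
            # Keep the offending batch so it can be replayed in isolation later.
            torch.save(batch, f"oom-batch-{batch_idx}.pt")
        raise
```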
It is not a regular OOM issue; it's a bug, because it's trying to allocate nearly 60 GB. He needs to run it in gdb with e.g. …
so we can get a stack trace. It may be the same issue reported in #396.
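Complementing the gdb suggestion, once a batch has been dumped (as in the sketch above) it can also be inspected offline from Python. This is a sketch that assumes the dict layout produced by lhotse's K2SpeechRecognitionDataset; the file name is hypothetical and the actual batch structure may differ in the recipe:

```python
import torch

# Hypothetical file name from the earlier save.
batch = torch.load("oom-batch-78476.pt", map_location="cpu")

features = batch["inputs"]                       # (N, T, C) padded features
num_frames = batch["supervisions"]["num_frames"]

print("feature tensor shape:", tuple(features.shape))
print("longest utterance (frames):", max(int(n) for n in num_frames))
```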
OK, I'll try.
I just looked at the function trim_to_supervisions. The audio in wenetspeech is very long, so it is necessary to trim according to supervisions. However, my data is distributed between 1–15 s, so it should not need trimming, and I still can't determine what the problem is.
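As a sanity check on that assumption, here is a small sketch (assuming lhotse's CutSet API; the manifest path is hypothetical) that looks at the actual duration distribution of the training cuts:

```python
from lhotse import CutSet

# Hypothetical manifest path; substitute the one the recipe actually uses.
cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")

durations = sorted(c.duration for c in cuts)
n = len(durations)
print(f"min/median/max duration: {durations[0]:.1f}s "
      f"{durations[n // 2]:.1f}s {durations[-1]:.1f}s")

# A handful of cuts far beyond 15 s would explain a single huge allocation.
print("cuts longer than 20 s:", sum(d > 20.0 for d in durations))
```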
Could you follow #521 (comment)?
OK, I will try it when my GPU is free. Given my GPU, what is an appropriate max duration to set? 180?
You can try 200 first and use …
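For reference, max duration in these recipes is the total seconds of audio per batch handed to the lhotse sampler, so lowering it directly bounds batch memory. A minimal sketch of how it is used (the sampler class and manifest path are assumptions; the actual recipe wires this up inside its data module):

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path

# max_duration caps the summed duration (in seconds) of the cuts in one batch,
# so it is the main knob for GPU memory; 200 mirrors the suggestion above.
sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)

for cut_batch in sampler:
    print("cuts in first batch:", len(list(cut_batch)))
    break
```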
The problem has been solved. Thank you all for your replies.
Background information: I use the wenetspeech recipe for training. There is no OOM problem in scan_pessimistic_batches_for_oom, but it always OOMs later in training. It does not work even when max_duration is set to 60.
Error information:
RuntimeError: CUDA out of memory. Tried to allocate 50.73 GiB (GPU 0; 31.75 GiB total capacity; 3.75 GiB already allocated; 26.30 GiB free; 4.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Environment: pytorch-1.12, k2-1.17, python-3.7
GPU: V100, 32 GB memory, 8 GPUs
Data: more than 10000 hours, 8000 Hz sampling rate, 40-dimensional features, audio length between 1–15 s
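For completeness, the allocator setting mentioned in the error message above can be tried by exporting PYTORCH_CUDA_ALLOC_CONF before training starts. A minimal sketch follows; note that it only mitigates fragmentation and is unlikely to help when a single allocation of ~50 GiB exceeds the 32 GiB card:

```python
import os

# Must be set before the CUDA caching allocator is initialized,
# i.e. before the first CUDA tensor is created (safest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402

x = torch.zeros(1, device="cuda")  # allocator now uses the setting above
print(torch.cuda.memory_reserved(0))
```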