Fix PyTorch RAG tests GPU OOM #16881
Conversation
The documentation is not available anymore as the PR was closed or merged.
Also cc @patil-suraj and @stas00 to see if they have suggestions.
Good for me!
What cache is this emptying, since the model is not deleted? I think this is a symptom that there is a memory leak in the model, which would need to be fixed.
From the documentation of torch.cuda.empty_cache and Memory management, my understanding is: this doesn't mean the GPU memory is leaked - PyTorch still controls it, but other applications (say, the TF tests running on the same GPU) cannot use it. Of course, the memory occupied by currently live tensors won't be released.
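A minimal sketch (not code from this PR) of the behavior described above, assuming a CUDA-enabled PyTorch install: a live tensor stays allocated through empty_cache(), and only after its last reference is dropped can the cached block be returned to the driver.

```python
import gc

import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    # memory_allocated: bytes held by live tensors; memory_reserved: bytes held
    # by PyTorch's caching allocator (which other processes cannot use).
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

    torch.cuda.empty_cache()  # x is still referenced, so nothing is freed
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

    del x                     # drop the last reference
    gc.collect()              # make sure Python actually collects it
    torch.cuda.empty_cache()  # now the cached block can go back to the driver
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```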
That caching behavior matters if you're trying to run multiple programs concurrently on the same GPU (regardless of whether they are all PyTorch or a mix of frameworks). But why, when the PyTorch test finishes, does torch still hold allocated tensors? Why not free them? Often it's a sign that something is still holding references to them.
My words in the previous comment might be a bit confusing: we don't have the issue of an actual memory leak here.
There are some discussions, like this one.
In such a case the solution is to not run the program inside the same process. I created a special framework for running external programs: transformers/src/transformers/testing_utils.py, line 1536 in 6d90d76.
You can see it extensively used in the deepspeed and extended tests.
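For illustration only (this is not the API of the helper in testing_utils.py, only the isolation idea): running the GPU-heavy part in a child process guarantees its whole CUDA context, including all cached memory, is torn down when the process exits.

```python
# Hypothetical sketch: run a GPU-heavy test file in a child process so that all
# of its CUDA memory is released when the process exits.
import subprocess
import sys


def run_test_in_subprocess(test_file: str) -> None:
    # Launch pytest on the given file in a separate Python process.
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-x", test_file],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"subprocess test failed:\n{result.stdout}\n{result.stderr}")
```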
Yeah, I know this approach, but I wasn't very sure how to use it in a good way with testing. By the way: I am really surprised (and not very happy) that there is no way to free these kinds of memory allocations.
Chances are that there was no need for that until now and nobody asked for it. If I may propose, you could create a feature request at PyTorch asking for a feature that releases the CUDA env completely. It's very possible that there is a C++ API to do that and it just wasn't made available in Python. The use case can be, for example, this exact situation, where the same program needs to alternate between different frameworks in the same run and needs to be able to access all of the GPU's memory. Does TF give the memory fully back when it's done and the process is still running?
If the goal is to recover as much memory as possible, shouldn't we delete the model before calling torch.cuda.empty_cache()?
There have been some requests on the PyTorch GH page, without response: pytorch/pytorch#28829. The same situation holds for TF: it does not fully give back GPU memory, and the requests there also go without response.
@sgugger You are right! I tried it and it is just like you said. Maybe I can just implement the cleanup in tearDown(). I am going to try this and see how it goes.
Note that the memory won't actually be released until the model itself has been freed.
Implement tearDown() and call gc.collect() before torch.cuda.empty_cache().
OK, I can do that. But I feel that while we are still inside the PyTorch tests themselves, we don't need to call this between every test. And since the TF tests are in other modules, tearDownModule() should be enough. But again, I can go for tearDown().
Of course, we are discussing this particular test; I wasn't suggesting to do it for all tests. The reason I suggested tearDown() is that it cleans up after every test, not just once at the end of the module.
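For concreteness, a hedged sketch of both hooks (class name and placement are illustrative, not necessarily the exact code merged in this PR): gc.collect() runs first so anything that went out of scope is really freed, then empty_cache() returns the cached blocks.

```python
import gc
import unittest

import torch


def cleanup_gpu_memory():
    # Collect Python garbage first so dropped models are really gone,
    # then ask PyTorch to return its cached CUDA blocks to the driver.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


class PtRagTestExample(unittest.TestCase):  # placeholder class name
    def tearDown(self):
        super().tearDown()
        cleanup_gpu_memory()  # after every test in this class


def tearDownModule():
    cleanup_gpu_memory()      # once, after all tests in this module
```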
Unrelated to this PR, but since you work a lot with tests (thank you!), in case you're not aware of it, a while ago I developed an extended test-case helper in testing_utils.py, which extends unittest.TestCase with extra features. You don't need to do anything about it, other than perhaps skim through what it provides. I hope it'll save you time in the future.
Thank you, @stas00. Maybe I can play with it, and at some point have a discussion with other team members to see whether to use it by default!
And of course please feel free to extend it if there are other features that can be re-used.
Would like to have @sgugger's and/or @LysandreJik's opinion before merging :-)
LGTM, thanks for iterating!
Merging now - we should have 13 fewer test failures (if this PR also works well on multiple GPUs).
* add torch.cuda.empty_cache in some PT RAG tests
* torch.cuda.empty_cache in tearDownModule()
* tearDown()
* add gc.collect()

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
Fix PyTorch RAG tests GPU OOM.
The GPU OOM can be seen in https://github.com/huggingface/transformers/runs/6100697349?check_suite_focus=true
Results
About 9.5 GB of GPU memory is occupied, and 10 TF RAG tests failed.