From 0e67208569645d8236086583864953d5b2090e6d Mon Sep 17 00:00:00 2001
From: Hollow Man <hollowman@opensuse.org>
Date: Fri, 29 Nov 2024 22:40:59 +0200
Subject: [PATCH] Avoid poisoning the process with CUDA calls at import time
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Call `torch.cuda.device_count() > 0` before `torch.cuda.is_available()` to
give priority to the NVML-based availability check, so that executing
`import deepspeed` does not immediately poison the process with CUDA calls.

https://github.com/pytorch/pytorch/blob/v2.5.1/torch/cuda/__init__.py#L120-L124
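
The ordering relies on Python's short-circuit `and`: when the first check
reports zero devices, the second check is never evaluated. Below is a minimal
sketch of that behaviour. It uses hypothetical stand-in callables instead of
the real `torch.cuda` functions so that it runs without PyTorch;
`pick_accelerator` and both stubs are illustrative names, not DeepSpeed code.

```python
def pick_accelerator(device_count, is_available):
    """Return "cuda" only if both checks pass, evaluating them in order."""
    # `and` short-circuits: is_available() is skipped when device_count() == 0.
    if device_count() > 0 and is_available():
        return "cuda"
    return "cpu"


calls = []  # records which stand-in checks actually ran


def no_devices():
    # Stand-in for a device count of zero (assumption for this sketch).
    calls.append("device_count")
    return 0


def runtime_check():
    # Stand-in for the check that would initialize the CUDA runtime.
    calls.append("is_available")
    return True


result = pick_accelerator(no_devices, runtime_check)
print(result, calls)
```

With zero devices reported, only the first stand-in is recorded in `calls`.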

There are 2 reasons to make this change:

Firstly, the CUDA runtime initializes on the first CUDA API call and caches the
device list. If deepspeed is imported accidentally, changing
CUDA_VISIBLE_DEVICES later in the same process therefore has no effect on the
visible devices. The specific case:
https://github.com/OpenRLHF/OpenRLHF/pull/524#issuecomment-2501505023

A demo for reproduction before the fix is applied:

```python
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import deepspeed
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
torch.cuda.set_device('cuda:0')
```
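
The caching behaviour behind this can be modelled without a GPU. The toy class
below is only a stand-in for the CUDA runtime, not its real API: it is assumed
to read `CUDA_VISIBLE_DEVICES` once, on the first call, and to return the
cached list afterwards. `ToyCudaRuntime` and `visible_devices` are hypothetical
names chosen for this sketch.

```python
import os


class ToyCudaRuntime:
    """Toy stand-in for the CUDA runtime (hypothetical, not a real API)."""

    def __init__(self):
        self._devices = None  # filled in on the first call only

    def visible_devices(self):
        # Read the environment once, then keep returning the cached list.
        if self._devices is None:
            raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
            self._devices = [d for d in raw.split(",") if d]
        return self._devices


runtime = ToyCudaRuntime()
os.environ["CUDA_VISIBLE_DEVICES"] = ""
first = runtime.visible_devices()   # "initializes" with no devices
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
second = runtime.visible_devices()  # the later change is not picked up
print(first, second)
```

Both calls return the empty list, mirroring why `torch.cuda.set_device('cuda:0')`
fails in the demo above.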

Secondly, as the PyTorch CUDA notes explain
(https://pytorch.org/docs/stable/notes/cuda.html):

> When assessing the availability of CUDA in a given environment (is_available()),
> PyTorch’s default behavior is to call the CUDA Runtime API method cudaGetDeviceCount.
> Because this call in turn initializes the CUDA Driver API (via cuInit) if it is not
> already initialized, subsequent forks of a process that has run is_available() will
> fail with a CUDA initialization error.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
---
 accelerator/real_accelerator.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/accelerator/real_accelerator.py b/accelerator/real_accelerator.py
index 69e96d285bb8..a6173ac70abd 100644
--- a/accelerator/real_accelerator.py
+++ b/accelerator/real_accelerator.py
@@ -167,7 +167,12 @@ def get_accelerator():
                 import torch
 
                 # Determine if we are on a GPU or x86 CPU with torch.
-                if torch.cuda.is_available():  #ignore-cuda
+                # "torch.cuda.is_available()" provides a stronger guarantee,     #ignore-cuda
+                # ensuring that we are free from CUDA initialization errors.
+                # The "torch.cuda.device_count() > 0" check runs first so that   #ignore-cuda
+                # we avoid making any CUDA calls when no device is available.
+                # For reference: https://github.com/microsoft/DeepSpeed/pull/6810
+                if torch.cuda.device_count() > 0 and torch.cuda.is_available():  #ignore-cuda
                     accelerator_name = "cuda"
                 else:
                     if accel_logger is not None: