Some CUDA devices not working in a multi-GPU setting
Hi, I started one pod with 8x H100 PCIe and another with 4x H100 to see if this bug is reproducible.
On the 8x H100 pod, assigning a tensor to device 0 fails immediately.
On the 4x H100 pod, devices 0 and 1 work, but it then fails at device 2. Do you have any idea why this is happening?
I keep wasting my credits because of this bug.
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> for i in range(4): print(torch.randn(10).cuda(i))
...
tensor([ 0.2891, -1.5423, 0.9641, -0.9828, -0.2903, -0.1162, -0.3382, -0.4224,
-1.0990, 0.0097], device='cuda:0')
tensor([-0.0208, -0.8867, -0.9426, -0.0929, -0.2264, -0.2705, 0.0863, -0.0632,
-0.3770, -1.2062], device='cuda:1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
torch.AcceleratorError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
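
In case it helps with triage: a small probe like the sketch below (assuming stock PyTorch on the pod and no CUDA_VISIBLE_DEVICES filtering) can show whether the failure always follows the same physical GPUs. Checking nvidia-smi for ECC errors or leftover processes on the failing devices would also be worth doing.

import torch

print("torch", torch.__version__, "cuda", torch.version.cuda,
      "devices", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    try:
        x = torch.randn(10, device=f"cuda:{i}")  # same allocation that fails in the traceback above
        print(f"cuda:{i} OK      {torch.cuda.get_device_name(i)}  sum={x.sum().item():.4f}")
    except Exception as e:  # catch broadly; the exception class differs across PyTorch versions
        print(f"cuda:{i} FAILED  {type(e).__name__}: {e}")

Running this right after the pod starts, and again after the first failure, would show whether the same device indices are consistently busy or unavailable.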
