GPUs are unavailable on pod.

Hi guys, I've set up a 4xH100 instance (the default one, for the most part). When the pod is instantiated, the GPUs are not available inside it (I have a script to validate that). The pod ID is c6ghnnsno6fkvu. I'll keep it for a day so you can check it. Here's my script output:
[sanitycheck] VISIBLE=all | WORLD_SIZE=1 | NPROC=1 | NGPUS=0 | NAMES=[]
[sanitycheck] VISIBLE=all | WORLD_SIZE=1 | NPROC=1 | NGPUS=0 | NAMES=[]
Usually it looks like this:
[sanitycheck] VISIBLE=all | WORLD_SIZE=1 | NPROC=1 | NGPUS=4 | NAMES=['NVIDIA H100 PCIe', 'NVIDIA H100 PCIe', 'NVIDIA H100 PCIe', 'NVIDIA H100 PCIe']
[sanitycheck] VISIBLE=all | WORLD_SIZE=1 | NPROC=1 | NGPUS=4 | NAMES=['NVIDIA H100 PCIe', 'NVIDIA H100 PCIe', 'NVIDIA H100 PCIe', 'NVIDIA H100 PCIe']
This is the second time this has happened to me. Last time I had the very same issue with A100 PCIe GPUs. Recreating the pod helps, but not always (in the A100 case, the same resources were evidently reallocated a few times in a row).
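For readers who want a similar check, here is a minimal sketch of a script that prints a line in the same format. It is not the original script, which is not shown in this thread. It assumes that `nvidia-smi` is on the PATH, that VISIBLE comes from `CUDA_VISIBLE_DEVICES`, and that WORLD_SIZE and NPROC come from environment variables; the variable name `NPROC` in particular is a guess.

```python
import os
import subprocess


def query_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired):
        # No driver, no devices, or a hung driver: treat as "no GPUs visible".
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def format_sanity_line(names, env=None):
    """Build a log line shaped like the output quoted in the post."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "all")
    world_size = env.get("WORLD_SIZE", "1")
    nproc = env.get("NPROC", "1")  # hypothetical variable name (assumption)
    return (f"[sanitycheck] VISIBLE={visible} | WORLD_SIZE={world_size} | "
            f"NPROC={nproc} | NGPUS={len(names)} | NAMES={names}")


print(format_sanity_line(query_gpu_names()))
```

Running this right after the pod starts, and recreating the pod when NGPUS is lower than expected, avoids launching a job on a pod with no usable GPUs.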
8 Replies
Dj, 3mo ago
Let me take a look
Yaroslav Ya (OP), 3mo ago
Oh guys, it's getting worse: a second pod in a row is not seeing the GPUs right after instantiation. Pod ID: k7oi1139oop135
Dj, 3mo ago
It's very likely we just put you on the exact same server; the issue is usually pretty isolated.
Yaroslav Ya (OP), 3mo ago
Same thing with the 4xA100 instances that I've just created.
Dj, 3mo ago
On your first two Pods you received the same GPUs both times. I'm working on hunting down their actual GPU IDs. If you have a GPU experiencing this issue, can you run nvidia-smi -L? It's easier for me to just have the GPU IDs; hunting them down is proving difficult lol
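A small sketch of how the GPU IDs could be collected from `nvidia-smi -L` in one go. It assumes the usual output shape of one line per device, `GPU <index>: <name> (UUID: <uuid>)`; the helper names are made up for illustration.

```python
import re
import subprocess

# Matches lines such as: GPU 0: NVIDIA H100 PCIe (UUID: GPU-...)
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text):
    """Extract index, name and UUID from `nvidia-smi -L` style output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


def list_gpus():
    """Run `nvidia-smi -L`; return [] if the tool is missing or hangs."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return []
    return parse_gpu_list(result.stdout)


for gpu in list_gpus():
    print(gpu["index"], gpu["name"], gpu["uuid"])
```

If the list comes back empty on an affected pod, the raw output of `nvidia-smi -L`, including any error text, is probably the more useful thing to paste.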
Yaroslav Ya (OP), 3mo ago
OK, I'll come back with it next time.
Unknown User, 3mo ago
Message Not Public
Dj, 3mo ago
The GPU can become unavailable for a variety of reasons. It usually just so happens that the specific workload a user wants is put back onto the same machine if they ask for it relatively fast. I don't think we do anything special to prioritize it; divine intervention, maybe.