GPUs are unavailable on pod.
Hi guys, I've set up a 4xH100 instance (the default one, for the most part). When the pod is instantiated, the GPUs are not available inside it. (I have a script to validate that.)
Here's the pod ID: c6ghnnsno6fkvu. I'll keep it for a day so you can check it directly.
Here's my script output:
Usually it looks like this:
This is the second time this has happened to me. Last time I had the very same issue with A100 PCIe GPUs.
Recreating the pod helps, but not always (in the A100 case, the same resources were obviously reallocated a few times in a row).
Let me take a look
Oh guys, it's getting worse: a second pod in a row is not seeing the GPU right after instantiation.
k7oi1139oop135
It's very likely we just put you on the exact same server; the issue is usually pretty isolated.
Same thing with 4xA100 instances that I've just created
On your first two Pods you received the same GPUs both times. I'm working on hunting down their actual GPU IDs.
If you have a GPU experiencing this issue, can you run
nvidia-smi -L
Easier for me to just have the GPU IDs; hunting them down is proving difficult lol
Ok, I'll come back with it next time
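For anyone who wants to gather the GPU IDs automatically, here is a minimal sketch that wraps `nvidia-smi -L` and pulls out each GPU's index, name and UUID. It is not the validation script mentioned earlier in the thread (that one was never posted); the function names `parse_gpu_list` and `list_gpus` are made up for illustration, and the parsing assumes the usual `GPU <index>: <name> (UUID: <uuid>)` line format.

```python
import re
import shutil
import subprocess

# Assumed line format of `nvidia-smi -L`: "GPU <index>: <name> (UUID: <uuid>)".
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text):
    """Return (index, name, uuid) tuples parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def list_gpus():
    """Run `nvidia-smi -L`; return [] if the tool is missing or exits with an error."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No GPUs visible in this pod")
    for index, name, uuid in gpus:
        # Tab-separated so the UUIDs are easy to copy into a support message.
        print(f"{index}\t{name}\t{uuid}")
```

If the pod does not see its GPUs at all, the command itself may fail, in which case the sketch only reports that nothing is visible; the pod ID is then still the best thing to share.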
A GPU can become unavailable for a variety of reasons. It usually just so happens that the specific workload a user wants gets put back onto the same machine if they ask for it relatively quickly. I don't think we do anything special to prioritize it; divine intervention, maybe.