Frequent GPU problem with H100
Hello,
I've seen that 9 times out of 10, I get an H100 (PCIe) machine where CUDA won't work with torch.
For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken.
@JM or someone from the RunPod team, can you please look into this? It's happening extremely frequently now.
ID: b6d3hcqct79d7o, runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find the image attached. Note that this is a freshly provisioned VM, with NO commands executed other than the ones shown in the screenshot.
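For anyone hitting the same thing, this is roughly the check I run on a fresh pod to confirm whether the torch-CUDA integration is working. The `cuda_report` helper name is mine, not part of any RunPod tooling; it just wraps the standard `torch.cuda` calls and degrades gracefully if torch is missing:

```python
def cuda_report():
    """Summarize the torch/CUDA state on this machine.

    Returns a dict so the result is easy to paste into a support ticket.
    ``torch`` is imported lazily and guarded, so the function still runs
    on machines where the image is broken enough that torch is absent.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,          # installed torch version
        "built_cuda": torch.version.cuda,    # CUDA version torch was built against
        "available": torch.cuda.is_available(),  # False is the failure reported here
    }


if __name__ == "__main__":
    print(cuda_report())
```

On a healthy pod with this image you would expect `built_cuda` around `12.1` and `available: True`; the bug report corresponds to `available: False` despite the driver showing CUDA 12.2 in `nvidia-smi`.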

Solution
@Dhruv Mullick H100 PCIe machines have caused us a lot of headaches lately. We are soon releasing a very powerful detection tool covering the totality of RunPod servers, which will help us fix these non-trivial issues.
It seems it's always around a specific kernel version that might not be compatible even though it's supposed to be. That being said, expect a strong resolution in the near term!
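Since the suspected culprit is a kernel/driver version mismatch, it can help to include both in any report. A minimal sketch for collecting that info (the `environment_report` name is mine; it only shells out to `nvidia-smi` when that binary is actually present):

```python
import platform
import shutil
import subprocess


def environment_report():
    """Collect the kernel release and NVIDIA driver/GPU info, if available."""
    info = {"kernel": platform.release()}  # e.g. "5.15.0-91-generic"
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        info["gpu"] = out.stdout.strip()
    else:
        info["gpu"] = "nvidia-smi not found"
    return info


if __name__ == "__main__":
    print(environment_report())
```

Pasting this output alongside the pod ID should make it easier to correlate failures with specific kernel versions.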
