RunPod•5mo ago
Dhruv Mullick

Frequent GPU problem with H100

Hello, I've seen that 9 times out of 10 I get an H100 (PCIe) machine where CUDA won't work with torch. For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken. @JM or someone from the RunPod team, can you please look into this, since it's happening extremely frequently now?
Pod ID: b6d3hcqct79d7o
Template: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find a screenshot attached. Note that this is a freshly provisioned VM, with NO commands executed other than the ones shown in the screenshot.
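For context, a check along the following lines is enough to surface the broken torch–CUDA integration described above (a minimal sketch only; the exact commands from the screenshot are not reproduced here):

    import torch

    # On a healthy pod this prints True plus the device name; on the affected
    # H100 PCIe machines the CUDA check fails or returns False.
    print(torch.__version__)             # e.g. 2.2.0+cu121
    print(torch.cuda.is_available())     # expected: True
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        x = torch.randn(8, 8, device="cuda")
        print((x @ x).sum().item())      # trivial kernel launch as a smoke test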
11 Replies
Madiator2011
Madiator2011•5mo ago
@Dhruv Mullick could you give my tool a try? #RunPod GPU Tester (recommended for H100 users)
Dhruv Mullick
Dhruv Mullick•5mo ago
Thanks @Papa Madiator, I've released the H100 for now (to save costs) and provisioned an A100 cluster instead. But I'll post here once I run into the problem again.

Provisioned another one where CUDA doesn't work, and here are the results:

    {
      "PyTorch Version": "2.2.0+cu121",
      "Environment Info": {
        "RUNPOD_POD_ID": "7zb8qedy1qzr0v",
        "Template CUDA_VERSION": "Not Available",
        "NVIDIA_DRIVER_CAPABILITIES": "Not Available",
        "NVIDIA_VISIBLE_DEVICES": "Not Available",
        "NVIDIA_PRODUCT_NAME": "Not Available",
        "RUNPOD_GPU_COUNT": "4",
        "machineId": "krn533olhyna"
      },
      "Host Machine Info": {
        "CUDA Version": "12.2",
        "Driver Version": "535.154.05",
        "GPU Name": "NVIDIA H100 PCIe"
      },
      "CUDA Test Result": {
        "GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
        "GPU 1": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
        "GPU 2": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
        "GPU 3": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
      }
    }

@Papa Madiator
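For reference, per-GPU results of the kind shown in that JSON can be produced with a loop like the one below (an illustrative sketch, not Madiator2011's actual tester script; RUNPOD_GPU_COUNT is the environment variable visible in the report above):

    import os
    import torch

    # Number of GPUs the pod is supposed to expose, falling back to what
    # torch can see if the environment variable is missing.
    gpu_count = int(os.environ.get("RUNPOD_GPU_COUNT", "0")) or torch.cuda.device_count()

    results = {}
    for i in range(gpu_count):
        try:
            # A tiny allocation forces CUDA initialization on that device;
            # broken H100s raise the "CUDA unknown error" quoted above.
            torch.ones(1, device=f"cuda:{i}")
            results[f"GPU {i}"] = "OK"
        except RuntimeError as err:
            results[f"GPU {i}"] = f"Error: {err}"

    print({
        "PyTorch Version": torch.__version__,
        "CUDA Test Result": results,
    })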
Madiator2011
Madiator2011•5mo ago
Next time, try using my tool and share the errors, as they help with debugging. Btw, what template are you using?
Dhruv Mullick
Dhruv Mullick•5mo ago
I'm using: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
I don't think there's any specific one for 12.2?
Madiator2011
Madiator2011•5mo ago
Btw, next time you can just upload the JSON file. Also, once you have the file you can remove the pod, as I saved the machine ID.
Dhruv Mullick
Dhruv Mullick•5mo ago
Perfect, thanks!
Madiator2011
Madiator2011•5mo ago
I made that script to help gather info on broken H100s; trust me, they are problematic.
Madiator2011
Madiator2011•5mo ago
In the meantime, pls enjoy a woman crying over a broken GPU. Btw, feel free to give feedback about my tool.
Solution
JM
JM•5mo ago
@Dhruv Mullick H100 PCIe cards have caused us lots of headaches lately. We are soon releasing a very powerful detection tool covering all RunPod servers, which will help us fix these non-trivial issues. It seems it's always some specific kernel version that isn't compatible even though it's supposed to be. That being said, expect a strong resolution in the near term!
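For anyone debugging this themselves, the two pieces of information that diagnosis hinges on are the host kernel version and the NVIDIA driver version; a quick way to collect both from inside a pod (a sketch relying only on standard tools) is:

    import platform
    import subprocess

    # Containers share the host kernel, so this is the host's kernel version.
    print("Kernel:", platform.release())

    # Driver version and GPU name as reported by the NVIDIA driver.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver / GPU:", out.stdout.strip())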
Dhruv Mullick
Dhruv Mullick•5mo ago
Thank you!! It would be great if an update post could be made once that happens
JM
JM•5mo ago
@Dhruv Mullick I remembered you, sir! 😉 So, we have a very good detection tool in place now, but it's manual. I believe the problem is largely solved for H100s. We will be looking to automate the script next and expand it to all servers on RunPod. In the meantime, do not hesitate to reach out if you have any questions 🙂