RunPod•5mo ago
Dhruv Mullick

Frequent GPU problem with H100

Hello, I've seen that 9 times out of 10 I get an H100 (PCIe) machine where CUDA won't work with torch. For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken. @JM or someone from the RunPod team, can you please look into this, since it's happening extremely frequently now?
Pod ID: b6d3hcqct79d7o
Template: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find a screenshot attached. Note that this is a freshly provisioned VM, with NO commands executed other than the ones shown in the screenshot.
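For context, a check along the following lines is enough to surface the broken torch–CUDA integration described above (a minimal sketch only; the exact commands from the screenshot are not reproduced here):

    import torch

    # On a healthy pod this prints True plus the device name; on the affected
    # H100 PCIe machines the CUDA check fails or returns False.
    print(torch.__version__)             # e.g. 2.2.0+cu121
    print(torch.cuda.is_available())     # expected: True
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        x = torch.randn(8, 8, device="cuda")
        print((x @ x).sum().item())      # trivial kernel launch as a smoke test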
11 Replies
Madiator2011
Madiator2011•5mo ago
@Dhruv Mullick could you give my tool a try? #RunPod GPU Tester (recommended for H100 users)
Dhruv Mullick
Dhruv Mullick•5mo ago
Thanks @Papa Madiator, I've released the H100 for now (to save costs) and provisioned an A100 cluster instead. But I'll post here once I run into the problem again.

Provisioned another one where CUDA doesn't work, and here are the results:

    {
      "PyTorch Version": "2.2.0+cu121",
      "Environment Info": {
        "RUNPOD_POD_ID": "7zb8qedy1qzr0v",
        "Template CUDA_VERSION": "Not Available",
        "NVIDIA_DRIVER_CAPABILITIES": "Not Available",
        "NVIDIA_VISIBLE_DEVICES": "Not Available",
        "NVIDIA_PRODUCT_NAME": "Not Available",
        "RUNPOD_GPU_COUNT": "4",
        "machineId": "krn533olhyna"
      },
      "Host Machine Info": {
        "CUDA Version": "12.2",
        "Driver Version": "535.154.05",
        "GPU Name": "NVIDIA H100 PCIe"
      },
      "CUDA Test Result": {
        "GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
        "GPU 1": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
        "GPU 2": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
        "GPU 3": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
      }
    }

@Papa Madiator
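For reference, per-GPU results of the kind shown in that JSON can be produced with a loop like the one below (an illustrative sketch, not Madiator2011's actual tester script; RUNPOD_GPU_COUNT is the environment variable visible in the report above):

    import os
    import torch

    # Number of GPUs the pod is supposed to expose, falling back to what
    # torch can see if the environment variable is missing.
    gpu_count = int(os.environ.get("RUNPOD_GPU_COUNT", "0")) or torch.cuda.device_count()

    results = {}
    for i in range(gpu_count):
        try:
            # A tiny allocation forces CUDA initialization on that device;
            # broken H100s raise the "CUDA unknown error" quoted above.
            torch.ones(1, device=f"cuda:{i}")
            results[f"GPU {i}"] = "OK"
        except RuntimeError as err:
            results[f"GPU {i}"] = f"Error: {err}"

    print({
        "PyTorch Version": torch.__version__,
        "CUDA Test Result": results,
    })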
Madiator2011
Madiator2011•5mo ago
Next time, try using my tool and share the errors, as they help with debugging. Btw, what template are you using?
Dhruv Mullick
Dhruv Mullick•5mo ago
I'm using: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
I don't think there's any specific one for 12.2?
Madiator2011
Madiator2011•5mo ago
Btw, next time you can just upload the JSON file. Also, once you have the file you can remove the pod, as I saved the machine ID.
Dhruv Mullick
Dhruv Mullick•5mo ago
Perfect, thanks!
Madiator2011
Madiator2011•5mo ago
I made that script to help gather info on broken H100s; trust me, they are problematic.
Madiator2011
Madiator2011•5mo ago
In the meantime, pls enjoy a woman crying over a broken GPU. Btw, feel free to give feedback about my tool.
Solution
JM
JM•5mo ago
@Dhruv Mullick H100 PCIe cards have caused us lots of headaches lately. We are soon releasing a very powerful detection tool covering all RunPod servers, which will help us fix these non-trivial issues. It seems it's always some specific kernel version that isn't compatible even though it's supposed to be. That being said, expect a strong resolution in the near term!
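For anyone debugging this themselves, the two pieces of information that diagnosis hinges on are the host kernel version and the NVIDIA driver version; a quick way to collect both from inside a pod (a sketch relying only on standard tools) is:

    import platform
    import subprocess

    # Containers share the host kernel, so this is the host's kernel version.
    print("Kernel:", platform.release())

    # Driver version and GPU name as reported by the NVIDIA driver.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver / GPU:", out.stdout.strip())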
Dhruv Mullick
Dhruv Mullick•5mo ago
Thank you!! It would be great if an update post could be made once that happens
JM
JM•5mo ago
@Dhruv Mullick I remembered you, sir! 😉 So, we have a very good detection tool in place now, but it's manual. I believe the problem is largely solved for H100s. We will be looking to automate the script next and expand it to all servers on RunPod. In the meantime, do not hesitate to reach out if you have any questions 🙂