Frequent GPU problem with H100

Hello,
I've seen that 9 times out of 10, I get an H100 (PCIe) machine where CUDA won't work with torch. For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken.
@JM or someone from the RunPod team, can you please look into this? It's happening extremely frequently now.
ID: b6d3hcqct79d7o, runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find image attached. Note that this is a freshly provisioned VM, with NO commands executed but the ones shown in the screenshot.
Screenshot_2024-03-04_at_9.41.33_AM.png
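For context, the first thing worth ruling out on a pod like this is a plain version mismatch: the driver's CUDA version reported by `nvidia-smi` (12.2 here) must be at least as new as the toolkit the bundled torch wheel was built against (12.1.1 in this image). A minimal sketch of that comparison follows; the helper names are hypothetical, and the `>=` rule is a rough approximation of NVIDIA's compatibility policy, not an exhaustive check:

```python
def parse_mm(version: str) -> tuple:
    """Return (major, minor) from a CUDA version string like '12.1.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1]) if len(parts) > 1 else 0

def driver_supports_toolkit(driver_cuda: str, toolkit_cuda: str) -> bool:
    """Rough check: the driver's CUDA version (from nvidia-smi) should be
    >= the toolkit version the torch wheel was built with. This is an
    approximation of NVIDIA's compatibility rules, not a full check."""
    return parse_mm(driver_cuda) >= parse_mm(toolkit_cuda)

# The pod in question: driver reports CUDA 12.2, image bundles torch built
# against CUDA 12.1.1 -- the versions alone look fine, so the breakage is
# likely elsewhere (e.g. a driver/kernel-module mismatch on the host).
print(driver_supports_toolkit("12.2", "12.1.1"))   # True
print(driver_supports_toolkit("11.8", "12.1.1"))   # False
```

Since this pair passes, the failure in the screenshot points at the host itself (driver vs. kernel module), which matches the kernel-version theory in the solution below.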
Solution
@Dhruv Mullick H100 PCIe machines have caused us a lot of headaches lately. We are soon releasing a very powerful detection tool covering all RunPod servers, which will help us fix these non-trivial issues.

It seems the failures always cluster around a specific kernel version that turns out not to be compatible, even though it's supposed to be. That said, expect a solid resolution in the near term!