Frequent GPU problem with H100
Hello,
I've seen that 9 times out of 10, I get an H100 (PCIe) machine where CUDA won't work with torch.
For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken.
@JM or someone from the RunPod team, can you please look into this? It's happening extremely frequently now.
ID: b6d3hcqct79d7o, runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find the image attached. Note that this is a freshly provisioned VM, with NO commands executed other than the ones shown in the screenshot.
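For anyone hitting the same thing, this is roughly the check I run on a fresh pod to confirm whether the torch-CUDA integration is working. The `cuda_report` helper name is mine, not part of any RunPod tooling; it just wraps the standard `torch.cuda` calls and degrades gracefully if torch is missing:

```python
def cuda_report():
    """Summarize the torch/CUDA state on this machine.

    Returns a dict so the result is easy to paste into a support ticket.
    ``torch`` is imported lazily and guarded, so the function still runs
    on machines where the image is broken enough that torch is absent.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,          # installed torch version
        "built_cuda": torch.version.cuda,    # CUDA version torch was built against
        "available": torch.cuda.is_available(),  # False is the failure reported here
    }


if __name__ == "__main__":
    print(cuda_report())
```

On a healthy pod with this image you would expect `built_cuda` around `12.1` and `available: True`; the bug report corresponds to `available: False` despite the driver showing CUDA 12.2 in `nvidia-smi`.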

Solution
@Dhruv Mullick H100 PCIe machines have caused us a lot of headaches lately. We are soon releasing a very powerful detection tool covering the totality of RunPod servers, which will help us fix these non-trivial issues.
It seems it's always around a specific kernel version that might not be compatible even though it's supposed to be. That being said, expect a strong resolution in the near term!
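Since the suspected culprit is a kernel/driver version mismatch, it can help to include both in any report. A minimal sketch for collecting that info (the `environment_report` name is mine; it only shells out to `nvidia-smi` when that binary is actually present):

```python
import platform
import shutil
import subprocess


def environment_report():
    """Collect the kernel release and NVIDIA driver/GPU info, if available."""
    info = {"kernel": platform.release()}  # e.g. "5.15.0-91-generic"
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        info["gpu"] = out.stdout.strip()
    else:
        info["gpu"] = "nvidia-smi not found"
    return info


if __name__ == "__main__":
    print(environment_report())
```

Pasting this output alongside the pod ID should make it easier to correlate failures with specific kernel versions.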
