I have many issues when starting pods. Sometimes bandwidth is very bad and it takes 1h to spin up; sometimes CUDA in my script can't find a GPU.
Example:
torch.AcceleratorError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Search for `cudaErrorDevicesUnavailable' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
Also, sometimes `runpodctl send` doesn't work at all, so I have to find another pod.
Other times, when ComfyUI finally boots up, I go to the web interface and the websocket refuses to connect (some pods work, some don't).
I've tried 6 times this morning, and each time it fails at a different point with one of these errors. There are only ~4 H100 pods that fit my criteria, and I suspect they remain available because they are bad?
I need some guidance here because I'm wasting a lot of money on pods that don't work :/
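For context, the kind of check I could run right after a pod boots, before downloading anything, would look roughly like this. It's only a sketch: it assumes `nvidia-smi` is on the PATH in the pod image and that PyTorch is installed, and the function names are made up for illustration. The first check only tells me the driver lists a GPU; the second tries a tiny allocation, which is where the "busy or unavailable" error would show up.

```python
import subprocess


def driver_sees_gpu(cmd=("nvidia-smi", "-L"), timeout=30):
    """Return True if the command exits cleanly and prints at least one line.

    Assumption: `nvidia-smi -L` is available in the pod image and lists one
    GPU per line. A missing binary, a timeout, or a non-zero exit counts as
    "no GPU visible".
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


def torch_can_allocate():
    """Return True if PyTorch can place a tiny tensor on the default CUDA device.

    Any failure (torch not installed, no CUDA build, device busy or
    unavailable) is reported as False instead of raising.
    """
    try:
        import torch  # imported lazily so the driver check works without it

        torch.zeros(1, device="cuda")
        torch.cuda.synchronize()
        return True
    except Exception:
        return False
```

If either function returns False, I could stop the pod straight away instead of paying for the full setup. This wouldn't catch the slow bandwidth, the `runpodctl send` failures, or the websocket problem, though.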