Bad pods
I have many issues when starting pods. Sometimes the bandwidth is very bad and it takes 1h to spin up; sometimes CUDA in my script can't find a GPU.
Example:
torch.AcceleratorError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Search for `cudaErrorDevicesUnavailable' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
Also, sometimes `runpodctl send` doesn't work at all, so I must find another pod.
root@5907e062b071:/assets/models/loras# runpodctl receive 6035-prize-program-junior-2
securing channel...runpodctl-receive: croc: receive: room (secure channel) not ready, maybe peer disconnected
Other times, when ComfyUI finally boots up, I go to the web interface and the websocket refuses to connect (some pods work, some don't).
I've tried 6 times this morning, and each time it fails at a different point with one of these errors. There are only ~4 x H100 pods that fit my criteria, and I suspect they remain available because they are bad?
I need some guidance here because I'm wasting a lot of money on pods that don't work :/