Runpod•7mo ago
chuunizzz

Intermittent Pod Issues: CUDA Errors and Pod Unresponsiveness

I'm experiencing intermittent but frequent issues with my pod running on the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image. The pod becomes unresponsive in a way that resembles a crash, but without actually showing as down in the dashboard.

Problem Description:
- When the issue occurs, Jupyter Lab opens but shows no folders/files
- ComfyUI fails to start with CUDA errors (logs below)
- Basic commands like nvidia-smi don't work
- Restarting the pod temporarily resolves the issue
- This happens frequently, despite no changes to ComfyUI or plugins

Error logs when trying to run ComfyUI:
[2025-04-05 22:23:38.296] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
The nginx logs also show:
2025/04/05 20:09:35 [error] 322#322: *13468 upstream timed out (110: Unknown error) while connecting to upstream, client: ****, server: _, request: "POST /prompt HTTP/1.1", upstream: "****", host: "****"
This began occurring recently, even though my setup was previously stable. There have been no changes to ComfyUI or its plugins.

Questions:
1. What might be causing this issue? Is it related to CUDA, GPU allocation, or something else?
2. Are there any logs I should check to better diagnose the problem?
3. Is there anything I can do to prevent these failures or make the pod more stable?
4. Is this a known issue with the PyTorch 2.4.0 image?

Any help would be greatly appreciated as this is disrupting my workflow significantly. Thank you!
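For anyone hitting the same symptoms, here is a minimal sketch of a check that could be run inside the pod when it starts misbehaving, to capture what the driver and PyTorch each report. It does not identify the root cause (the thread never establishes one). The function name `check_gpu` and the report keys are hypothetical, chosen only for illustration. It assumes `nvidia-smi` is on the PATH and treats PyTorch as optional.

```python
import shutil
import subprocess


def check_gpu(timeout_s: float = 10.0) -> dict:
    """Collect basic GPU health signals; never raises, always returns strings."""
    report = {"nvidia_smi": "not run", "torch_cuda": "not run"}

    # Driver level: can nvidia-smi talk to the GPU at all?
    exe = shutil.which("nvidia-smi")
    if exe is None:
        report["nvidia_smi"] = "nvidia-smi not found on PATH"
    else:
        try:
            proc = subprocess.run(
                [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            if proc.returncode == 0:
                report["nvidia_smi"] = "ok: " + proc.stdout.strip()
            else:
                detail = (proc.stderr or proc.stdout).strip()
                report["nvidia_smi"] = f"failed (exit {proc.returncode}): {detail}"
        except subprocess.TimeoutExpired:
            # A hang here matches the "basic commands don't work" symptom.
            report["nvidia_smi"] = f"timed out after {timeout_s}s"
        except OSError as exc:
            report["nvidia_smi"] = f"could not execute: {exc}"

    # Framework level: does PyTorch see a CUDA device?
    try:
        import torch

        report["torch_cuda"] = (
            f"available={torch.cuda.is_available()}, "
            f"devices={torch.cuda.device_count()}"
        )
    except ImportError:
        report["torch_cuda"] = "torch not installed"
    except Exception as exc:  # e.g. the RuntimeError from cudaGetDeviceCount()
        report["torch_cuda"] = f"error: {exc}"

    return report


if __name__ == "__main__":
    for key, value in check_gpu().items():
        print(f"{key}: {value}")
```

Saving this output alongside the timestamp of each failure gives support something concrete to look at, and shows whether the problem sits at the driver level (`nvidia-smi` fails or hangs) or only inside PyTorch.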
4 Replies
Unknown User•7mo ago
Message Not Public
chuunizzzOP•7mo ago
No, there might be a misunderstanding. I'm not using serverless; I'm using an on-demand pod with a savings plan. Maybe I should migrate my server to another pod. Would that help?
Unknown User•7mo ago
Message Not Public
chuunizzzOP•7mo ago
weird... after I reported this issue, it never happened again..😂
