Intermittent Pod Issues: CUDA Errors and Pod Unresponsiveness
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image. The pod becomes unresponsive in a way that resembles a crash, but without actually showing as down in the dashboard.
Problem Description:
- When the issue occurs, Jupyter Lab opens but shows no folders/files
- ComfyUI fails to start with CUDA errors (logs below)...
Pod ran out of CPU RAM
model.save_pretrained ... while the weights are still in VRAM... The pod is still running, but completely unresponsive.
Now that you're done laughing at my misfortune, is there anything at all I can do to save those weights? Even enabling some swap would be completely fine... I just want the weights to save to the networked drive...
Pod ID: tybrzp4aphrz3d...
What's the right procedure for creating custom template images
nginx and openssh-server, as they are requirements listed in the GitHub repo. I also copied over
- start.sh from here and added that as the CMD in my custom Dockerfile
- nginx.conf from here to /etc/nginx/nginx.conf as in the official Dockerfiles
start.sh seems to be running properly and my public key is in ~/.ssh/authorized_keys. But,...
Struggling with RunPod: unable to access HTTP server or terminal error.
cliploader error
Uncorrectable ECC error encountered

runpodctl project example doesn't work
runpodctl project dev OR deploy:
- pod created
- console shows: ...
RunPod Deploy Streamlit App
Pod stops working after a day or two:
error creating container: cant create container; volume must exist
SSH over exposed TCP connection refused
NGINX, Uvicorn and FastAPI setup not working

DNS resolution
Axolotl Fine Tune Error (flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol)
flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol
Everything was working normally until yesterday. I'm following the steps in the fine tuning tutorial: https://docs.runpod.io/tutorials/pods/fine-tune-llm-axolotl#using-a-hugging-face-dataset ...
Jupyter bug with checkpoint folder in ComfyUI
Is there an api to sync with Backblaze B2?
How to build on top of runpod dockerfile?
Training AI with a RunPod GPU
RTX 4090 Instances Not Starting Up

hardware graphics acceleration