RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods-clusters

Intermittent Pod Issues: CUDA Errors and Pod Unresponsiveness

I'm experiencing intermittent but frequent issues with my pod running on the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image. The pod becomes unresponsive in a way that resembles a crash, but without actually showing as down in the dashboard.

Problem Description:
- When the issue occurs, Jupyter Lab opens but shows no folders/files
- ComfyUI fails to start with CUDA errors (logs below)...

Pod ran out of CPU RAM

I somehow managed to run out of RAM (not VRAM, system RAM)... right after a very compute-heavy operation (calculating quantized KV-Cache scales)... while running model.save_pretrained... while the weights are still in VRAM... The pod is still running, but completely unresponsive. Now that you're done laughing at my misfortune, is there anything at all I can do to save those weights? Even enabling some swap would be completely fine... I just want the weights to save to the networked drive... Pod ID: tybrzp4aphrz3d...
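
For future runs, one possible safeguard is to check available system RAM before calling `model.save_pretrained`. The following is only a minimal sketch, assuming a Linux pod where `/proc/meminfo` is readable. `parse_meminfo` and `enough_ram_to_save` are hypothetical helper names, and the 1.2 safety factor is an arbitrary assumption. Inside a container, `/proc/meminfo` may report the host's memory rather than the pod's limit, so treat the result as a rough hint.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Field:" prefix
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def enough_ram_to_save(meminfo_text: str, needed_bytes: int, safety: float = 1.2) -> bool:
    """Return True if MemAvailable covers needed_bytes with some headroom.

    The safety factor is an assumed margin, not a measured value.
    """
    available_kib = parse_meminfo(meminfo_text).get("MemAvailable", 0)
    return available_kib * 1024 >= needed_bytes * safety
```

In a pod this might be called as `enough_ram_to_save(open("/proc/meminfo").read(), needed_bytes)` right before saving, where `needed_bytes` is your own estimate of the checkpoint size.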

What's the right procedure for creating custom template images

I'm trying to make my custom Dockerfiles work with RunPod. I've installed nginx and openssh-server, as they are requirements listed in the GitHub repo. I also copied over:
- start.sh from here, and added that as the CMD in my custom Dockerfile
- nginx.conf from here to /etc/nginx/nginx.conf, as in the official Dockerfiles

start.sh seems to be running properly and my public key is in ~/.ssh/authorized_keys. But,...

Struggling with RunPod: unable to access HTTP server or terminal error.

Hey guys! I posted this same issue some time ago, but I cannot use any of the pods at all when I try to run Kobold AI with Fallen Llama or any other model. The first time I used this I was able to access RunPod normally, but after a few days the issue appeared and I have been dealing with it ever since. Here are my screenshots....

cliploader error

"Error while deserializing header: MetadataIncompleteBuffer" is the error I get whenever I try to run Wan in a pod. It seems to be the result of a corrupt model, but I have re-downloaded and replaced it in the pod 5 times. Still the same error, on every workflow I try. Has anybody else had this issue?...

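As far as I know, this safetensors error usually means the file length does not match what its header declares, which is consistent with an incomplete download or copy. A minimal sketch for checking a file by hand, based on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the tensor bytes); `check_safetensors` is a hypothetical helper name, not part of any library:

```python
import json
import os
import struct


def check_safetensors(path: str) -> str:
    """Return "ok" or a short description of what looks wrong with the file."""
    size = os.path.getsize(path)
    if size < 8:
        return "file shorter than the 8-byte length prefix"
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return "header truncated"
        try:
            header = json.loads(f.read(header_len))
        except ValueError:  # covers invalid UTF-8 and invalid JSON
            return "header is not valid JSON"
    if not isinstance(header, dict):
        return "header is not a JSON object"
    # Largest end offset of any tensor, relative to the start of the data section.
    data_end = max(
        (v["data_offsets"][1] for k, v in header.items() if k != "__metadata__"),
        default=0,
    )
    expected = 8 + header_len + data_end
    if size < expected:
        return "tensor data truncated"
    if size > expected:
        return "unexpected trailing bytes"
    return "ok"
```

If this reports truncation on the copy inside the pod, comparing the file size or checksum against the download source would be a reasonable next step.
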
Uncorrectable ECC error encountered

Recently I'm getting many "Uncorrectable ECC error encountered" errors on H200 and H100 instances (all that I've tried). I always run a GPU health check first; the 4x H200 pods that I've tried usually don't pass it. An 8x H100 instance did pass, but then failed during axolotl fine-tuning with this error. Any ideas why this might be happening all of a sudden?

runpodctl project example doesn't work

Hey, I am trying to reproduce the tutorial from https://docs.runpod.io/runpodctl/projects/get-started. When I run runpodctl project dev or deploy:
- pod is created
- console shows: ...

RunPod Deploy Streamlit App

Hello folks, I have a Streamlit app and I want to use this app from anywhere, like a website. I use a Whisper model for STT, and the Gemma 3 LLM locally with Ollama. I use these models with LangChain and have a web UI with Streamlit. This app is a prototype. What I want to do is serve this app and show it to some people. How do I do it? What should I use? I can be more specific if you need to know more.

Pod stops working after a day or 2

Pod stops working after a day or 2, and I have to terminate it, redeploy it, and upload the models again. This takes up almost half of the day, as the models are large (more than 6 GB).

error creating container: cant create container; volume must exist

"create 20GB network volume create container nisokalizzo1/bid:v8 error creating container: cant create container; volume must exist" Can someone explain this log? It shows that I created a 20GB volume, but on the other hand it says the volume must exist....

SSH over exposed TCP connection refused

I am unable to connect to SSH using TCP... I have restarted the pod multiple times, but with no luck. Using "Connect to your pod using SSH. (No support for SCP & SFTP)" works. I have already reset my public key, but I am running out of things to try. Could someone help me?
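
A quick way to narrow this down is to test whether the exposed TCP port accepts connections at all, independent of SSH keys. A minimal sketch using only the standard library; `tcp_port_open` is a hypothetical helper name, and the host and port would be whatever the pod's connection details show:

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes connection refused, timeouts, unreachable host
        return False
```

If this returns False for the pod's public IP and mapped port, the refusal happens before authentication, so the public key is unlikely to be the cause.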

NGINX, Uvicorn and FastAPI setup not working

I'm going to put as much information here as I can because I'm so lost and hopefully it makes helping me easier. I've got a uvicorn server running and have verified it works when I SSH into the pod.
```
INFO: Started server process [748]...
```

DNS resolution

Attempting to connect to my pod, I'm seeing NameResolutionError: Failed to resolve "us.i.posthog.com" (temporary failure in name resolution). Any suggestions?
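
To confirm whether DNS itself is failing on the machine where the error appears, a small check like the one below can help. It is a minimal sketch using only the standard library; `can_resolve` is a hypothetical helper name.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: the resolver failed; UnicodeError: the name is not a valid hostname
        return False


# Example: can_resolve("us.i.posthog.com")
```

If other hostnames also fail to resolve, the problem is probably the resolver configuration rather than that one host.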

Axolotl Fine Tune Error (flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol)

Hi! I was using axolotl image for fine tuning successfully but now I'm getting this error: flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol Everything was working normally until yesterday. I'm following the steps in the fine tuning tutorial: https://docs.runpod.io/tutorials/pods/fine-tune-llm-axolotl#using-a-hugging-face-dataset...

Jupyter bug with checkpoint folder in comfyui

Hey, so like multiple users mentioned, there is a bug in jupyterlab when it comes to the checkpoints folder inside comfyui, so it becomes not clickable at some point - what is the solution to this or a workaround?

Is there an api to sync with Backblaze B2?

For example, I'd be happy to create network volumes from some objects in Backblaze and vice versa: upload data from network volumes to Backblaze.

How to build on top of runpod dockerfile?

Hi. My Dockerfile is roughly this:
```
FROM runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04
RUN pip install ninja
ENV MAX_JOBS=4...
```

Training AI with a RunPod GPU

Hi, I'm pretty new to all this AI stuff and cloud GPUs, and I'm currently trying to create an AI. I'm trying to train a yolov8xl model with a dataset of about 100k images and 31 classes, and because it's a big project, my GPU either cannot handle it or will be really slow. So, I wanted to use an Nvidia A6000 to train my model, but I really don't understand how it works. I even asked ChatGPT, which told me that I needed to import my dataset into RunPod, but I don't see anything to impo...

RTX 4090 Instances Not Starting Up

Hello, RTX 4090 instances in Secure Cloud do not seem to be starting up properly. Attached is a screenshot of what I see when I try to start up 7 RTX 4090s. Thanks...

hardware graphics acceleration

Hello, I am trying to create an instance on RunPod that provides hardware graphics acceleration (using an NVIDIA GPU) along with a fully functional remote desktop environment. To achieve this, I have tried using the following images: ...