RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods-clusters

my pod deleted

my pod deleted...

Creation of new pods in EU-CZ-1 results in "ssh: connection refused"

I have an existing pod in the same region that works fine, but new pods refuse connections with the same SSH key and settings. I'm using direct connect because the proxy is broken for SSH keys. This is on a 3090 GPU.

Pod networking issues?

I have 8x L40S and 8x RTX 6000 pods that seem to have no internet connectivity. I've been trying to install Python packages hosted on GitHub (via pip) and load models from Torch Hub, but I get the following errors.
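
One way to narrow down a report like this is to test plain TCP reachability from inside the pod before blaming pip or Torch Hub. The following is a minimal sketch using only the Python standard library; the helper name `can_reach` and the hosts in the usage comment are illustrative choices, not something taken from the thread.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and connects; DNS failures,
        # refused connections and timeouts are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage inside the pod (hosts chosen for illustration only):
#   for host in ("github.com", "pypi.org"):
#       print(host, "reachable" if can_reach(host) else "unreachable")
```

If every host comes back unreachable, the problem is more likely the pod's network than the package index or the model host.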

ROCm 6.3

Any ETA on supporting ROCm 6.3 for MI300?

ComfyUI never opens on port 3000

For the last 24 hours, ComfyUI on port 3000 has always failed to load; the browser just shows a constant 'transferring data' message. All logs show everything running and ready as usual.

Unexpected Pod Billing After Failed Deployment

Yesterday, I attempted to deploy a new pod, but after clicking "Deploy," I received an error message along the lines of: "This GPU is no longer available, we couldn't deploy your pod." This happened when the GPUs went down yesterday. Once everything was back up, I checked my account. The pod had not appeared in my "My Pods" list, and I hadn’t been charged — so I assumed the deployment had failed....

Pods are terribly slow

Hey. I usually deploy pods from US-TX3, and everything has become very slow. I have to wait minutes to see Jupyter and/or ComfyUI launch. I also have problems with images not loading. I tried changing the machine, then the server. Still unusable. Any clue? Thanks...

Slow Image previews regardless of pod

Hey, I am experiencing issues with image previews showing up really slowly, no matter what community pod I use. This kind of loading happens (I am using A1111). There is no problem with my internet, and I have not experienced this kind of slowdown before. Pod download speed via API to CivitAI, for example, also seems completely fine....

Issues with SSH in Axolotl Pod

I can't establish an SSH connection with SCP / SFTP. I tried generating new SSH key pairs, I specifically adjusted their permissions, connected with the recommended ssh command for the pod (I also tried starting a separate pod and manually building the command), and double-checked that I copied the public SSH string. I am running an Arch Linux-based host OS, and connected via a terminal SSH command....
Solution:
Check your SSH server (and its config) again using the web terminal.
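
One server-side check that can be run from the web terminal is whether the files sshd reads have permissions it will accept. This is a minimal sketch, assuming the pod runs OpenSSH with its default `StrictModes yes`, under which a home directory, `~/.ssh`, or `authorized_keys` that is writable by group or others causes key authentication to be refused. The helper names are hypothetical, and this only covers one possible cause.

```python
import os
import stat


def writable_by_others(path: str) -> bool:
    """True if `path` is writable by group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def check_ssh_paths(home: str) -> list[str]:
    """Return the paths under `home` that sshd's StrictModes check would likely reject."""
    candidates = [
        home,
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
    # Missing paths are skipped here; a missing authorized_keys is a separate problem.
    return [p for p in candidates if os.path.exists(p) and writable_by_others(p)]


# Example usage from the pod's web terminal:
#   print(check_ssh_paths(os.path.expanduser("~")))
```

An empty list means these permissions are not the culprit, and the sshd process and its config are the next things to look at.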

CUDA device uncorrectable ECC error

I'm using a 5xH100 pod and got uncorrectable ECC errors for devices 1, 2, and 3. Devices 0 and 4 can be used without a problem. It seems the device or the system needs a reboot. Any help on this? I've already submitted a ticket on the website with the pod ID.
Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import torch...

Volume full and deleting files doesn't free up space

My storage volume is full and deleting big files doesn't free up space. What can I do?

Created a pod, and it is not appearing in my list of Pods so I can't view it to turn it off

I created a pod and it does not appear in my list of pods, but I'm being charged for it. Cannot view it to connect or turn it off.
Solution:
Pod console UI is back, seems like no downtime for a GPU Pod

Problem with hanging pod

Hi team, I have an issue with a hanging pod. The GPU somehow crashed and now all processes are hanging. I tried restarting the pod, but it didn't work. I then tried stopping and starting it again, and now the pod won't come up. Please help me with this. This is the pod ID:...

H100 servers having issues?

Hey RunPod folks, is something going on with the H100 secure cloud machines? I first got a number of weird issues on an 8xH100 (SXM) server (cross-GPU links going down randomly? Hard to say what exactly is going on: I get random timeouts in multi-GPU comms after days of work). I tried spinning up a new machine (ID: nyotnwudbsq0mu, ID: 23xahufe1yk33g) but they are stuck loading the Docker images from our private Docker (which works great and which I can access from other RunPod machines). Can someone please have a look?...

vLLM Inconsistently Hangs at NCCL Initialization

Hi, I am trying to run vLLM on 2x A40 GPUs and it will sometimes hang at NCCL initialization. This occurs inconsistently, and sometimes it works fine. But on a pod where it hangs, repeated attempts will always hang... CUDA 12.4.1 python 3.10 vllm 0.7.3...

Issues when restarting stopped pod

For a few days, I've had multiple issues when restarting a stopped pod. It will just hang saying "Container is not running". Once I briefly caught an error in the system console about 'failed to start networking' and 'driver failed to program'. Is that an issue on the RunPod infra? I should note that I'm running the exact same container image over and over again, and if I terminate the one that failed and re-create it from scratch it works every time, but I thought you were supposed to be able to restart a stopped pod? Oh, and I can confirm that the API says the pod was restarted with the GPU attached and is in the 'RUNNING' state, but it has the issue described above....

Need PyTorch 2.5 and 2.6 official Docker image

We need official PyTorch 2.5 and 2.6 Docker images that are safe and can be used for new features such as the mit-han-lab/nunchaku project.

500 Response when creating pod using API

I always get an error when trying to create a pod using the API. The response is always the same: "create pod: There are no instances currently available"
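
If the error reflects temporary capacity (an assumption; the thread does not confirm it), one option is to retry the create call with exponential backoff rather than failing on the first response. This is a generic sketch only: `retry_with_backoff` is a hypothetical helper, and `action` stands in for whatever function wraps your own API request and raises `RuntimeError` on failure. It is not part of any RunPod SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, waiting base_delay * 2**n between failures.

    Re-raises the last RuntimeError if all attempts fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: RuntimeError = RuntimeError("no attempt made")
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError as err:
            last_error = err
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Example usage, where create_pod is your own (hypothetical) wrapper around the API call:
#   pod = retry_with_backoff(create_pod, attempts=5, base_delay=2.0)
```

Retrying only helps if capacity frees up; if the same error persists for a long time, trying a different GPU type or data center is probably the more useful change.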

Misunderstanding what I need to do to rent a GPU with SSD...

Hey guys. Just want to be sure I'm doing this right. I need an RTX 4090 and 300 GB of SSD storage. Should I go to Storage > select the data center I need (EUR-IS-1) ...

Intermittent Pod Issues: CUDA Errors and Pod Unresponsiveness

I'm experiencing intermittent but frequent issues with my pod running on the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image. The pod becomes unresponsive in a way that resembles a crash, but without actually showing as down in the dashboard.
Problem Description:
- When the issue occurs, Jupyter Lab opens but shows no folders/files
- ComfyUI fails to start with CUDA errors (logs below)...