Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

torch cuda shows no devices available (B200)

On 8x B200 system lz1ew4cgoiot8f:
```
Python 3.11.11 (main, Dec 4 2024, 08:55:07) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
...
```
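When torch reports no CUDA devices, it can help to collect a few facts before posting: whether torch is installed, which CUDA version the wheel was built for, and whether `CUDA_VISIBLE_DEVICES` is hiding the GPUs. The sketch below is only a generic diagnostic, not the resolution of this thread; the helper name `cuda_visibility_report` is made up for illustration.

```python
import os


def cuda_visibility_report():
    """Collect basic facts about how torch sees the GPUs (hypothetical helper)."""
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; nothing more to check.
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version the torch build targets; None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_visibility_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` is `None`, the installed wheel is a CPU-only build, which would explain an empty device list regardless of the hardware.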

very slow 5090 pod

Hello, this pod (a02462e46395) seems to be terribly slow. I'm trying to install flash_attn and it has been building for more than 30 minutes. Can someone please check?

error starting container: Error response from daemon: failed to create task for container

Received this email at 2AM: There seems to have been a possible issue with the server that one or more of your pods is hosted on. The following pods were impacted. 80pvzctxtwaruc ...

template for 24.04 or 22.04 with VNC?

I'm trying to get a simple Ubuntu 24.04 or 22.04 desktop environment with web VNC working and I'm struggling. The default desktop template from RunPod is focal, which is absolutely ancient; the programs I want to run do not support it. I tried upgrading the distro, but plenty of things broke. Would appreciate any help with getting a more modern version of Ubuntu with VNC working! Thank you ❤️...

Why can't the connect button be clicked anymore?

The status is not displayed and I can't do anything except restart.

Pod loses connection/reconnects periodically

Hey everyone, I've noticed that my Secure Cloud pod drops its connection and reconnects after a couple of minutes. Not sure what causes that; could it be related to server load? It's handling a lot of requests, but I'm not sure why that would cause the server to disconnect. The vLLM logs don't display anything unusual. It feels like the proxy is failing and starting again....

Network Volume (Storage) EU-SE-1 with bad performance/latency.

I’ve noticed a significant drop in performance for my pods using the network volume on EU-SE-1 over the past few days. Startup times have increased, and there’s noticeable lag when interacting with Jupyter notebooks or typing commands in the terminal. Applications like ComfyUI are also taking much longer to load. Overall, everything feels slower. This is a recent issue—everything was running smoothly for several weeks prior, and I haven’t made any changes to my setup. Occasionally, ComfyUI or Jupyter won’t load at all and just display a blank white screen, even though they are running on the correct ports....

Network drive issues

So here's the problem I've been having with RunPod (and that may lead to me giving up on it): I have a network drive and use it to install pods. I use ComfyUI, so I usually use a pre-install template to get started, then add nodesets and models according to the workflow. So far so good. The problem is that when I stop the pod to preserve the downloaded models, LoRAs, nodes, etc. I've installed, the data is preserved on the network drive. But when I next come to start up the pod again, the drive no longer has access to a GPU on that server and no RTX 4090s (for example) are available. Sometimes no GPUs are available at all on the server hosting my network drive....

Kohya service not starting. Error: No API Found

I just booted a pod back up and I get this error. I can't do any training.

RunPod + ComfyUI Setup Help (Torch 2.4)

Hi all! I need some help setting up my RunPod environment using this template: ComfyUI with Manager inst. Permanent Disk torch2.4 I’m using Network Storage. With the old Torch 2.2 template, I just ran ./run_gpu.sh and everything worked fine. Now with Torch 2.4, I get errors about missing modules like aiohttp, safetensors, etc. Even if I install them, they disappear when I shut down and restart the pod....

Degraded performance/latency with network volume on EU-SE-1

Performance on my pods deployed using my network volume on EU-SE-1 has been significantly worse in the last couple of days. Everything takes longer to start up/there is noticeable lag when clicking through Jupyter notebook/typing commands in terminal. Apps like ComfyUI take longer to load etc. Everything is simply slower. Before there were no issues for several weeks and I haven't changed anything about my setup. Sometimes ComfyUI/Jupyter will simply not load (stuck at white blank page) despite them running on the designated ports. Last pod ID I used: 9758dru77pqb02...
Solution:
Performance back to normal

I can't deploy any pod. It creates it but it won't start up

It doesn't matter which region, with or without network storage. I restarted my browser and computer to make sure there weren't any caching issues. See screenshots. There are no logs.
Solution:
I know you say doesn't matter which region, but I see EU-RO-1 with network storage which is a datacenter actively being restarted. Typically with UI issues like this, trying through a VPN is the next thing we ask.

Pod Download Speed Much Slower

Hi! I’ve noticed recently that the download speed my pods get is much slower than it used to be. My datasets are stored in an S3 bucket and fetched on the fly by my pod, but the download is now very slow by comparison. It would be great to get some clarity on this!...

High latency of DEBUG level on Jupyter Notebook compared to real-time

Hi. I am currently using RunPod to develop deep learning models for my studies. However, the Jupyter notebook has been so laggy and unresponsive that a run at DEBUG level, which should take 3 hours to complete, took me 8 hours to finish. I feel ripped off. Is this normal?
Solution:
Hello. It seems the issue comes from the output cell during runtime. I had it on DEBUG logging level and it was overloading the system. I changed it to INFO level and turned off my ad blocker and it fixed the issue.
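For anyone hitting the same thing, a minimal sketch of the fix described above using Python's standard `logging` module, assuming the notebook code logs through it. The logger name `training` and the helper `configure_notebook_logging` are placeholders, not something from the original thread.

```python
import logging


def configure_notebook_logging(level=logging.INFO):
    """Set the root logger level so lower-level records never reach the output cell."""
    root = logging.getLogger()
    root.setLevel(level)
    if not root.handlers:
        # Attach a simple stream handler only if nothing is configured yet.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        root.addHandler(handler)
    return root


configure_notebook_logging(logging.INFO)

logger = logging.getLogger("training")  # hypothetical logger name
logger.debug("per-batch detail")        # dropped at INFO level, nothing is rendered
logger.info("epoch finished")           # still shown
```

With the level at INFO, per-step DEBUG messages are filtered out before they are formatted or written, so the notebook front end has far less output to render.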

Registry auth credential not working

Hi all, I have workflow automation that builds a container, adds it to my container registry (ghcr), and then tries to spin up a pod using this container via https://rest.runpod.io/v1/pods. I pass in the ID of the Container Registry Auth credential I created in RunPod for my container registry. While I can use my container registry credentials to pull the container by hand without issue, RunPod can't. The pod's logs show authorization errors. I've triple-checked the container registry auth credential ID, as well as the username (mine, as an org GitHub admin) and the PAT stored there (which has read:packages and read:org permissions). I also tried using _json_key_base64 as the username, as suggested by the AI bot support, but that didn't work....

Unable to Access My Pod to Recover Data

Hello, I received the following message regarding my previous pod: We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We recommend backing up your data and creating a new pod in the meantime....

I can't deploy GPU pod with H200 SXM in US-GA-2.

I attempted to deploy an H200 SXM pod in US-GA-2, but the container is not being created. I’ve noticed that the number of available H200 SXM units keeps fluctuating, so I’m wondering if others are currently able to use them without issue. I'm not sure whether this is a problem specific to my account or if there is a broader issue with the US-GA-2 region or the H200 SXM resource itself. I would appreciate it if you could look into this. Here is what I’ve confirmed so far:...

'Container is not running' issue / bug?

I'm encountering a persistent issue when trying to launch pods, and I'm hoping you can investigate. I'm seeing similar behaviour across different templates and regions. Pods are failing to reach a stable "Running" state. They have started showing a "container is not running" status shortly after creation. There are no application-specific logs generated. This happens even for templates that were previously working. What I've Tried:...
Solution:
Can you check whether any firewall or browser extension might be blocking requests to RunPod's servers?

Cuda Driver Version issues

The error below shows up on various machines. I am using the following base images:
nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04
nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04
nvidia/cuda:12.5.1-cudnn-devel-ubuntu22.04
...
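The actual error is cut off in this preview, so the following is only a guess at a common cause: a CUDA base image generally needs a host driver that supports at least that CUDA version (the "CUDA Version" shown in the `nvidia-smi` header). A rough sketch for checking an image tag against that number; the function names are made up for illustration, and the host value has to be read from the machine by hand.

```python
import re


def cuda_version_from_tag(image):
    """Extract (major, minor) from a tag such as nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04."""
    match = re.search(r":(\d+)\.(\d+)", image)
    if match is None:
        raise ValueError(f"no CUDA version found in image tag: {image!r}")
    return int(match.group(1)), int(match.group(2))


def image_fits_driver(image, driver_max_cuda):
    """True if the image's CUDA version is not newer than what the host driver supports.

    driver_max_cuda is the (major, minor) pair reported by nvidia-smi on the host;
    it is passed in manually here, which is an assumption of this sketch.
    """
    return cuda_version_from_tag(image) <= tuple(driver_max_cuda)


images = [
    "nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04",
    "nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04",
    "nvidia/cuda:12.5.1-cudnn-devel-ubuntu22.04",
]
host_cuda = (12, 6)  # example value only, not taken from the thread
for image in images:
    print(image, "ok" if image_fits_driver(image, host_cuda) else "needs a newer driver")
```

If some machines pass this check and others do not, the mismatch is on the host side rather than in the image.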