Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Why does my process always use only around 50% of VRAM?

As the title says, is there any way to increase it to 90%?
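Before tuning anything, it can help to confirm how much VRAM is actually in use. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH inside the pod. The helper names `parse_memory_csv` and `gpu_memory_usage` are hypothetical. The sketch only reports usage; it does not change how much memory a framework allocates.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_memory_csv(text):
    """Parse 'used, total' lines (MiB) into a list of (used, total, fraction)."""
    rows = []
    for line in text.strip().splitlines():
        used_str, total_str = (part.strip() for part in line.split(","))
        used, total = int(used_str), int(total_str)
        rows.append((used, total, used / total))
    return rows

def gpu_memory_usage():
    """Run nvidia-smi and return per-GPU usage; needs an NVIDIA driver (assumption)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling `gpu_memory_usage()` inside the pod returns one tuple per GPU, so a reading near 0.5 while the job runs would confirm the roughly 50% figure from the question.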

Need urgent help. My important data is stuck in a pod that currently has an alert issued on it.

I need the disk data from the tame_black_salmon pod. Please help. An alert is currently issued on it, and I cannot log in to the pod.

IS ANYONE'S POD STILL UNUSABLE? IT HAS BEEN 18 HRS FOR ME

I’ll be straight — I’m brokie dawg, both in time and money, and it’s been 18 hours staring at a grey screen interface. It began with some lag yesterday, and then suddenly everything went down. Restarting the pod doesn’t resolve the issue. I even tried creating another pod with different GPUs and settings… same issue. Three attempts so far....

Fixing CPU bottlenecks on the EPYC 9354

Tips first: redeploying can sort out the box and the hypervisor issue for a short period of time. Hardware concerns and noisy neighbors are all worth understanding when you need to get out of a runaway or sluggish pod. Thanks to Runpod for handling that infrastructure so seamlessly. I even put together a whole patch (2daysrate) when the answer was right in front of me: just redeploy (15-20m). Still, what can we discuss about fixing this specific chip, which always seems to be the squeaky wheel? I don't know whether other users can help while crediting the team. We're a varied bunch 😎. It's a hiccup I wouldn't want to open up, given the steady fixes and workarounds already being applied automatically (I assume painstakingly playing whack-a-mole), but any guidance would be stellar....

switch jupyter notebook kernel to conda environment

I have a conda environment I created that I want to be able to set as a notebook kernel, but I don't see this option. I already installed ipykernel inside my conda env and registered it as a jupyter kernel.
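One way to narrow this down is to check whether any registered kernel actually launches the interpreter inside the conda environment. Below is a minimal sketch that parses the output of `jupyter kernelspec list --json`. The helper names and the example prefix `/opt/conda/envs/myenv` are hypothetical, and it assumes `jupyter` is on the PATH of the environment that runs the notebook server.

```python
import json
import subprocess

def kernels_for_env(kernelspec_json, env_prefix):
    """Return names of kernels whose launch command lives under env_prefix."""
    specs = json.loads(kernelspec_json)["kernelspecs"]
    prefix = env_prefix.rstrip("/") + "/"
    matches = []
    for name, info in specs.items():
        argv = info["spec"].get("argv", [])
        # argv[0] is the interpreter the kernel is started with.
        if argv and argv[0].startswith(prefix):
            matches.append(name)
    return sorted(matches)

def list_kernelspecs_json():
    """Ask Jupyter for its registered kernels as a JSON string."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

If `kernels_for_env(list_kernelspecs_json(), "/opt/conda/envs/myenv")` comes back empty, the notebook server does not see a kernel bound to that environment, which would be consistent with the option not appearing in the kernel picker.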

Fixing I/O Issues & Note To Runpod Support

MOST IMPORTANTLY: if you're struggling with I/O, particularly with ComfyUI templates (that's where my experience is), accessing your ports via TCP instead of HTTP completely alleviated it for me. You can edit your template and move all the ports to TCP, and voilà, you'll have completely normal, super stable speed. Please do not dismiss or gaslight us when we point out unusable levels of service degradation. I know it's annoying because of the false flags, but give users the benefit of the doubt, especially if they have substantial amounts of usage logged. One of the most frustrating things is putting in a ticket and getting a message that, because none of their telemetry indicated a spike, everything is fine and you should restart your pod or spend your own credits doing analysis and research. Maybe your telemetry is incomplete, or you haven't dug deep enough. Getting brushed aside and struggle-busing with severely degraded performance when you rely on a 99%+ service level is not fun....

Getting root access to install tools like NVIDIA Nsight Systems

Hey, I wanted to check whether it's possible to set up a pod so that, on spin-up, tools like NVIDIA's Nsight Systems (which require root access to run profiling) are installed and ready to go. Or is running such tools on Runpod's systems not possible conceptually because of the way pods are set up? Thanks!...

How to sync 2 network volumes across regions faster using Global Networking or S3?

I have 2 Network Volumes in 2 regions, both allowing S3 and Global Networking. I tried using a Pod with one Volume mounted and connecting over SSH to the other one to run rsync, but it was slow. Is there a faster solution I should consider? Thx!...

JupyterLab and ComfyUI - Stable

Hey guys, looks like most of the major issues are resolved, including loading of JupyterLab and ComfyUI, even via the standard Cloudflare environment. I'm running my pod on EU-RO-1 from Sydney, Australia, so probably the furthest you can get from the EU. I just got off a remote call with their support manager, and we tested my setup together. Everything started up quickly and I was able to run my Comfy workflow in under 10 mins. Hope you also have the same good experience with Runpod. I have to say their support team is getting 10/10 from me....

Runpod can't access Hugging Face endpoint

Hi, my pods in location EU-RO-1 have a problem accessing the endpoint below. Why? 2025-10-09T13:16:53.085995Z WARN Reqwest(reqwest::Error { kind: Request, url: "https://cas-server.xethub.hf.co/reconstructions/b5de55e781fd93b7d472c8ca3f7d40d870a3be764f8bb1a9c4a0511c55f9ca5b", source: hyper_util::client::legacy::Error(Connect, ConnectError("tcp connect error", 52.205.151.89:443, Os { code: 110, kind: TimedOut, message: "Connection timed out" })) }). Retrying... at /home/runner/work/xet-core/xet-core/cas_client/src/http_client.rs:233...
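The log shows a TCP connect timeout (OS error 110) rather than an HTTP or TLS failure, so a plain socket test from inside the pod can reproduce it without the download client. Below is a minimal sketch using only the standard library. The helper name `tcp_check` and the 5-second default timeout are assumptions for illustration.

```python
import socket

def tcp_check(host, port, timeout=5.0):
    """Try a TCP connect; return (True, None) or (False, error description)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False, f"{type(exc).__name__}: {exc}"
```

Running `tcp_check("cas-server.xethub.hf.co", 443)` from the affected pod and from another machine gives a quick comparison. If the check fails only from the pod, that points at the network path from the pod rather than at the endpoint itself.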

My pod is not able to connect to Postgres anymore. It was working fine until yesterday

Postgres is fine; I am able to connect from any other machine. Also, as of today, my SSH connection (the "No support for SCP & SFTP" one) suddenly stopped working for me too. I literally didn't change anything.

How to persist a volume for a pod?

If I terminate the pod and launch a new one, I should be able to use the same volume data with the new instance.

H100 generation issue and lora downloads failing

At some point today all H100 SXMs began failing to generate video on CUDA 12.8. They only output static or black videos. We have troubleshot this by trying various templates and workflows as well as different servers and pods, and the issue still persists, which strongly suggests the issue is on Runpod's side. Also, as of 1 hour ago, a large number of LoRA downloads from Civitai are reported as failed in the startup logs. Civitai is still up and running, so again this would suggest an issue on Runpod's side,...

data from my network volume gone!

I have a network volume in EU-CZ-1, and I've created several pods using this volume. It got mounted correctly, but all my data is gone! WTF??

Network volume not persisting, help!

I create storage in Runpod, hit deploy pod with volume, pick a 5090 GPU, and pick a ComfyUI template. I don't change anything else. I run it and everything's good. Then I terminate the pod, and all new workflows, all new LoRAs, and outputs are gone from the drive. Jupyter doesn't show them at all. What is happening?...

Unable to start a pod with GPU

I've been really patient (4 days) but I am unable to start my pod with a GPU (A40 in CA-MTL-1 secure cloud). When I try to spawn my pod, it does not even show the modal that says 0 GPUs are available and that I can run it in CPU mode; it shows a different modal with CPU-only mode. It does not look like all A40s are rented (I've been trying at different times of day for 4 days); it looks like they have been removed from the hosting center. Also, when I try to start a pod with my network volume, I keep getting this error over and over again: error creating container: cant create container; volume must exist create container ashleykza/stable-diffusion-webui:8.9.13...

Pod connect tab says: No support for SCP & SFTP

Hi! My pod connect tab says "(No support for SCP & SFTP)", but the documentation says we should be using SCP to copy files. I cannot scp files to my pod although port 22 is listed as open. What should I use to copy files to my pod, scp or something else?...

Failed to initialize NVML: Unknown Error

Not sure if I am doing something wrong or what is happening, but every 20-30 min or so the pod restarts, and it seems like I lose connection with the GPU until I restart the pod manually. Running an RTX 5090....

Unable to download the trained LoRA

I trained the LoRA using an RTX 5090 and it is ready to download in AI Toolkit. Why is it taking 3 hours to download my trained LoRA, which is just 211 MB in size? I have tried and waited 3 times for several hours now, but it failed every time at the end of the download. Did I just spend $40 waiting for nothing? How do I solve this?...

Charged overnight for terminated pod

I terminated my pod last night before going to bed. When I came back to Runpod today, my entire balance was gone, and the session logs in the billing section showed that I spent $36 in one session, which is more than 10 hours. The server I was using last night was also showing signs of connection issues. I'm pretty sure something went wrong here and the terminate command never made it to the pod, even though when I clicked it I was sent back to the GPU selection screen.