Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Bricked A40 GPU at 100% utilisation and nvidia-smi error at launch

I’m trying to run a cluster of 4 x A40s for model finetuning. When launching an instance, exactly one of the GPUs is always at 100% utilisation, with nvidia-smi showing an error for it. I’ve tried creating new clusters multiple times, but each time one GPU is broken. Can I somehow avoid this bricked GPU, or replace it? Thanks
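
One possible stopgap, sketched here under assumptions rather than offered as a confirmed fix, is to hide the faulty GPU from the training process with `CUDA_VISIBLE_DEVICES`. The faulty index (2) and the helper name `visible_devices` are hypothetical, and both variables must be set before any CUDA library is initialised.

```python
import os


def visible_devices(total_gpus: int, faulty: set) -> str:
    """Return a CUDA_VISIBLE_DEVICES value listing every GPU index except the faulty ones."""
    healthy = [str(i) for i in range(total_gpus) if i not in faulty]
    return ",".join(healthy)


# Make CUDA enumerate GPUs in PCI bus order so indices match nvidia-smi.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Hypothetical example: GPU 2 of 4 is the one nvidia-smi reports an error for.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(4, {2})
```

This only leaves three usable GPUs; it does not replace the broken one.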

Stuck pod can’t start due to nvidia-smi parse error

Pod ID: 3rfs5gmkcmhusc. The pod repeatedly fails to start with: error creating container: nvidia-smi: parsing output of line 0: failed to parse (pcie.link.gen.max) into int: strconv.Atoi: parsing "": invalid syntax ...
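
The `strconv.Atoi: parsing ""` part of the message indicates that an empty string was handed to an integer parser for the `pcie.link.gen.max` field. Below is a minimal sketch of a tolerant parse of one comma-separated line. The helper names and sample inputs are hypothetical, and this is not Runpod's code.

```python
from typing import List, Optional


def parse_int_field(raw: str) -> Optional[int]:
    """Parse an integer field, returning None for empty or non-numeric values."""
    try:
        return int(raw.strip())
    except ValueError:
        return None


def parse_line(line: str) -> List[Optional[int]]:
    """Split one comma-separated line and parse every field tolerantly."""
    return [parse_int_field(field) for field in line.split(",")]


# Hypothetical sample: the first field is empty, as in the reported error.
print(parse_line(", 16"))
```

A strict parser fails on the empty field, which matches the failure reported at container creation.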

US-TX-3 - Zero GPU availability over the last three days

GPU availability has been non-existent in this region for the last few days. This is particularly frustrating when we have a volume we need there. What's going on? Is this region dead? If so, you really should provide a way to move volumes to another region.

Jupyter lab can't read custom environment

Hello, I have a (probably) easy question. I bought monthly storage so that I can install my libraries on it. However, when I launch a GPU pod, JupyterLab doesn't recognize the custom environment... Has anyone else had this problem?
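
A common cause, which the post does not confirm, is that the environment was never registered as a Jupyter kernel. The usual route is to run `python -m ipykernel install --user --name <name>` from inside the environment. The sketch below writes an equivalent `kernel.json` by hand. The paths and the helper name `write_kernelspec` are hypothetical, and `ipykernel` is assumed to be installed in the target environment.

```python
import json
from pathlib import Path


def write_kernelspec(kernels_dir: Path, name: str, python_path: str) -> Path:
    """Write a minimal Jupyter kernelspec that launches the given interpreter."""
    spec = {
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": name,
        "language": "python",
    }
    target = kernels_dir / name
    target.mkdir(parents=True, exist_ok=True)
    path = target / "kernel.json"
    path.write_text(json.dumps(spec, indent=2))
    return path


# Hypothetical usage, assuming a virtualenv stored on the network volume:
# write_kernelspec(Path.home() / ".local/share/jupyter/kernels",
#                  "my-env", "/workspace/venv/bin/python")
```

The kernelspec directory sits in the container's home directory, so it may need to be recreated on each new pod unless it is also kept on the volume.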

authorization header is malformed

I'm going crazy: for two hours I've been trying to access my storage via S3. I have my network volume on Runpod, EU-RO-1. I have my S3 API created in my Runpod account...
Solution:
Some stuff such as #Better ComfyUI Slim (5090 supported) from @Elder Papa Madiator has a lot of these things already installed with file browser extension etc. Just be wary that #Better ComfyUI Slim (5090 supported) expects a clean network volume for it to set itself up and not conflict with other stuff

"System Memory per GPU" filter doesn't work for more than a single GPU

Select 100 GB per GPU and 16 vCPUs per GPU. Select the 5090 GPU. Note that the Pod Summary fulfils the criteria. Increase GPU Count to 2. Note that the amount of RAM in the Pod Summary decreases from 116 GB to 108 GB. Change region (to SE in this case). Note that the Pod Summary now shows 232 GB RAM, proving that a pod with 100 GB+ per GPU actually exists.

No secure ssh or jupyter notebook option on rtx 4090s

Hi, there is no longer a secure SSH or Jupyter notebook option on the RTX 4090s. It was working fine a few days ago, and I didn't make any modifications to the template.

ComfyUI F*ck ups...

I am really fed up... I'm spending money for nothing... I installed everything correctly, but workflows don't work and show no drop-down options, and nodes are missing all the time... Even when I upload them correctly they don't appear... What is all that???

"We have detected a critical error..."

Pod id: z7ybjsm3y0uylk. Location: EUR-IS-1. The machine fails to start; in the logs: error creating container: nvidia-smi: parsing output of line 0: failed to parse (pcie.link.gen.max) into int: strconv.Atoi: parsing "": invalid syntax. On pod startup I cannot choose to run it with a CPU only (which might or might not help if I were able to, given that the error refers to nvidia-smi ...). It would be cool if I were at least able to start it with the CPU only to grab some data....
Solution:
the machine boots up now, so I'm marking this as resolved

We have detected a critical error on this machine...

Hi there, I am currently experiencing the following issue on my pod: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime."...

Full performance of an H100

When running on the H100 SXM version, subleased on Prime Intellect (PI), I can't seem to achieve the full performance of an H100. I get at most 760 TFLOPS, i.e. the baseline of the PCIe version. Is there ANY way to get the full 989 TFLOPS? ```py import torch...
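
For reference, matmul throughput figures like these are usually derived from the operation count 2·M·N·K divided by elapsed time. Below is a small sketch of that arithmetic. The function names are hypothetical, and only the 760 and 989 TFLOPS figures come from the post.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for one (m x k) @ (k x n) matmul that took `seconds`."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e12


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Ratio of measured throughput to the quoted peak."""
    return achieved_tflops / peak_tflops


# Figures quoted in the post: 760 TFLOPS measured vs. 989 TFLOPS peak.
print(f"{fraction_of_peak(760, 989):.1%}")  # prints 76.8%
```

Quoted peak figures are theoretical maxima, so a measured result below them is not by itself evidence of a wrong GPU variant.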

What's going on with the pods today?

I've literally been trying to start up a pod since the day started. I've waited many rounds of 30 minutes to an hour and it never loads; the one time it did load, it only directed me to a broken HTTP link that didn't come up for another hour. Is there some sort of maintenance going on? Just so I know whether or not to keep throwing money at it.

Issues Accessing FastAPI /docs and /health via Proxy

I am running a FastAPI application inside my RunPod GPU pod. The application works correctly inside the container — for example: curl http://127.0.0.1:8000/upscale ...

vuturpb023l9iz way oversubscribed making my B200 useless

I have no idea what someone is doing to it but it's making it worthless to me and I am in the middle of inference. Please FIX!!!!!! thank you!

Custom Nodes not loading in RunPod ComfyUI template

"Hello, I am using the runpod/stable-diffusion:comfy-ui-6.0.0 template. I have manually installed ComfyUI-AnimateDiff-Evolved and ComfyUI-VideoHelperSuite into the /workspace/ComfyUI/custom_nodes/ directory. The installation and pip install -r requirements.txt completed without errors. After restarting the pod, the nodes (like AnimateDiff Loader and Video Combine) are still not found in the ComfyUI search, even after a hard refresh and using an incognito window. The server log file seems to indicate that the nodes are being loaded successfully without any 'IMPORT FAILED' messages. This suggests there is a fundamental issue with the template environment. Could you please investigate?"...

Failed to pay with Visa card

Hi, I couldn't pay with my Visa card.
Error: We are unable to authenticate your payment method. Please choose a different payment method and try again....

How to expose TCP port?

Hi, in the pod settings I added HTTP and TCP ports, but I cannot access the port from outside: curl X.X.X.X:XXXX is refused, whereas port 22 returns "not allowed" (so I assume it works). Why??...
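
A refused connection generally means nothing is listening on that address and port from the caller's point of view. Typical causes are a service bound only to 127.0.0.1 inside the container, or an externally mapped port that differs from the internal one. Both are assumptions to verify against the pod's connection details. Below is a minimal reachability check, with a hypothetical helper name:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical usage: substitute the pod's public IP and external port.
# print(port_is_open("X.X.X.X", 12345))
```

Running the same check from inside the pod against 127.0.0.1 and against the public address helps tell a binding problem from a port-mapping problem.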

Seedream Edit does not respect the original proportions of the image as output.

https://fal.ai/models/fal-ai/bytedance/seedream/v4/edit Seedream Edit does not respect the original proportions of the image as output. I currently have it set to "default" in size and it always crops it to square....

Subject: Problems with NVDEC Hardware Video Decoding Access

I'm running a video processing pipeline for sports analytics on RTX 4090 pods and need hardware-accelerated video decoding (NVDEC) for performance reasons.

TECHNICAL DETAILS:
- Pod ID: y50n3mu0w6z3nr
- GPU: RTX 4090
- Current Issue: Getting "cuvidGetDecoderCaps error 100" when trying to use NVDEC...
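
One thing worth checking, as an assumption rather than a confirmed diagnosis for this pod, is whether the container was started with the `video` driver capability. The NVIDIA container runtime reads this from the `NVIDIA_DRIVER_CAPABILITIES` environment variable. The helper name below is hypothetical.

```python
import os


def has_video_capability(value: str) -> bool:
    """Return True if a NVIDIA_DRIVER_CAPABILITIES value enables video decoding."""
    caps = {c.strip() for c in value.split(",") if c.strip()}
    return "all" in caps or "video" in caps


# Inspect the value the current container was started with (empty if unset).
current = os.environ.get("NVIDIA_DRIVER_CAPABILITIES", "")
print(current, has_video_capability(current))
```

If the capability is missing, it has to be set when the container is created; changing the variable inside a running container has no effect on the driver libraries already mounted.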