Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

LLM Not Understanding?

I asked the LLM a simple question and it keeps replying randomly. Why?

ComfyUI serverless

Please help with my issue: in the serverless worker log I get this error: Cannot execute because node FaceDetailer does not exist.", "details": "Node ID '#137' ...

Slower than usual job times

Has anyone else noticed this? It started happening in the last 1-2 days.

Random EGL initialization errors

I have a container that uses OpenGL / EGL (for headless 3D rendering with pyrender). In some workers, everything works as it should, and keeps running fine for days on end. But sometimes, I'll get a new worker where it just doesn't work, and I'll have to keep refreshing/deleting the worker until I get one where it does work. ```...
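When EGL only works on some workers, a fail-fast probe at container startup can help: if the worker can't render, exit immediately so it gets recycled instead of erroring mid-job. A minimal stdlib sketch (assuming a missing or broken libEGL is the culprit, and that `PYOPENGL_PLATFORM` is how the container selects the headless backend):

```python
import ctypes.util
import os

# PyOpenGL must be pointed at EGL *before* any OpenGL/pyrender import for
# headless rendering; setdefault leaves an existing setting alone.
os.environ.setdefault("PYOPENGL_PLATFORM", "egl")

def egl_available() -> bool:
    """Return True if a libEGL shared library can be located on this worker.

    A worker where this returns False is unlikely to render headlessly, so
    the startup script can exit non-zero and let the platform replace it.
    """
    return ctypes.util.find_library("EGL") is not None
```

This only checks that the library exists, not that a context can actually be created, so it is a first-line filter rather than a guarantee.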

prompt_outputs_failed_validation - Serverless workflow broke suddenly without code changes

We made changes to our code 7 days ago and our serverless instance has been working perfectly fine for the past week, but for some reason things broke today. We are getting this error, but we are clueless why this is happening...

Serverless Instance Queue Stalling

Our team has encountered a pretty consistent issue wherein requests to our Serverless endpoints (with the following config) will sit in the queue and not be picked up by available idle or throttled instances. Find the original AI help thread here with more details: https://discord.com/channels/912829806415085598/1392974715567607862 Here are some TL;DR notes:...

Why is test_input.json a must?

I tried to create my own Docker image for serverless, but without test_input.json. When the handler gets to this line: runpod.serverless.start({"handler": handler}) ...
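For context: when the worker runs locally (outside the serverless platform), the RunPod SDK has no request source, so it looks for a test_input.json file next to the handler to feed it one job. A minimal sketch of the two pieces involved (the `prompt` field and the echo logic are illustrative, not from the thread):

```python
import json

# Minimal handler in the shape the RunPod SDK expects: it receives a job
# dict with an "input" key and returns a JSON-serializable result.
def handler(job):
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}

# A matching test_input.json, written here for illustration; locally the
# SDK reads this file and passes its "input" to the handler once.
with open("test_input.json", "w") as f:
    json.dump({"input": {"prompt": "hello"}}, f)

# In the actual worker you would then start the SDK loop:
# import runpod
# runpod.serverless.start({"handler": handler})
```

On the deployed endpoint the file is not used — input comes from the API request — which is why it only feels mandatory during local runs.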

ComfyUI frontend to work with serverless?

Hi guys, is there a fork of https://github.com/Comfy-Org/ComfyUI_frontend that can work with serverless? Or an extension? I want to try serverless, but creating my own frontend for it is a bit too much hassle for now. ...

Containers silently charge users while stuck in infinite CUDA compatibility failure loop

When containers fail to initialize due to CUDA version compatibility issues, they get stuck in an infinite retry loop and that time is charged to the user's account, without proper visibility or error reporting in the UI. Additionally, queued requests for failing containers remain in the queue indefinitely instead of being failed immediately.
Expected behavior:
- Container initialization failures should not incur charges
- CUDA version compatibility errors should be clearly displayed in the UI, or at least in the logs (out of 5 workers in this state I was able to see logs for only 1 of them)...

All nodes become unhealthy?

I didn't change anything, but all nodes became unavailable. Please help.

Is there a way to lock serverless down to a specific IP range/CIDR?

It would be nice to have added security with RunPod (e.g., via Cloudflare) by specifying a range or set of IP addresses that endpoints can communicate with. Is this possible?...

Serverless Github with Huggingface token

How can I pass a Hugging Face token during the build process when deploying a serverless endpoint using the GitHub method, where the Dockerfile requires the token? e.g.: //Dockerfile ......
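If build-time injection turns out not to be available in the GitHub build flow, one common workaround is to not bake the token into the image at all: set it as an endpoint environment variable and read it at runtime when downloading the model. A hedged sketch (`HF_TOKEN` is the variable name huggingface_hub conventionally recognizes; the model name is illustrative):

```python
import os

def get_hf_token() -> str:
    """Read the Hugging Face token from the endpoint's environment.

    Failing loudly here is deliberate: a missing token should stop the
    worker at startup, not surface as a confusing 401 mid-download.
    """
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN environment variable is not set")
    return token

# With huggingface_hub (assumed installed in the image) you would then do:
# from huggingface_hub import snapshot_download
# snapshot_download("org/private-model", token=get_hf_token())
```

This also keeps the token out of image layers, where a build-time secret can otherwise leak.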

Serverless Worker Fails: “cuda>=12.6” Requirement Error on RTX 4090 (with Network Storage)

Hi team, I'm deploying a serverless worker that requires CUDA 12.6 or higher. My endpoint is set to use RTX 4090 GPUs, and I'm using a network volume for model storage (models are not baked into the image). However, when the worker starts, I get this error: ...

Stuck at "loading container image from cache" in a loop for 3 hours

My worker is stuck at:
worker is ready
loading container image from cache
Loaded image ID ...

Low H200 availability for all storage network regions

I'm looking to create an H200 SXM serverless endpoint with persistent storage. However, all the storage network regions have very low H200 SXM GPU availability and won't start any workers. But when I don't mount any volume to the worker, H200 SXM availability shows as very high. So which regions actually have H200 GPUs available? Are there regions that only have GPUs and no storage volumes? For example, when I select the ap-jp-1 region for workers it has high H200 availability, but there seems to be no ap-jp-1 region for storage....

Get TorchCompile working in serverless with Flux/ComfyUI

Has anyone been able to get TorchCompile working in serverless environment when using Flux/ComfyUI? For some reason I keep getting pure noise back after the job is completed. The workflow seems to work fine in a pod but in serverless the TorchCompile node for Flux gives problems....

How do I bust the cache on Serverless builds

I am currently trying to build something but for some reason the layer I edited still gets pulled from cache. Is there any way to purge cache?

Runpod S3 multipart upload error

When I use boto3.client.upload_file for a multipart upload, all parts are successfully uploaded to S3, but I encounter an error: "An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 10): Failed to create final object file". The file size is 2.1 GB, and the storage space (20 GB) and part sizes (200-300 MB) are within limits....
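One workaround sometimes reported for CompleteMultipartUpload failures on S3-compatible stores is to use fewer, larger parts, which boto3 controls through `TransferConfig(multipart_chunksize=...)`. A stdlib helper to pick a chunk size (the `max_parts` and `min_chunk` defaults are illustrative; 2.1 GB is the file size from the report above):

```python
import math

def pick_chunk_size(file_size: int, max_parts: int = 8,
                    min_chunk: int = 64 * 1024 * 1024) -> int:
    """Return a multipart chunk size so the file uploads in at most
    `max_parts` parts, never going below `min_chunk` bytes per part."""
    return max(min_chunk, math.ceil(file_size / max_parts))

# With boto3 you would then pass it through TransferConfig, e.g.:
# from boto3.s3.transfer import TransferConfig
# cfg = TransferConfig(multipart_chunksize=pick_chunk_size(2_100_000_000))
# s3.upload_file("model.bin", "my-bucket", "model.bin", Config=cfg)
```

Whether this actually avoids the InternalError depends on the store's completion limits, so it is a diagnostic lever rather than a confirmed fix.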

serverless payment

We’re seeing unexpected charges on our RunPod account related to Serverless usage. On checking the usage logs:
- Only one endpoint shows a few request hits.
- The tasks on that endpoint typically complete within 10–20 seconds....

Network Volume Attached but not really?

Hi! I have a serverless worker running that is supposed to have a network volume attached. On the overview tab it's displayed, but in the worker metrics it's not shown, and I get out-of-storage errors when pulling a model:
/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py:799: UserWarning: Not enough free disk space to download the file. The expected file size is: 3077.77 MB. The target location /workspace/.hf-cache/hub/models--Menlo--Jan-nano-128k/blobs only has 273.49 MB free disk space.
...
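A quick way to distinguish "volume not actually mounted" from "volume full" is a stdlib free-space check before the download starts, so a missing mount fails loudly at worker startup instead of mid-download. The path and size below are the ones from the warning above; adjust to your setup:

```python
import os
import shutil

def has_free_space(path: str, needed_mb: float) -> bool:
    """Return True if `path` has at least `needed_mb` megabytes free.

    If the network volume is not mounted, `path` falls back to the small
    container disk, and this check fails with a clear number instead of a
    truncated-download error later.
    """
    os.makedirs(path, exist_ok=True)
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= needed_mb
```

For example, `has_free_space("/workspace/.hf-cache", 3078)` before pointing the HF cache at the volume would have caught the 273 MB situation in the warning above immediately.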