RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

First runs always fail

When using a serverless API endpoint (with ComfyUI installed), the first run always fails even though the following ones work fine. This is what the API returns on the first run:...

RunPod GPU Availability: Volume and Serverless Endpoint Compatibility

Hey everyone! Quick question about RunPod's GPU availability across different deployment types. I'm a bit confused about something: I created a volume in a data center where only a few GPU types were available. But when I'm setting up a serverless endpoint, I see I can select configs with up to 8 GPUs - including some that weren't available when I created my volume. Also noticed that GPU availability keeps fluctuating - sometimes showing low availability and sometimes none at all. So I'm wondering:...

How long does it normally take to get a response from your vLLM endpoints on RunPod?

Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod vLLM image, but each job takes 30+ seconds, with 99% of it spent loading the engine and the model (counted as delay time), while the execution itself is under 1s. FlashBoot is on. Is this normal, or is there a setting or something else I should check to make FlashBoot kick in? How long do your models and endpoints normally take to return a response?
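For reference, RunPod job responses report `delayTime` and `executionTime` (both in milliseconds, as seen in the error payload quoted further down this channel). A minimal sketch of splitting the two, assuming those field names; `timing_breakdown` is a hypothetical helper, not part of the SDK:

```python
def timing_breakdown(resp: dict) -> dict:
    """Split a RunPod job response into delay vs. execution time.

    Assumes the response carries `delayTime` and `executionTime`
    fields in milliseconds, as they appear in RunPod job payloads.
    """
    delay = resp.get("delayTime", 0)
    execution = resp.get("executionTime", 0)
    total = delay + execution
    return {
        "total_ms": total,
        "delay_ms": delay,
        "execution_ms": execution,
        "delay_share": delay / total if total else 0.0,
    }

# A cold start like the one described: ~30 s delay, <1 s execution.
breakdown = timing_breakdown({"delayTime": 30000, "executionTime": 800})
```

If `delay_share` stays near 1.0 across many requests even with FlashBoot on, the time is going into engine/model load rather than inference.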

This server has recently suffered a network outage

This server has recently suffered a network outage and may have spotty network connectivity. We aim to restore connectivity soon, but you may have connection issues until it is resolved. You will not be charged during any network downtime.

serverless health

https://api.runpod.ai/v2/adaejhk*****/health While cold starting, the health endpoint never indicates an initializing state? It just goes from idle/ready to running and back. Is there a way to tell that the serverless endpoint is warming up, so the application can show it?...
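One client-side workaround is to classify the worker counts in the `/health` payload yourself. A minimal sketch, assuming the response holds a `workers` object with per-state counts (`initializing`/`running`/`ready`/`idle` — names inferred from the dashboard's worker states, so verify them against a real payload); `endpoint_state` is a hypothetical helper:

```python
def endpoint_state(health: dict) -> str:
    """Map a /health payload to a coarse state for the application UI.

    Assumes `health["workers"]` holds per-state worker counts
    (initializing/running/ready/idle); verify the field names
    against a real /health response before relying on this.
    """
    workers = health.get("workers", {})
    if workers.get("initializing", 0) > 0:
        return "warming-up"
    if workers.get("running", 0) > 0:
        return "running"
    if workers.get("ready", 0) > 0 or workers.get("idle", 0) > 0:
        return "idle"
    return "unknown"
```

Polling this every couple of seconds lets the app show a "warming up" indicator whenever any worker reports initializing.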

Monitoring Queue Runpod

I have had a lot of issues on RunPod these past few days. I'd like to be able to react to them quickly, with a notification when the queue of a given pod is > 5 for X seconds. Is there an easy way to check that?...
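One client-side option is a small poller against the endpoint's `/health`, feeding its queue-depth counter into a threshold-with-hold check. A minimal sketch under those assumptions; `QueueAlarm`, `threshold`, and `hold_s` are hypothetical names:

```python
import time

class QueueAlarm:
    """Fires once the queue depth stays above `threshold` for `hold_s` seconds."""

    def __init__(self, threshold=5, hold_s=60.0):
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_start = None  # when the current breach began, if any

    def update(self, in_queue, now=None):
        """Feed the latest queue depth; returns True when the alarm should fire."""
        now = time.time() if now is None else now
        if in_queue <= self.threshold:
            self._breach_start = None  # breach over, reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = now
        return now - self._breach_start >= self.hold_s
```

Feed it the in-queue count from each `/health` poll; when `update()` returns True, send the notification (email, Slack webhook, etc.).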

Why aren't any A100s or H100s available now? :(

I don't understand: I have reserved about 7 H100s, but right now I can't see any H100s or A100s available anywhere on RunPod. :(

Need help *paid

Hey, so I have a custom workflow that I just can't run on RunPod serverless. Currently I'm trying with this template, but I just get the following error: {'type': 'invalid_prompt', 'message': 'Cannot execute because node IPAdapterUnifiedLoader does not exist.', 'details': "Node ID '#109'", 'extra_info': {}}...

Runpod requests fail with 500

Also, when I try to open my endpoint in the UI, it redirects to a 404. I didn't change anything....

Upgrade faster-whisper version for quick deploy

Hey guys, can you please upgrade the faster-whisper pip dependency version for quick deploy? The current one (0.0.10) does not support the Turbo model. Thanks!

LoRA path in vLLM serverless template

I want to attach a custom LoRA adapter to the Llama-3.1-70B model. Usually when using vLLM, after --enable-lora we also specify --lora-modules name=lora_adapter_path, something like that. But the template only gives an option to enable LoRA; where do I add the path of the LoRA adapter?
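For comparison, this is how the adapter is registered in a plain vLLM deployment (outside the RunPod template), using the two flags mentioned above. The adapter name `my_adapter` and the volume path are placeholders; whether the serverless template exposes an equivalent setting is exactly the open question here:

```shell
# Plain vLLM OpenAI-compatible server: the LoRA adapter is named and
# pointed at its path via --lora-modules (name=path pairs).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B \
  --enable-lora \
  --lora-modules my_adapter=/runpod-volume/loras/my_adapter
```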

Want to split model files out of the Docker image, but it slows down significantly when using storage

I want to split the model files out of my Docker image, since the image is getting bloated. I tried storing the model files on a network volume, but found that inference time grows a lot. Is there a way around this?...

Intermittent timeouts on requests

I have a serverless endpoint with a custom Docker image. I am sending a payload with the Python package via endpoint.run_sync(payload, timeout=60). I currently have 0 active workers. I can typically send the first request fine. After it completes, if I send a following request before that worker times out, it will often time out without ever logging that the main function started. Basically, it seems like the message never gets into the RunPod queue. What could be causing this behavior, and how can I avoid (or debug) it?
Logs are attached - this case is 2 successful requests, then a third request just times out - it seems like the request never gets to the queue (no logs)....
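Until the root cause is found, a retry wrapper around the submit call can at least paper over the dropped requests. A minimal sketch; `run_with_retry` is a hypothetical helper, and it assumes the timeout surfaces as a `TimeoutError` (adjust the exception type to whatever the SDK actually raises):

```python
import time

def run_with_retry(submit, payload, attempts=3, backoff_s=2.0):
    """Call `submit(payload)` (e.g. endpoint.run_sync), retrying on timeouts.

    Retries with exponential backoff; re-raises after the last attempt.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return submit(payload)
        except TimeoutError as exc:
            last_exc = exc
            time.sleep(backoff_s * (2 ** attempt))
    raise last_exc
```

Usage would look like `run_with_retry(lambda p: endpoint.run_sync(p, timeout=60), payload)`; logging each retry also shows how often requests are being dropped.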

"Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/91gr..."

I keep getting these errors on my endpoints. It happens mostly with "high-res" (4K) images, but they're JPEG and 2 MB max. RunPod serverless has significantly deteriorated for me these last few days....

Why, when I try to post, is it already tagged Solved?

Why, when I try to post, is it already tagged Solved?

HF Cache

Hey, I got this email from you guys:
Popular Hugging Face models have super fast cold-start times now
We know lots of our developers love working with Hugging Face models. So we decided to cache them on our GPU servers and network volumes.
...

GPU Availability Issue on RunPod – Need Assistance

Hi everyone, I’m currently facing an issue with GPU availability for my ComfyUI endpoint (id: kw9mnv7sw8wecj) on RunPod. When trying to configure the worker, all GPU options show as “Unavailable”, including 16GB, 24GB, 48GB, and 80GB configurations (as shown in the attached screenshot). This is significantly impacting my workflow and the ability to deliver results to my clients since I rely on timely image generation....

job timed out after 1 retries

I've been seeing this a ton on my endpoint today, resulting in being unable to return images. response_text: "{"delayTime":33917,"error":"job timed out after 1 retries","executionTime":31381,"id":"sync-80dbbd6d-309c-491f-a5d0-2bd79df9c386-e1","retries":1,"status":"FAILED","workerId":"a42ftdfxrn1zhx"} ...

Unable to fetch docker images

During worker initialization I am seeing errors such as: error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded 2024-11-18T18:10:47Z error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)...