RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Serverless vLLM deployment stuck at "Initializing" with no logs

I've been trying for hours. Initially I was trying to deploy Ollama on Serverless GPU, but it wasn't working and got stuck at "Initializing". Now I am using the Serverless vLLM option directly and it is still not working. Every time I click the deploy button, it just says "Initializing" and nothing more happens, with no logs whatsoever. Any ideas? Thanks!

Serverless rate limits for OpenAI chat completions

I have set up an OpenAI chat completions endpoint on RunPod Serverless with access to 8 GPUs. I can see all 8 GPUs are running and showing healthy logs, but when I run tests I notice that the rate at which requests are processed becomes very slow after approximately 500 requests, even slower than if I only ran on a single dedicated GPU pod. The first 500 requests get processed at a rate in line with expectations for 8 GPUs, but then it immediately falls off a cliff, dropping from ~150 req/s to ~15 req/s. I saw RunPod has rate limits for the /run and /runsync endpoints, but do these also apply to all other endpoints? My endpoint is https://api.runpod.ai/v2/<endpoint-id>/openai/v1/completions...
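
For anyone reproducing this, a minimal client sketch against the OpenAI-compatible route described above (endpoint ID, API key, and model name are placeholders):

```python
# Sketch: calling a RunPod Serverless vLLM endpoint through its OpenAI-compatible
# route. <endpoint-id>, the API key, and the model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint-id>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

resp = client.chat.completions.create(
    model="<model-name>",  # whatever model the vLLM worker was deployed with
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```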

How to set up runpod-worker-comfy with custom nodes and models

Hi, I managed to set up a serverless API using the SD image example template from github.com/blib-la, but what if I have my own ComfyUI workflow that uses custom nodes and models? How do I build a Docker image for that so I can use it as the template? Ideally I want to use a network drive, but when I use the base template timpietruskyblibla/runpod-worker-comfy:3.1.0-base and try to start a serverless endpoint connected to a network drive I previously downloaded the nodes/models to, they aren't there.

Discord webhook

How do I use a Discord webhook with serverless? I tried with both "webhook" and "webhookV2".
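
One possible gotcha (an assumption about what's failing here): the job webhook POSTs the raw job-result JSON, while Discord webhooks expect a body with a `content` field, so pointing the job webhook straight at a Discord URL usually gets rejected. A minimal relay sketch, with Flask/requests and the URL as hypothetical choices:

```python
# Sketch: receive RunPod's job-completion callback and forward a short message
# to a Discord webhook. Flask, requests, and the URLs are assumptions.
import requests
from flask import Flask, request

app = Flask(__name__)
DISCORD_WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder

@app.route("/runpod-callback", methods=["POST"])
def runpod_callback():
    job = request.get_json(force=True)  # job result posted by the webhook
    msg = f"Job {job.get('id')} finished with status {job.get('status')}"
    requests.post(DISCORD_WEBHOOK_URL, json={"content": msg}, timeout=10)
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```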

HIPAA BAA

Do you guys support signing a HIPAA BAA? Thank you!...

Attaching python debugger to docker image

How is it possible to attach a debugger to a Docker container started like this: docker run -it --rm --name model_container \ --runtime=nvidia --gpus all \ -p 10002:5678 -p 10082:8000 ...
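
One common approach (a sketch, not RunPod-specific; assumes port 5678 inside the container is the one mapped to 10002 above) is to start the process under debugpy and attach from the host:

```python
# Sketch: make the containerized Python process wait for a debugger on port 5678
# (the container port published as 10002 above). Requires: pip install debugpy
import debugpy

debugpy.listen(("0.0.0.0", 5678))    # listen inside the container
print("Waiting for debugger to attach on :5678 ...")
debugpy.wait_for_client()            # block until VS Code / another client attaches

# ... then continue with normal startup, e.g. runpod.serverless.start({...})
```

From the host you would then attach to localhost:10002 (the published port), e.g. with a VS Code "attach" launch configuration.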

Error requiring "flash_attn"

I'm trying to run MiniCPM-V, which according to the docs supports vLLM (https://github.com/OpenBMB/MiniCPM-V/tree/main?tab=readme-ov-file#inference-with-vllm), but on run I'm getting: ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn. Any help on how to overcome this error? I was trying to use the web UI to configure serverless....
Solution:
It looks like you need the flash_attn Python module. You need to uncomment the flash_attn line in requirements.txt. It currently looks like this:
#flash_attn==2.3.4
It needs to look like this:
flash_attn==2.3.4

worker exited with exit code 137

My serverless worker seems to get the error worker exited with exit code 137 after multiple consecutive requests (around 10 or so). It seems like the container is running out of memory. Does anyone know what could be the issue? The script already runs gc.collect() to free up resources, but the problem still persists.
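
If the growth is on the GPU rather than in system RAM (an assumption here, since exit code 137 is an out-of-memory kill), gc.collect() alone won't release cached CUDA memory. A sketch of a handler that drops references and empties the CUDA cache after every request:

```python
# Sketch: free GPU memory between requests in addition to gc.collect().
# Assumes a PyTorch-based pipeline; run_pipeline is a hypothetical helper.
import gc
import torch
import runpod

def handler(job):
    try:
        return run_pipeline(job["input"])   # hypothetical inference call
    finally:
        gc.collect()                        # release Python-side references
        if torch.cuda.is_available():
            torch.cuda.empty_cache()        # return cached CUDA blocks to the driver

runpod.serverless.start({"handler": handler})
```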

All workers saying Retrying in 1 second.

I am trying to bring up an endpoint. I have it set to 3 max workers. It is trying to bring up 3 workers plus 2 extra workers, and all of them are showing
Retrying in 1 second
Retrying in 1 second
I am not seeing any other output. Is something happening in the background, or have these crashed?...

How can I limit the queue "in progress"?

I don't understand what has changed. Since a few days ago, instead of queuing, tasks go into "in progress" almost immediately. Because of this, the execution time is increasing. I want only 1-2 tasks to be in progress at a time and the rest to wait in the queue. How do I do that? Right now it's the other way around....
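
If the goal is to cap how many jobs a single worker picks up at once, one option (a sketch; assumes the Python worker SDK's concurrency_modifier hook) is to pin the worker's concurrency to a small number so the remaining jobs stay IN_QUEUE:

```python
# Sketch: cap in-flight jobs per worker so the rest wait in the queue.
# Assumes the runpod Python SDK's concurrency_modifier hook.
import runpod

MAX_CONCURRENCY = 2  # allow at most 2 jobs "in progress" on this worker

def handler(job):
    # ... do the actual work here ...
    return {"ok": True}

def adjust_concurrency(current_concurrency):
    # Called periodically by the SDK; the returned value becomes the cap.
    return MAX_CONCURRENCY

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": adjust_concurrency,
})
```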

webhooks on async completion

Is there some functionality in serverless that would be event-driven, so I don't need to keep polling to see whether a job has completed?
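
For reference, the /run request accepts a webhook URL that gets called when the job finishes, so polling /status becomes optional. A sketch (endpoint ID, API key, callback URL, and input are placeholders):

```python
# Sketch: submit a job with a completion webhook instead of polling /status.
# Endpoint ID, API key, and callback URL are placeholders.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<endpoint-id>/run",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "input": {"prompt": "hello"},               # your normal job input
        "webhook": "https://example.com/job-done",  # called when the job completes
    },
    timeout=30,
)
print(resp.json()["id"])  # job id, still usable for /status as a fallback
```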

How to obtain a receipt after making a payment on the RunPod platform?

Hi, does anyone know how to obtain a receipt after making a payment on the RunPod platform? I need it for reimbursement purposes. Thanks!

GGUF vllm

It seems that the newest version of vLLM supports GGUF models. Has anyone figured out how to make this work in RunPod serverless? It seems like you need to set some custom env vars, or maybe someone knows a way to convert GGUF back to safetensors?
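
For context, a sketch of what plain vLLM's (experimental) GGUF loading looks like outside the RunPod worker: you point it at a local .gguf file and usually pass the tokenizer from the base model repo. The file path and repo name are placeholders, and whether the RunPod vLLM worker exposes this through its env vars is something I'm not sure of:

```python
# Sketch: vLLM's experimental GGUF loading, used directly (not via the RunPod worker).
# The GGUF path and base-model repo are placeholders/assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3-8b-instruct.Q4_K_M.gguf",   # local GGUF file
    tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",   # tokenizer from the base repo
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```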

Speeding up loading of model weights

Hi guys, I have set up my serverless Docker image to contain all my required model weights. My handler script also loads the weights using the diffusers library's .from_pretrained with local_files_only=True, so we are loading everything locally. I notice that during cold starts, loading those weights still takes around 25 seconds until the logs display --- Starting Serverless Worker | Version 1.6.2 ---. Does anyone have experience optimizing the time needed to load weights? Could we pre-load them into RAM or something (I may be totally off)?...
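
One pattern that usually helps (a sketch; the model path and pipeline class are placeholders for whatever the image actually ships) is to load the pipeline at module import time, outside the handler, so the cost is paid once per worker rather than per request, and to keep the weights in safetensors format:

```python
# Sketch: load weights once at import time so every request reuses the same pipeline.
# Model path and pipeline class are placeholders.
import runpod
import torch
from diffusers import StableDiffusionPipeline

# Runs once when the worker process starts, not on every request.
pipe = StableDiffusionPipeline.from_pretrained(
    "/models/my-model",
    torch_dtype=torch.float16,
    local_files_only=True,
).to("cuda")

def handler(job):
    image = pipe(job["input"]["prompt"]).images[0]
    image.save("/tmp/out.png")
    return {"image_path": "/tmp/out.png"}

runpod.serverless.start({"handler": handler})
```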

Serverless service to run the Faster Whisper

Dear RunPod Technical Support, I'm using your Serverless service to run the Faster Whisper model and I have an issue when sending large audio files for transcription. When I send large files through the API, I receive this error: ```...
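
If the failure comes from the request-payload size limit on /run and /runsync (an assumption, since the error text is cut off), a common workaround is to pass a URL to the audio instead of embedding the file in the request, and download it inside the handler:

```python
# Sketch: accept an audio URL in the job input instead of an inlined file, so large
# audio doesn't hit the request-size limit. Field names and model size are assumptions.
import tempfile
import requests
import runpod
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")  # loaded once per worker

def handler(job):
    audio_url = job["input"]["audio_url"]
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(requests.get(audio_url, timeout=60).content)
        path = f.name
    segments, info = model.transcribe(path)
    return {"text": " ".join(s.text for s in segments), "language": info.language}

runpod.serverless.start({"handler": handler})
```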

Asynchronous job

Is it possible to run a long task (30 min-1 hour) on a serverless endpoint, return the job ID, and, when the job is completed, hit an endpoint (to signal that the job has finished)?

Is there a way to speed up the reading of external disks (network volumes)?

Is there a way to speed up reading from external disks? The network volume is a bit slow; are there any plans to improve this? I need to load a 6.4 GB model from the external disk, but it takes 7 times longer than from the container volume....
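
One workaround (a sketch; the paths are assumptions) is to copy the weights from the network volume to the worker's local container disk once at startup and load from there, trading a one-time copy for faster subsequent reads:

```python
# Sketch: copy model weights from the (slower) network volume to local container
# storage once per worker, then load from the local copy. Paths are assumptions.
import shutil
from pathlib import Path

NETWORK_COPY = Path("/runpod-volume/models/my-model")  # network volume mount (assumed)
LOCAL_COPY = Path("/tmp/models/my-model")              # local container disk

def ensure_local_model() -> Path:
    if not LOCAL_COPY.exists():
        LOCAL_COPY.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(NETWORK_COPY, LOCAL_COPY)      # one-time copy per worker
    return LOCAL_COPY

model_dir = ensure_local_model()
# ... load the model from model_dir with your usual library ...
```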

One request = one worker

How can I configure my endpoint so that one request equals one worker, and one worker does not handle more than one request within a certain timeframe? My workload is bursty and requires all of the workers to be available at once. However, my endpoint does not give me that and takes a long time to start all the workers I need. In addition, workers are sometimes reused instead of a new instance being created, which I do not want....

Very slow upload speeds from serverless workers

I'm uploading files to Supabase from within the serverless workers and I noticed the process is extremely slow. I understand there's some latency because most workers I'm getting are in Europe and my Supabase instance is in US East, but still, almost 20 seconds to upload an 8 MB file is bad. I've checked that it's not a Supabase issue, as I'm based in Europe and my upload speeds are just fine....

TTL for vLLM endpoint

Is there a way to specify a TTL value when calling a vLLM endpoint via the OpenAI-compatible API?
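
I don't think the OpenAI-compatible route exposes this directly, but for comparison, the native /run route takes an execution policy in the request body. A sketch, with field names and units reflecting my reading of the job policy and worth verifying against the docs before relying on it:

```python
# Sketch: setting an execution policy (including ttl) on the native /run route.
# Field names/units are my understanding of the job policy, not verified here.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<endpoint-id>/run",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "input": {"prompt": "Hello!"},
        "policy": {
            "executionTimeout": 600000,  # ms the job may run once started (assumed)
            "ttl": 3600000,              # ms before the job expires (assumed)
        },
    },
    timeout=30,
)
print(resp.json())
```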