Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Error with the pre-built serverless docker image

Hi, this is completely random because sometimes it works smoothly: using the Runpod serverless vLLM, the machine gets stuck on Using model weights format ['*.safetensors']...

How to use environment variables

I have added environment variables to my Runpod serverless endpoint; the thing is, I can't reach them inside the pod. I have defined it like this in the UI: key | value, SOME_KEY | keyvaluehere ...
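For reference, a minimal sketch of reading such a variable from inside the handler, assuming it was saved on the endpoint as SOME_KEY (the name from the question); values set on the endpoint or its template end up as plain process environment variables on the worker:

```python
import os

import runpod


def handler(job):
    # Variables set on the endpoint (or its template) are injected into the
    # worker's environment and can be read with os.environ.
    some_key = os.environ.get("SOME_KEY")
    if some_key is None:
        return {"error": "SOME_KEY is not set on this worker"}
    return {"key_is_set": True, "key_length": len(some_key)}


runpod.serverless.start({"handler": handler})
```

If the variable was added after the workers were created, it may only show up on freshly started workers, so it can help to check on a worker that booted after the change.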

job timed out after 1 retries

I'm getting this message with a FAILED state, in roughly 10% of the jobs coming to this endpoint. Usually this comes with a 2-3 minute delay time as well. Where should I start looking to figure out what could be the issue here? ...

Serverless vLLM deployment stuck at "Initializing" with no logs

I've been trying for hours, initially I was trying to deploy Ollama on Serverless GPU, not working, stuck at initializing. Now I am directly using the Serverless vLLM option and it is still not working. Every time I click the deploy button, it just says "Initializing" and there's nothing more, no logs whatsoever. Any idea? Thanks!

Serverless rate limits for OpenAI chat completions

I have set up an OpenAI chat completions endpoint on Runpod serverless with access to 8 GPUs. I can see all 8 GPUs are running and show healthy logs, but when I run tests I notice that the rate at which requests are processed becomes very slow after approximately 500 requests, even slower than if I only ran on a single dedicated GPU pod. The first 500 requests get processed at a rate in line with expectations for 8 GPUs, but then it immediately falls off a cliff, dropping from ~150 req/s to ~15 req/s. I saw Runpod has rate limits for the /run and /runsync endpoints, but does this also apply to all endpoints? My endpoint is https://api.runpod.ai/v2/<endpoint-id>/openai/v1/completions...
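One way to tell whether the drop is a platform-side limit or a client-side bottleneck is to cap in-flight requests explicitly and log the achieved rate at different concurrency levels. A rough sketch against the OpenAI-compatible /openai/v1 route from the question; the model name, the RUNPOD_API_KEY variable, and the concurrency numbers are placeholders:

```python
import asyncio
import os
import time

import httpx

BASE_URL = "https://api.runpod.ai/v2/<endpoint-id>/openai/v1"  # endpoint id elided as in the question
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}


async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore) -> None:
    async with sem:
        resp = await client.post(
            f"{BASE_URL}/completions",
            headers=HEADERS,
            json={"model": "my-model", "prompt": "ping", "max_tokens": 8},
        )
        resp.raise_for_status()


async def main(total: int = 1000, concurrency: int = 64) -> None:
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests
    async with httpx.AsyncClient(timeout=120) as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client, sem) for _ in range(total)))
        elapsed = time.perf_counter() - start
    print(f"{total / elapsed:.1f} req/s at concurrency={concurrency}")


if __name__ == "__main__":
    asyncio.run(main())
```

If throughput collapses at the same request count regardless of concurrency, that points at the endpoint or platform side rather than the client.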

How to set up runpod-worker-comfy with custom nodes and models

Hi, I managed to set up a serverless API using the SD image example template from github.com/blib-la, but what if I have my own ComfyUI workflow that uses custom nodes and models? How do I make a Docker image for that so I can use it as the template? Ideally I want to use a network drive, but when I use the base template timpietruskyblibla/runpod-worker-comfy:3.1.0-base and try to start a serverless endpoint connected to a network drive I previously downloaded the nodes/models to, they aren't there.
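A rough Dockerfile sketch of the bake-it-into-the-image route, extending the base template from the question. The repository URL, checkpoint name, and the /comfyui path are assumptions to verify against the runpod-worker-comfy README, which is also the place to check how models on a network volume are expected to be laid out:

```dockerfile
# Hypothetical sketch: extend the base worker and bake in one custom node and one model.
FROM timpietruskyblibla/runpod-worker-comfy:3.1.0-base

# Clone a custom node into ComfyUI's custom_nodes directory and install its
# requirements (adjust the path if the base image keeps ComfyUI elsewhere).
RUN git clone https://github.com/your-org/your-custom-node.git \
        /comfyui/custom_nodes/your-custom-node && \
    pip install -r /comfyui/custom_nodes/your-custom-node/requirements.txt

# Bake a checkpoint into the image so the workflow can reference it by filename.
COPY my_checkpoint.safetensors /comfyui/models/checkpoints/my_checkpoint.safetensors
```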

Discord webhook

How do I use a Discord webhook with serverless? I tried with both "webhook" and "webhookV2".
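One likely reason a Discord URL doesn't work directly as the "webhook": Runpod posts the raw job result to the webhook URL, while Discord's webhook API expects its own JSON shape (e.g. a "content" field), so the call gets rejected. A hedged sketch of a small relay you could host yourself and pass as the webhook URL instead; the route name and the DISCORD_WEBHOOK_URL variable are assumptions, not Runpod features:

```python
import os

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
DISCORD_WEBHOOK_URL = os.environ["DISCORD_WEBHOOK_URL"]  # your Discord webhook


@app.post("/runpod-webhook")
async def runpod_webhook(request: Request):
    # Runpod POSTs the job payload (id, status, output, ...) to this URL.
    job = await request.json()
    message = f"Job {job.get('id')} finished with status {job.get('status')}"
    async with httpx.AsyncClient() as client:
        # Discord expects its own schema, hence the translation step.
        await client.post(DISCORD_WEBHOOK_URL, json={"content": message})
    return {"ok": True}
```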

HIPAA BAA

Do you guys support signing a HIPAA BAA? Thank you!...

Attaching python debugger to docker image

How is it possible to attach a debugger to a Docker image? docker run -it --rm --name model_container \ --runtime=nvidia --gpus all \ -p 10002:5678 -p 10082:8000 ...
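Since port 5678 is already published in that docker run command, one common approach is debugpy: have the Python process inside the container listen and wait for a remote debugger to attach. A minimal sketch (debugpy has to be installed in the image; the port mapping follows the command in the question, so the IDE attaches to host port 10002):

```python
import debugpy

# Listen on all interfaces inside the container; with `-p 10002:5678`
# the IDE's "remote attach" config points at localhost:10002.
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for debugger to attach...")
debugpy.wait_for_client()  # optional: block until the IDE connects

# ... start the actual model server / handler after this point ...
```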

Error requiring "flash_attn"

I'm trying to run MiniCPM-V, which according to the docs supports vLLM (https://github.com/OpenBMB/MiniCPM-V/tree/main?tab=readme-ov-file#inference-with-vllm), but on run I'm getting ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn. Any help on how to overcome this error? I was trying to use the web UI to configure serverless....
Solution:
It looks like you need the flash_attn Python module. You need to uncomment the flash_attn line in requirements.txt. It currently looks like this:
#flash_attn==2.3.4
It needs to look like this:
flash_attn==2.3.4

worker exited with exit code 137

My serverless worker seems to get the error worker exited with exit code 137 after multiple consecutive requests (around 10 or so). It seems like the container is running out of memory. Does anyone know what the issue could be, as the script already runs gc.collect() to free up resources, but the issue still persists?
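Exit code 137 is 128 + SIGKILL, i.e. the process was killed, typically by the out-of-memory killer, so the leak can be on the CPU side (container RAM) as well as on the GPU. A hedged sketch of per-job cleanup in a handler, assuming a PyTorch-based pipeline; run_pipeline is a placeholder for the actual inference call:

```python
import gc

import runpod
import torch


def run_pipeline(payload):
    # Placeholder for the real inference call that allocates GPU/CPU memory.
    return {"echo": payload}


def handler(job):
    try:
        return run_pipeline(job["input"])
    finally:
        # Release references and cached GPU memory between jobs; if usage still
        # climbs, something is holding references (global lists, caches, etc.).
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()


runpod.serverless.start({"handler": handler})
```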

All workers saying Retrying in 1 second.

I am trying to bring up an endpoint. I have it set to 3 max workers. It is trying to bring up 3 workers and 2 extra workers, and all of them are showing
Retrying in 1 second
Retrying in 1 second
I am not seeing any other output. Is something happening in the background or are these crashed?...

How can I limit the queue "in progress"?

I don't understand what has changed. Since a few days ago, instead of queuing, tasks go into progress almost immediately. Because of this, the execution time is increasing. I want 1-2 tasks to be in progress and the rest to wait in the queue. How do I do that? It's currently the other way around....
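If the goal is to keep each worker at only 1-2 jobs in progress, one place to look is the worker's concurrency: the Runpod Python SDK accepts a concurrency_modifier in serverless.start, and jobs beyond that limit stay queued (worth verifying against the current SDK docs). A rough sketch; the limit of 2 is just an example:

```python
import asyncio

import runpod


async def handler(job):
    # Placeholder work; with an async handler the worker can hold several
    # jobs "in progress" at once, up to the concurrency limit below.
    await asyncio.sleep(1)
    return {"ok": True, "id": job["id"]}


def concurrency_modifier(current_concurrency: int) -> int:
    # Keep at most 2 jobs in progress on this worker; the rest stay queued.
    return 2


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

If the limit should apply across the whole endpoint rather than per worker, the endpoint's max-workers setting matters as well.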

webhooks on async completion

Is there some functionality in serverless that would be event-driven, so I don't need to keep polling to see whether a job has completed?
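Yes: the /run request accepts a webhook URL, and Runpod POSTs the job result there when it completes, so polling /status isn't required. A hedged sketch of submitting a job this way; the endpoint id, the RUNPOD_API_KEY variable, and the receiver URL are placeholders:

```python
import os

import requests

ENDPOINT_ID = "<endpoint-id>"  # placeholder
RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    json={
        "input": {"prompt": "hello"},
        # Runpod calls this URL with the job result once the job finishes,
        # so your service is notified instead of polling /status.
        "webhook": "https://example.com/runpod-webhook",
    },
    timeout=30,
)
print(resp.json())  # contains the job id; status/output arrive at the webhook
```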

How to obtain a receipt after making a payment on the RunPod platform?

Hi, does anyone know how to obtain a receipt after making a payment on the RunPod platform? I need it for reimbursement purposes. Thanks!

GGUF vllm

It seems that the newest version of vLLM supports GGUF models. Has anyone figured out how to make this work in Runpod serverless? It seems like I need to set some custom ENV vars, or maybe someone knows a way to convert GGUF back to safetensors?
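For reference, recent vLLM versions can load a GGUF file directly, but the tokenizer usually has to point at the original (non-GGUF) model repository; whether the pre-built Runpod vLLM worker exposes this through environment variables is something to check against the worker's docs. A minimal local sketch with placeholder paths and repo ids:

```python
from vllm import LLM, SamplingParams

# GGUF support in vLLM: pass the .gguf file as the model and point the
# tokenizer at the Hugging Face repo the GGUF was converted from.
llm = LLM(
    model="/models/my-model.Q4_K_M.gguf",  # placeholder path
    tokenizer="org/original-model",        # placeholder repo id
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```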

Speeding up loading of model weights

Hi guys, I have set up my serverless Docker image to contain all my required model weights. My handler script also loads the weights using the diffusers library's .from_pretrained with local_files_only=True, so we are loading everything locally. I notice that during cold starts, loading those weights still takes around 25 seconds until the logs display --- Starting Serverless Worker | Version 1.6.2 ---. Does anyone have experience optimising the time needed to load weights? Could we pre-load it in RAM or something (I may be totally off)?...
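If the weights already load outside the handler, the remaining levers are mostly about reading fewer bytes (fp16 safetensors instead of fp32) and keeping one or more workers active/warm so cold starts are rarer. A hedged sketch of the usual layout, assuming a Stable Diffusion pipeline and a placeholder model path:

```python
import runpod
import torch
from diffusers import StableDiffusionPipeline

# Load once per worker process, outside the handler, so the cost is paid at
# worker start rather than on every job.
PIPE = StableDiffusionPipeline.from_pretrained(
    "/app/models/my-model",     # placeholder: weights baked into the image
    local_files_only=True,
    torch_dtype=torch.float16,  # fp16 safetensors roughly halve the bytes read
).to("cuda")


def handler(job):
    image = PIPE(job["input"]["prompt"]).images[0]
    image.save("/tmp/out.png")
    return {"image_path": "/tmp/out.png"}


runpod.serverless.start({"handler": handler})
```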

Serverless service to run the Faster Whisper

Dear RunPod Technical Support, I'm using your Serverless service to run the Faster Whisper model and have an issue when sending large audio files for transcription. When I send large files through the API, I receive this error: ...

Asynchronous Job

Is it possible to run a long task (30 minutes to 1 hour) on a serverless endpoint, return the job ID, and, when the job is completed, have it hit an endpoint (to signal that the job has finished)?

Is there a way to speed up the reading of external disks (network volumes)?

Is there a way to speed up the reading of external disks? The network volume is a bit slow; are there any plans to improve this? I need to load a 6.4 GB model from the external disk, but it takes 7 times longer than loading from the container volume....
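One common workaround while loading from a network volume: copy the model directory to the container's local disk (or /dev/shm, which is RAM-backed) once per worker and load from that copy, since one large sequential copy is usually faster than the many small reads a loader does. A sketch with placeholder paths; the /runpod-volume mount point is the usual one on serverless workers but worth double-checking:

```python
import shutil
import time
from pathlib import Path

# On serverless workers the network volume is typically mounted at /runpod-volume;
# the model subdirectory below is a placeholder.
NETWORK_COPY = Path("/runpod-volume/models/my-model")
LOCAL_COPY = Path("/tmp/models/my-model")  # container disk; /dev/shm would be RAM-backed


def stage_model() -> Path:
    """Copy the model to local disk once per worker, then load from the copy."""
    if not LOCAL_COPY.exists():
        start = time.perf_counter()
        LOCAL_COPY.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(NETWORK_COPY, LOCAL_COPY)
        print(f"Staged model locally in {time.perf_counter() - start:.1f}s")
    return LOCAL_COPY


MODEL_DIR = stage_model()
# ...load the model from MODEL_DIR (e.g. with local_files_only=True)...
```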