We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!




Faster-Whisper worker template is not fully up-to-date

Hi, we're using the Faster-Whisper worker on Serverless. I saw that Faster-Whisper itself is currently on version 1.0.2, whereas the RunPod template is still on 0.10.0. A few changes have been introduced in Faster-Whisper since then (it now uses CUDA 12) that we would like to benefit from, especially the language_detection_threshold setting, since most of our transcriptions of people with British accents are being transcribed into Welsh (with a language detection confidence of around 0.51 to 0.55) - which could be circumvented by increasing the threshold....
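A minimal sketch of the idea, assuming faster-whisper >= 1.0.2 installed directly (the model name, threshold value, and the `pick_language` fallback helper are illustrative, not part of the worker template):

```python
def pick_language(detected: str, probability: float,
                  threshold: float = 0.7, fallback: str = "en") -> str:
    """Fall back to a default language when detection confidence is low.

    Mirrors the intent of language_detection_threshold: a Welsh guess at
    ~0.53 confidence gets overridden, a confident guess is kept.
    """
    return detected if probability >= threshold else fallback


def transcribe(path: str, threshold: float = 0.7):
    # Import kept local so the sketch stays self-contained;
    # requires faster-whisper >= 1.0.2.
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    # With language_detection_threshold set, detection that falls below the
    # threshold is not trusted as-is, instead of locking in a low-confidence
    # language guess for the whole transcription.
    segments, info = model.transcribe(
        path,
        language_detection_threshold=threshold,
    )
    return list(segments), info
```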

Slow IO speeds on serverless

An A6000 always-active worker takes twice as long to run my code as a normal A6000; I think it is I/O speed. How can I see I/O speeds?

How to download models for Stable Diffusion XL on serverless?

1) I created a new network storage of 26 GB for various models I'm interested in trying.
2) I created a Stable Diffusion XL endpoint on serverless, but couldn't attach the network storage.
3) After the deployment succeeded, I clicked on edit endpoint and attached that network storage to it. So far so good I believe. But how do I exactly download various SDXL models into my network storage, so that I could use them via Postman?...
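One common approach is to attach the network volume to a regular Pod, where it typically mounts at /workspace, and download checkpoints there; on Serverless workers the same volume then appears under /runpod-volume. A small sketch (the URL and directory names are illustrative):

```python
import pathlib
import urllib.request


def dest_path(url: str, dest_dir: str = "/workspace/models") -> pathlib.Path:
    """Derive the local file path on the volume from a checkpoint URL."""
    return pathlib.Path(dest_dir) / url.rsplit("/", 1)[-1]


def download_model(url: str, dest_dir: str = "/workspace/models") -> pathlib.Path:
    """Download a checkpoint onto the attached network volume."""
    dest = dest_path(url, dest_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)  # streams the file to disk
    return dest
```

Once the files are on the volume, the Serverless endpoint can reference them by their /runpod-volume/... path in requests sent via Postman.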

0% GPU utilization and 100% CPU utilization on Faster Whisper quick deploy endpoint

I used the "Quick Deploy" option to deploy a Faster Whisper custom endpoint. Then, I called the endpoint to transcribe a 1-hour-long podcast using the following parameters: ``` { 'input': { 'audio': '',...

Loading models from network volume cache is taking too long.

Hello all, I'm loading my model as follows so that I can use the cache from my network volume: model = AutoModel.from_pretrained(...
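For reference, a sketch of pointing the Hugging Face cache at the network volume, assuming the volume mounts at /runpod-volume on a Serverless worker (the directory layout and model ID are illustrative):

```python
import os


def hf_cache_dir(volume_root: str = "/runpod-volume") -> str:
    """Hugging Face cache directory on the network volume."""
    return os.path.join(volume_root, "huggingface")


def load_cached_model(model_id: str):
    # transformers import kept local so the sketch is self-contained.
    from transformers import AutoModel

    cache = hf_cache_dir()
    os.environ.setdefault("HF_HOME", cache)  # also covers tokenizer/aux files
    # local_files_only skips Hub HTTP checks on every cold start; it assumes
    # the weights were already downloaded into the cache once.
    return AutoModel.from_pretrained(model_id, cache_dir=cache,
                                     local_files_only=True)
```

Note that network volumes are network-attached storage, so raw read throughput is much lower than the container's local disk; if loading is still slow with a warm cache, copying the weights to local disk at startup (or baking them into the image) is often the bigger win.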

Are webhooks fired from Digital Ocean?

I set up a WAF in AWS to block bots, and I am getting a bunch of requests to my RunPod Serverless webhook blocked by AWS#AWSManagedRulesBotControlRuleSet#SignalKnownBotDataCenter . The IP address in these requests seems to be a Digital Ocean data center. I have temporarily disabled the WAF on my ALB for my RunPod webhooks, but I'm hoping someone can confirm whether these are legitimate requests, because I was under the impression that RunPod uses AWS, not Digital Ocean.

best architecture opinion

Hello, I would like to build an app that, from one prompt specified by a user, creates 10 prompts. It then calls a model once for each of these 10 prompts, giving me 10 responses, and finally makes one last call to aggregate the 10 responses into a single response that is returned to the user. My question is: do you have any advice on how to build this? Option a) send the user prompt to the serverless endpoint and, within the endpoint, create the 10 prompts, call the model sequentially, and then call it one last time to aggregate the result - all of that in one call from the user to the serverless endpoint...
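A sketch of option (a) with one improvement: fan the 10 calls out concurrently inside the single serverless invocation rather than sequentially. Here `expand_prompt` and `call_model` are placeholders - in real code `call_model` would be an HTTP request to your model endpoint:

```python
import asyncio


def expand_prompt(user_prompt: str, n: int = 10) -> list[str]:
    """Derive n sub-prompts from the user's prompt (placeholder logic)."""
    return [f"{user_prompt} (angle {i + 1})" for i in range(n)]


async def call_model(prompt: str) -> str:
    """Stand-in for one model call (e.g. a request to a vLLM endpoint)."""
    await asyncio.sleep(0)  # real code would await an HTTP request here
    return f"response to: {prompt}"


async def handle(user_prompt: str) -> str:
    prompts = expand_prompt(user_prompt)
    # Fan-out: the 10 calls run concurrently within the one invocation.
    responses = await asyncio.gather(*(call_model(p) for p in prompts))
    # Fan-in: one final aggregation call over the 10 intermediate responses.
    return await call_model("summarize: " + " | ".join(responses))
```

Keeping everything in one invocation keeps the user-facing API to a single request, while the concurrency removes most of the latency cost of the sequential version.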

Cancelling job resets flashboot

For some reason, whenever we cancel a job, the next time the serverless worker cold boots it doesn't use FlashBoot and instead reloads the LLM model weights into the GPU from scratch. Any idea why cancelling jobs might be causing this problem? Is there maybe a more graceful way to stop jobs early than the /cancel/{job_id} endpoint?


We are also starting a vLLM project and I have two questions: 1) In the environment variables, do I have to define the RUNPOD_API_KEY with my own secret key to access the final vLLM OpenAI endpoint? 2) Isn't MAX_CONTEXT_LEN_TO_CAPTURE now deprecated? Do we still need to provide it, if MAX_MODEL_LEN is already set? ...

Do I need to allocate extra container space for Flashboot?

I'm planning to use a Llama 3 model that takes about 40 GB of space. I believe Flashboot takes a snapshot of the worker and keeps it on disk to load it within seconds when the worker becomes active. Do I need to allocate enough space on the container for this? Since I'm planning to select a 48 GB vRAM GPU, do I need to allocate 40 GB (model) + 48 GB (snapshot) + 5 GB (extra) = 93 GB of container space?

When serverless is used, does the machine reboot if it is executed consecutively? Currently seeing issues

When serverless is used, does the machine reboot if it is executed consecutively? Currently seeing issues with the last execution affecting the next.

unusual usage

Hello! We were billed unexpectedly this past weekend...

Slow I/O

Hey, I am trying to download a 7 GB file and run an ffmpeg process to extract the audio from that file (it's a video). Locally it takes around 5 minutes on average, but when I try it on the cloud (I chose a general-purpose CPU, since a GPU doesn't seem to give any advantage here), the I/O is SUPER SLOW. Is there anything I can do to speed up the disk I/O?...
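For an extraction like this, stream-copying the audio instead of re-encoding it keeps the job purely I/O-bound, so where the input and output live matters most: the container's local disk is generally much faster than a network volume. A sketch (paths and codec flags are illustrative and assume the container's audio codec can be copied as-is):

```python
import subprocess


def ffmpeg_extract_cmd(src: str, dst: str) -> list[str]:
    # -vn drops the video stream; "-acodec copy" remuxes the audio without
    # re-encoding, so runtime is dominated by disk throughput, not CPU.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-acodec", "copy", dst]


def extract_audio(src: str, dst: str) -> None:
    subprocess.run(ffmpeg_extract_cmd(src, dst), check=True)
```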

Problem with RunPod cuda base image. Jobs stuck in queue forever

Hello, I'm trying to make a request to a serverless endpoint that uses this base image in its Dockerfile: FROM runpod/base:0.4.0-cuda11.8.0. I want the server side to run the input_fn function when I make the request. This is part of the server-side code: ```model = model_fn('/app/src/tapnet/checkpoints/')...
Hmm yeah, I guess Python 3.11 is missing from that RunPod base image...

runpod-worker-a1111 and loras

I don't think my LoRAs are working with this worker. But it seems to be able to list LoRAs via /sdapi/v1/loras, so am I able to use LoRAs with this worker or not?...

Intermittent connection timeouts to

```json { "endpointId":"oic105cyzlovnk", "workerId":"3cwou4m0x6hxl0", "level":"error"...

vLLM streaming ends prematurely

I'm having issues with my vLLM worker ending a generation early. When I send the same prompt to my API without "stream": true, the prompt returns fully. When "stream": true is added to the API call, it stops early, sometimes right after {"user":"assistant"} is sent. It was working earlier this morning; I see this in the system logs around the time it stopped working: 2024-06-13T15:37:10Z create pod network 2024-06-13T15:37:10Z create container runpod/worker-vllm:stable-cuda12.1.0 2024-06-13T15:37:11Z start container...

Why no gpu in canada data center today?

My network volume is in ca-mtl-1, and there are no GPUs available now.
Hey y'all, we disable the creation of new pods four days before maintenance to prevent further issues (this was not something I was personally aware of until now, otherwise it would have been posted in #🚨|incidents). However, I talked with the team and you should be able to create new pods again; let me know if you run into any issues.

is there example code to access the runpod-worker-comfy serverless endpoint

Hi, I have managed to run the runpod-worker-comfy serverless endpoint, and I know it supports five operations: RUN, RUNSYNC, STATUS, CANCEL, HEALTH. But I don't know exactly how to access the service from my Python code: how to prepare the API key and the worker ID, how to prepare the request for RUN, how to check the status until it is finished, and how to download the generated image. Does any example code exist for these basic operations from Python? Previously I had Python code that communicated directly with the ComfyUI server: it would create a websocket, send the workflow with an HTTP POST, keep checking the history, and once the work was done, read the image from the output passed through the websocket connection. When wrapped with runpod-worker-comfy, the interface is indeed easier, and the input validation is great, but I do not know how to use it from my code and did not find any example code for accessing it, sorry for my ignorance....