RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods-clusters

Cold start issue

I'm stuck with a cold start issue that makes the response very slow when making a new request after a long time. Are there any ways to solve this issue?
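
Platform-side, keeping an active (always-on) worker or enabling FlashBoot are the usual levers for this. On the code side, a common pattern is to load the model once at module import so that only the cold request pays the load cost; a minimal sketch, where `load_my_model` is a hypothetical stand-in for your actual loader:

```python
import time

import runpod


def load_my_model():
    # Hypothetical stand-in for an expensive model load; replace with your loader.
    time.sleep(5)
    return lambda prompt: f"echo: {prompt}"


# Runs once per worker process, at import time, not per request.
MODEL = load_my_model()


def handler(job):
    prompt = job["input"].get("prompt", "")
    return {"output": MODEL(prompt)}


runpod.serverless.start({"handler": handler})
```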

Strange results in Serverless mode

What am I doing wrong? Why is the response so strange? I've attached the input params and one of the results....

About building a container with a Git repo

I'm not sure whether I can use the buildx command in 'Container Start Command'. ChatGPT said I would need to push the image to Docker Hub before using it. This is my command; is it valid? docker buildx create --name mybuilder --use...

Generation with increasing worker count from 5 to 10

Hello everyone, a question: if the number of workers on the endpoint increases, will generation become more expensive, or only faster?

Serverless endpoint using official GitHub repo stuck at "Waiting for building"

Hi, I'm deploying a serverless endpoint using the official GitHub repo runpod-workers/worker-template. I just forked the repo and added one line to the Dockerfile, "RUN pip install --upgrade pip && pip install uv", WITHOUT any other changes. The build completes successfully with no errors in the logs. However, during the testing phase, the status remains at "Waiting for building" indefinitely. No test logs are generated. After about an hour, the process cancels automatically. I've tried increasing max workers to 2 and allowing multiple GPU types, but the issue persists. Could someone help me identify what's causing the worker to hang during initialization?...

All workers idle despite many jobs in queue

I have 5 workers sitting idle and 100s of jobs stuck "in queue" without any processing
Solution:
OK, looks like Microsoft pulled their model off Hugging Face very unexpectedly: https://github.com/microsoft/TRELLIS/issues/264

Slow model loading times with vLLM

Deployed the vLLM worker from the web UI with version 0.8.5 and attached network storage. It is a fine-tuned Gemma 3 model. INFO 05-17 20:09:56 [loader.py:458] Loading weights took 113.32 seconds INFO 05-17 20:09:56 [model_runner.py:1140] Model loading took 23.3141 GiB and 160.792180 seconds...

Stop storing pull image logs

For some reason, I suddenly got all of the pull-image logs in the logs section in serverless, and it's now really cumbersome to find the actual runtime logs.

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.8, please update your driver

error starting container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.8, please update your driver to a newer version, or use an earlier cuda container: unknown...

Some queries take longer than usual

I notice that some queries take a very long time (stuck in delay). Why? P.S. I notice the problem occurs when I leave the server idle for a while...

Total token limit stuck at 131

I use vLLM and set the max model length to 8000, but the output is just 131 tokens (total out + in), even though I have set max tokens to 2048. I tried with two models and the result is the same.
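
For reference, this is roughly how a per-request output cap can be passed to the vLLM worker through the RunPod Python SDK; the `sampling_params` field and its keys are taken from memory of the worker-vllm docs and should be verified against your worker version, and the endpoint ID is a placeholder (a sketch, not a confirmed fix for the 131-token cap):

```python
import os

import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")  # placeholder endpoint ID

# If max_tokens is left unset per request, the effective default can be far
# smaller than the model's context length.
result = endpoint.run_sync({
    "input": {
        "prompt": "Explain continuous batching in two sentences.",
        "sampling_params": {"max_tokens": 2048, "temperature": 0.7},
    }
})
print(result)
```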

Failing to start job

About one in ten times we get an error message when trying to pass a message. The error is not inside the serverless container; the job is not getting processed by RunPod itself. We're running from a FastAPI background_task, so the trace is not full ```...
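
A hedged sketch of wrapping the submission so the full traceback is captured even inside a FastAPI background task; the endpoint ID and payload shape are placeholders, and `logger.exception` is what records the stack trace that otherwise gets lost:

```python
import logging
import os

import runpod
from fastapi import BackgroundTasks, FastAPI

logger = logging.getLogger("jobs")
runpod.api_key = os.environ["RUNPOD_API_KEY"]
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")  # placeholder

app = FastAPI()


def submit_job(message: str) -> None:
    try:
        job = endpoint.run({"input": {"message": message}})  # async submit
        logger.info("submitted job, status: %s", job.status())
    except Exception:
        # Records the full traceback, which background tasks otherwise
        # reduce to a short error line.
        logger.exception("failed to submit job to RunPod")


@app.post("/enqueue")
def enqueue(message: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(submit_job, message)
    return {"status": "queued"}
```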

5090 error serverless

Does the vLLM image have an old PyTorch? I'm getting this: NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/...
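
A small diagnostic sketch that can be run inside the worker image to confirm which CUDA architectures the installed PyTorch build was compiled for (an RTX 5090 needs sm_120 in the list); nothing here is specific to the vLLM worker:

```python
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("compiled arch list:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    # (major, minor) compute capability of the attached GPU, e.g. (12, 0).
    print("device capability:", torch.cuda.get_device_capability(0))
```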

How to edit the vLLM settings on a serverless instance originally created with "quick deploy"?

I'm trying to figure out how to change vLLM settings on a serverless instance that isn't working quite right. There are a ton of tunables in the quick deploy dialog, but I can't figure out where to change them on an existing endpoint.
Solution:
Hmm, use quick deploy again and look at the env variables it sets, or check the vLLM worker GitHub repository for the env variables.
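
For illustration, a quick way to see which vLLM-related settings the endpoint is actually passing into the worker; the variable names below are recalled from the worker-vllm README and are assumptions to verify there:

```python
import os

# Run inside the worker (or check the endpoint's environment variables in the
# console) to see what is currently set.
for name in ("MODEL_NAME", "MAX_MODEL_LEN", "QUANTIZATION", "DTYPE"):
    print(name, "=", os.environ.get(name))
```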

Serverless TimeoutError: "Failed to get job"

Issue:
Getting repeated TimeoutError in RunPod Serverless with no clear cause (no GPU OOM or other errors).
- Error: Failed to get job. | Error Type: TimeoutError | Error Message: Runpod serverless
- Happens even with a 120s timeout; a single request takes at most 20 sec. Configuration: ...

🚨 Inconsistent Execution Time Across Workers for Same Input on L40s (48GB Pro) – Need Help

Hi everyone, I'm facing a strange issue with my RunPod endpoint set up using latentsync on L40s 48GB Pro with 10 workers. The problem is that the same input request is taking vastly different execution times across different workers. - Some workers complete the task in 10–15 minutes...

vLLM Dynamic Batching

Hi, I currently use a locally hosted exl2 setup but want to migrate my inference to RunPod serverless. My use case requires processing hundreds, sometimes thousands of prompts at the same time. I'm currently taking advantage of exl2's dynamic batching to figure out the optimal collating for batch processing. Does the vLLM backend support taking in thousands of prompts (some of which could be close to 4096 tokens long) through the OpenAI API, processing them as a job, and returning the results as a ba...
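
For what it's worth, vLLM's continuous batching operates across concurrent in-flight requests rather than a single batched call, so one common pattern is to fire the prompts concurrently and let the engine batch them. A sketch assuming the worker's OpenAI-compatible route; the base URL form, endpoint ID, and model name are placeholders to double-check against the vLLM worker docs:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)


async def complete(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="your-org/your-finetuned-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content


async def main() -> None:
    prompts = [f"Summarize document {i}." for i in range(100)]
    # Concurrent requests reach the engine together, so vLLM can batch them.
    results = await asyncio.gather(*(complete(p) for p in prompts))
    print(len(results), "completions")


asyncio.run(main())
```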

How Low-Latency Is the vLLM Worker (OpenAI-Compatible API)?

Hey team! I'm looking into using RunPod's vLLM worker via the serverless endpoint for real-time voice interactions. For this use case, minimizing time-to-first-token during streaming is critical. Does the OpenAI-compatible API layer introduce any noticeable latency, or is it optimized for low-latency responses? Using Llama 3, I've seen ~70ms latencies when running a vLLM server on a dedicated pod. Is similar performance achievable with the serverless setup, or is there any infrastructure-induced latency? If there is, could you point me toward a way to achieve my goal? RunPod autoscaling would be amazing for this project as it will handle large volumes of inference....
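
One way to answer this empirically is to time the first streamed chunk against the request start. A sketch using the OpenAI-compatible route; the endpoint ID and model name are placeholders:

```python
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
    max_tokens=32,
)

first_token_at = None
for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()
        print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
print(f"total: {(time.perf_counter() - start) * 1000:.0f} ms")
```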

Serverless Text Embedding - 400

I'm using a text embedding serverless endpoint to run an instance of "sentence-transformers/all-MiniLM-L6-v2". I keep getting a 400 Bad Request error. The old code I had (using the OpenAI SDK) stopped working, and I've tried to configure it based on the new documentation without any luck. Would greatly appreciate any help! New --------- runpod.api_key = os.getenv("RUNPOD_API_KEY")...
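
For comparison, this is roughly what a call through the RunPod Python SDK (rather than the OpenAI SDK) looks like; the input schema below is an assumption to check against the embedding worker's README, and the endpoint ID is a placeholder:

```python
import os

import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")  # placeholder endpoint ID

# The exact field names inside "input" depend on the embedding worker image;
# take them from its documentation.
result = endpoint.run_sync({
    "input": {
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "input": ["The quick brown fox", "jumps over the lazy dog"],
    }
})
print(result)
```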

Why aren't job IDs standard UUIDs?

When we create a job using RunPod, the returned job ID is not a standard UUID; instead, it's a UUID with a suffix appended. I would like to know the reason for this, and also how to standardize the job IDs. The reason I want this is that we store the job ID in our database, and it violates the UUID constraint...
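
Until there's an official answer, one hedged workaround is to treat the returned ID as an opaque string in the database and, where a UUID column is unavoidable, derive a deterministic UUID from it:

```python
import uuid

# Illustrative example of the shape described: a UUID plus an extra suffix.
job_id = "724907fe-7604-4d1f-8b10-8e8a1f1e3333-u1"

# Option 1: store the raw ID in a TEXT/VARCHAR column instead of a UUID column.
# Option 2: if a UUID column is required, derive a stable UUIDv5 from the job ID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "runpod-job-ids")
derived = uuid.uuid5(NAMESPACE, job_id)
print(derived)  # deterministic: the same job_id always maps to the same UUID
```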