Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

Serverless UI broken for some endpoints

Since the latest UI changes, clicking on some endpoints causes the Runpod logo to load indefinitely and the UI never appears. This seems to happen only with certain endpoints.

Need help fixing long-running deployments in serverless vLLM

Hi, I am trying to deploy the migtissera/Tess-3-Mistral-Large-2-123B model on serverless with 8 x 48 GB GPUs using vLLM. The total size of the model weights is around 245 GB. I have tried two approaches. The first is without any network volume: it takes a really long time to serve the first request because it needs to download the weights, and if the worker goes idle and I send a request again, it downloads the weights again and takes a long time....
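A minimal sketch of the usual workaround: stage the weights on a network volume once so workers read them from disk instead of re-downloading. The mount path /runpod-volume and the assumption that the vLLM worker honours the standard Hugging Face cache variable (HF_HUB_CACHE) are mine, not from the post; verify them against your endpoint's template.

```python
# Sketch: pre-download the model weights onto a Runpod network volume once,
# so serverless workers load them from disk instead of re-downloading.
# Assumptions: the volume is mounted at /runpod-volume (Runpod's usual
# serverless mount point) and the worker honours HF_HUB_CACHE pointing at
# the same directory.
from huggingface_hub import snapshot_download

CACHE_DIR = "/runpod-volume/hf-cache"  # set HF_HUB_CACHE=/runpod-volume/hf-cache on the endpoint

snapshot_download(
    repo_id="migtissera/Tess-3-Mistral-Large-2-123B",
    cache_dir=CACHE_DIR,
    allow_patterns=["*.safetensors", "*.json", "*.model", "*.txt"],  # skip redundant weight formats
)
```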

A job starts on one worker and seems to be relaunched on another worker.

Hi, I have set up an image that installs ComfyUI and some custom nodes, and as input I have a workflow. The entire workflow is supposed to take a few minutes to run (maybe 5-6 min on an A100), but strangely, it starts well and then, near the end, it stops on one worker and restarts on another worker.

delayTime reporting a negative value

On some requests I have started seeing a negative delayTime, and this affects my own autoscaler.
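A defensive sketch for an autoscaler that consumes delayTime, assuming the value is read from the job status payload in milliseconds; clamping keeps a spurious negative reading from skewing the scaling signal.

```python
# Sketch: guard an autoscaler against negative delayTime values.
# Assumes delay_time_ms comes from the job status payload ("delayTime", in ms).
def effective_delay_ms(delay_time_ms: int) -> int:
    # Treat negative readings (clock skew / reporting bug) as zero delay
    # so they don't drag the rolling average below the scale-up threshold.
    return max(0, delay_time_ms)
```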

Serverless quants

Hi, how do you specify a specific GGUF quant file from a Hugging Face repo when configuring a vLLM serverless endpoint? It only seems to let you specify the repo level.
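One workaround sketch, outside the quick-deploy UI: fetch just the quant file you want and point vLLM at the local path. The repo and filename below are placeholders, and whether the serverless vLLM template exposes a per-file setting is not confirmed here; this only shows what the underlying vLLM call needs.

```python
# Sketch: fetch one specific GGUF quant from a HF repo and point vLLM at it.
# Repo id and filename are placeholders; swap in the quant you actually want.
from huggingface_hub import hf_hub_download
from vllm import LLM

gguf_path = hf_hub_download(
    repo_id="TheBloke/SomeModel-GGUF",   # placeholder repo
    filename="somemodel.Q4_K_M.gguf",    # the exact quant file you want
)

# Depending on the vLLM version you may also need tokenizer= pointing at the base model repo.
llm = LLM(model=gguf_path)  # vLLM treats a local .gguf path as the model
```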

DeepSeek R1 Serverless for coding

I'm interested in running an FP16 DeepSeek R1 and I am wondering whether Serverless is the way to go or whether a Pod would be better. I need this for 2-3 hours at a time and I would like 'dedicated' access to this environment. Which DeepSeek R1 model should I pick (GGUF?), and how should I configure the deployment tool in Serverless to get it to run on an H100? Thanks in advance for any help....

In the Faster Whisper serverless endpoint, how do I get an English transcription for Tamil audio?

In the Faster Whisper serverless endpoint, how do I get an English transcription for Tamil audio? When I test it with Tamil audio, I get output like this; how do I get it in English?
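For reference, the underlying faster-whisper library produces English output from Tamil audio via task="translate". Whether the Runpod Faster Whisper worker's input schema exposes the same option is an assumption to check against its README; the sketch below only shows the library-level call.

```python
# Sketch: English output from Tamil audio with faster-whisper itself.
# The key is task="translate" (transcribe, then translate to English).
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")
segments, info = model.transcribe(
    "tamil_audio.mp3",
    language="ta",      # source language; can also be auto-detected
    task="translate",   # emit English instead of the source language
)
for segment in segments:
    print(segment.text)
```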

Stuck vLLM startup with 100% GPU utilization

Twice today I've deployed a new vLLM endpoint using the "Quick Deploy" "Serverless vLLM" option at https://www.runpod.io/console/serverless, only to have the worker get stuck after launching the vLLM process and before the weights download begins. It never reaches the state of actually downloading the HF model and loading it into vLLM.
* The model I've used is Qwen/Qwen2.5-72B-Instruct.
* The problematic machines have all been A6000s.
* Only a single worker, configured with 4 x 48 GB GPUs, was set in the template configuration, in order to make the problem easier to track down (a single pod and a single machine)....

How to respond to the requests at https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1

The OpenAI input is in the job input; I extracted it and processed the request. But when I send the response with yield or return, it isn't received as expected. Could you take a look at this: [https://github.com/mohamednaji7/runpod-workers-scripts/blob/main/empty_test/test%20copy%203.py] ...
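A minimal generator-handler sketch with the Runpod Python SDK, showing the yield/return mechanics only. The shape of job["input"] depends on your worker, so the field access below is a placeholder, not the linked script's logic.

```python
# Sketch: a generator handler that streams chunks back for requests hitting
# the /openai/v1 route. Field names inside job["input"] are placeholders --
# inspect your actual payload; only the yield/aggregation mechanics are the point.
import runpod

def handler(job):
    openai_request = job["input"]                # the OpenAI-style body arrives here
    messages = openai_request.get("messages", [])
    # ... run your model here ...
    for chunk in ["partial ", "response ", "text"]:
        yield chunk                              # streamed to /stream consumers

runpod.serverless.start({
    "handler": handler,
    # Aggregate yielded chunks so /run and /runsync callers still get a full result.
    "return_aggregate_stream": True,
})
```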

worker-vllm not working with beam search

Hi, I found another bug in your worker-vllm. Beam search is not supported even though your README says it is. This time around it's length_penalty not being accepted. Can you please work on a fix for beam search? Thanks!

All GPUs unavailable

I just started using RunPod. Yesterday, I created my first serverless endpoint and submitted a job, but I didn't receive a response. When I investigated the issue, I found that all GPUs were unavailable. The situation hasn't changed since then. Could you tell me what I should do?

/runsync returns "Pending" response

Hi, I've sent a request to my /runsync endpoint and it returned a {job... status:"pending"} response. Can someone clarify when this happens? Is it when the request is taking too long to complete?
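A sketch of the usual fallback when /runsync returns before the job finishes: keep the returned job id and poll /status until a terminal state. Endpoint ID, API key, and the example payload below are placeholders.

```python
# Sketch: fall back to polling /status when /runsync returns a non-final
# status (e.g. still queued or in progress). Endpoint ID and key are placeholders.
import time
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT = "YOUR_ENDPOINT_ID"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT}/runsync",
    headers=HEADERS,
    json={"input": {"prompt": "hello"}},
).json()

job_id = resp["id"]
while resp.get("status") not in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
    time.sleep(2)
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT}/status/{job_id}",
        headers=HEADERS,
    ).json()

print(resp.get("output"))
```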

Kicked Worker

Is there a webhook for the event that a worker is kicked? Or is there only the /health call, where we need to track the change in requests since the last /health call (i.e., tracking the change in failed requests)?
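A sketch of the polling approach the post describes: call /health periodically and diff the failed-jobs counter between polls. The response layout used below ({"jobs": {"failed": ...}}) is an assumption; print one real /health response first and adjust the keys.

```python
# Sketch: detect problems between /health polls by diffing counters.
# The response shape is assumed -- verify against a real /health payload.
import time
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT = "YOUR_ENDPOINT_ID"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

last_failed = None
while True:
    health = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT}/health", headers=HEADERS
    ).json()
    failed = health.get("jobs", {}).get("failed", 0)
    if last_failed is not None and failed > last_failed:
        print(f"failed jobs increased by {failed - last_failed} since last poll")
    last_failed = failed
    time.sleep(30)
```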

Possible to access the ComfyUI interface in serverless to fix custom node requirements?

Hi RunPod addicts! I have a functional ComfyUI install running in a Pod that I want to replicate serverlessly. My ComfyUI install is made for a specific workflow requiring 18 custom nodes. ...

How to truly see the status of an endpoint worker?

I'm trying out the vLLM serverless endpoints and am running into a lot of trouble. I was able to get responses from a running worker for a little while; then the worker went idle (as expected) and I tried sending a fresh request. That request has been stuck for minutes now and there's no sign that the worker is even starting up. The Runpod UI says the worker is "running", but there's nothing in the logs for the past 9 minutes (the last log line was from the previous worker exiting). My latest requests have been stuck for about 7 minutes each. How do I see the status of an endpoint worker if there's nothing in the logs and nothing in the telemetry? What does "running" mean if there are no logs or telemetry?...

How do I calculate the cost of my last execution on a serverless GPU?

For example, if one GPU type costs $0.00016 and the others cost $0.00019, how do I know which GPU the serverless endpoint actually picked once the request has completed? Also, is there an easy way to just get the cost of the last runsync request instead of calculating it manually?
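The arithmetic itself is simple once you have the job's status payload: convert executionTime to seconds and multiply by the per-second price of the GPU tier that served it. That executionTime is reported in milliseconds, and the exact field names, are assumptions to confirm against one real response.

```python
# Sketch: rough cost of the last /runsync call from its status payload.
# Assumes executionTime is reported in milliseconds and that you know the
# per-second price of the GPU tier the endpoint is configured with.
def job_cost_usd(status: dict, price_per_second: float) -> float:
    execution_ms = status.get("executionTime", 0)
    return (execution_ms / 1000.0) * price_per_second

# e.g. a 12.4 s run on a $0.00019/s worker:
print(job_cost_usd({"executionTime": 12_400}, 0.00019))  # -> ~0.0024
```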

Serverless deepseek-ai/DeepSeek-R1 setup?

How can I configure a serverless endpoint for deepseek-ai/DeepSeek-R1?

What is the best way to access more A100 and H100 GPUs?

Flux is about 25 GB. If I download that model to a network volume, I can only access GPUs in that region, and every time I check, A100 and H100 availability is LOW in all regions. If I instead download the Flux model into the container itself while building the Docker image, rather than using a network volume, then every new pod has to pull a ~25 GB Docker image. Could anyone please help me with this...
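One common middle ground, sketched under assumptions: keep the Docker image small and download the weights lazily to the network volume on first use, so later workers in that region reuse the cached copy. The mount path /runpod-volume and the example repo id are placeholders.

```python
# Sketch: download Flux weights to the network volume only if they are not
# already there, so the Docker image stays small and subsequent workers in
# the same region reuse the cached copy. Paths and repo id are placeholders.
import os
from huggingface_hub import snapshot_download

MODEL_DIR = "/runpod-volume/models/flux"   # assumed serverless mount point

def ensure_flux_weights() -> str:
    if not os.path.isdir(MODEL_DIR) or not os.listdir(MODEL_DIR):
        snapshot_download(
            repo_id="black-forest-labs/FLUX.1-dev",  # example repo
            local_dir=MODEL_DIR,
        )
    return MODEL_DIR
```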

Guidance on Mitigating Cold Start Delays in Serverless Inference

We are experiencing delays during cold starts of our serverless endpoint used for inference with a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model trained by us), which are fetched via the Hugging Face package within the Python code. We are exploring possible solutions and need guidance on feasibility and best practices. Additional context:
- The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays.
- The serverless platform is being used for inference as part of a production system requiring low latency....
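The usual mitigation is to stop fetching weights at request time: either bake them into the image at build time or stage them on a network volume (see the vLLM sketch earlier in this list). A minimal build-time sketch, assuming the custom Whisper weights live in a (possibly private) Hugging Face repo; the repo id and target path are hypothetical, and the script would be invoked from a RUN step in the Dockerfile.

```python
# Sketch: download the custom Whisper weights at image build time (run from a
# Dockerfile RUN step) so cold starts skip the Hugging Face download entirely.
# Repo id and target path are placeholders; pass an HF token for private repos.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/your-whisper-finetune",   # hypothetical custom model
    local_dir="/models/whisper",                # baked into the image layer
    token=os.environ.get("HF_TOKEN"),           # build secret for private repos
)
```

The handler then loads from /models/whisper at startup and never touches the network, at the cost of a larger image and a rebuild whenever the weights change.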