Why is this taking so long and why didn't RunPod time out the request?
Serverless endpoint: vLLM
Model: meta-llama/Llama-3.1-8B-Instruct
GPU: 48GB A40
I have a prompt that extracts facts from a meeting transcript. When I run it on o4-mini-2025-04-16 it completes in under 60 seconds (token usage attached).
When I run the same prompt on my RunPod serverless endpoint with the specs above, it has been running for over an hour. Logs, telemetry, and config are attached.
Questions:
1. Why is this taking so long?
2. The execution timeout is 600 seconds (10 mins) - why didn't RunPod time out the request?
3. I sent a couple more requests - why are they stuck behind the first on the same worker instead of being served by the remaining 2 inactive workers as per the queue delay setting?
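For context, the request shape is roughly the sketch below. I'm assuming the worker's OpenAI-compatible route here, and the endpoint ID, API key, transcript file, and the 2048-token cap are placeholders rather than my exact values:

```python
# Rough sketch of the request (placeholders, not my exact setup).
# Assumes the RunPod vLLM worker's OpenAI-compatible route; the explicit
# max_tokens cap is there so generation can't run on indefinitely.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # placeholder endpoint ID
    api_key="<RUNPOD_API_KEY>",                                   # placeholder key
)

transcript_text = open("transcript.txt").read()  # placeholder transcript

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Extract the key facts from this meeting transcript."},
        {"role": "user", "content": transcript_text},
    ],
    max_tokens=2048,   # placeholder cap; part of Q1 is what happens when this is left unset
    temperature=0.0,
)
print(resp.choices[0].message.content)
```

(Part of why I'm asking Q1 is whether leaving `max_tokens` unset could let generation run all the way to the context limit.)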
Really appreciate help here 🙏 been in a muddle with this for a while.
Thanks for the response @Jason.
1. The logs appear to show the model loaded pretty quickly and was then (supposedly) generating output for over an hour. Unless I'm mistaken?
2. Hmmm 🤔
3. So both RunPod and vLLM have their own queuing systems, and RunPod will only spin up a new worker once the vLLM instance's queue is at its max capacity? And that capacity is MAX_NUM_SEQS, which the RunPod vLLM template sets to 256 by default – so RunPod will only spin up a second worker once it has handed 256 concurrent requests to one vLLM instance. Is that right?
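If that's the mechanism, a quick sanity check would be firing a few async jobs and watching where they sit – rough sketch below. I'm assuming the standard `/run` and `/status` routes and that the vLLM worker takes a `prompt` plus `sampling_params` in `input`; those schema details are my assumption, not verified against the template.

```python
# Sketch: submit a few async jobs and poll their status to see whether they
# queue on RunPod's side (IN_QUEUE) or all get handed straight to one worker.
# The /run and /status routes and the {"prompt", "sampling_params"} input
# schema are assumptions here, not verified against the vLLM template.
import time
import requests

ENDPOINT_ID = "<ENDPOINT_ID>"      # placeholder
API_KEY = "<RUNPOD_API_KEY>"       # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

def submit(prompt: str) -> str:
    """Submit one async job and return its job ID."""
    payload = {
        "input": {
            "prompt": prompt,
            "sampling_params": {"max_tokens": 512, "temperature": 0.0},
        }
    }
    r = requests.post(f"{BASE}/run", json=payload, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["id"]

job_ids = [submit(f"Test request {i}: summarise this sentence.") for i in range(3)]

# Poll each job; statuses like IN_QUEUE vs IN_PROGRESS show where requests wait.
for _ in range(20):
    statuses = {
        jid: requests.get(f"{BASE}/status/{jid}", headers=HEADERS, timeout=30).json()["status"]
        for jid in job_ids
    }
    print(statuses)
    if all(s in ("COMPLETED", "FAILED", "CANCELLED") for s in statuses.values()):
        break
    time.sleep(5)
```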

