Runpod · 5mo ago
Ellroy

Why is this taking so long and why didn't RunPod time out the request?

Serverless endpoint: vLLM
Model: meta-llama/Llama-3.1-8B-Instruct
GPU: 48GB A40

I have a prompt to extract facts from a meeting transcript. When I run the prompt on o4-mini-2025-04-16 it takes <60 seconds with the following token usage:
{
  "prompt_tokens": 14037,
  "completion_tokens": 3095,
  "total_tokens": 17132
}
When I run the same prompt on my RunPod serverless endpoint with the above specs, it has been running for over an hour. Logs, telemetry & config attached.

Questions:
1. Why is this taking so long?
2. The execution timeout is 600 seconds (10 mins) - why didn't RunPod time out the request?
3. I sent a couple more requests - why are they stuck behind the first one on the same worker instead of being served by the remaining 2 inactive workers, as per the queue delay setting?

Really appreciate help here 🙏 I've been in a muddle with this for a while.
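For reference, here is a minimal sketch of how such a request could be submitted with a hard cap on generation length and a per-job execution timeout. It assumes the endpoint runs the standard worker-vllm image and accepts vLLM-style sampling_params; the endpoint ID and prompt are placeholders, so check the worker's README for the exact input schema.

import os
import requests

# Illustrative values; substitute your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "prompt": "Extract the key facts from the following meeting transcript:\n...",
        # vLLM-style sampling parameters; capping max_tokens keeps the model
        # from generating until it exhausts the context window.
        "sampling_params": {"max_tokens": 4096, "temperature": 0.2},
    },
    # Per-request execution policy (milliseconds), separate from the
    # endpoint-level execution timeout configured in the console.
    "policy": {"executionTimeout": 600_000},
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(resp.json())  # returns a job id; poll /status/<job_id> for the result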
2 Replies
Unknown User · 5mo ago
Message Not Public
Ellroy (OP) · 5mo ago
Thanks for the response @Jason.
1. The logs appear to show that the model loaded pretty quickly and that it was (supposedly) generating an output for over an hour. Unless I'm mistaken?
2. Hmmm 🤔
3. So both RunPod and vLLM have their own queuing systems, and RunPod will only spin up a new worker once the vLLM instance's queue is at its max capacity? And that capacity is MAX_NUM_SEQS, which by default on the RunPod vLLM template is set to 256 – so RunPod will only spin up a second worker once it has handed 256 concurrent requests to one vLLM instance. Is that right?
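To illustrate the mechanism being asked about (this is a sketch, not the actual worker code): assuming the worker registers a concurrency modifier with the RunPod Python SDK, as the open-source worker-vllm does, that hook is what tells the queue how many jobs a single worker will accept before another worker is needed. The limit below is hard-coded only to mirror the template's MAX_NUM_SEQS default.

import runpod

# Illustrative cap mirroring the template's MAX_NUM_SEQS default of 256;
# the real worker derives its limit from its own environment/config.
MAX_CONCURRENCY = 256

async def handler(job):
    # Placeholder for the actual vLLM generation call.
    return {"output": f"processed job {job['id']}"}

def concurrency_modifier(current_concurrency: int) -> int:
    # Advertise how many jobs this worker will accept at once. Until that
    # many jobs are in flight on this worker, new requests keep being routed
    # to it rather than triggering an additional worker to spin up.
    return MAX_CONCURRENCY

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})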
