Runpod · 12mo ago
3WaD

How long does it normally take to get a response from your vLLM endpoints on RunPod?

Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod vLLM image, but the job takes 30+ seconds each time. 99% of that is loading the engine and the model (counted as delay time), while the execution itself is under 1 s. FlashBoot is on. Is this normal, or is there a setting or something else I should check to make FlashBoot kick in? How long do your models and endpoints normally take to return a response?
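For reference, this is roughly how I'm calling the endpoint and reading the timings. It's just a sketch - the endpoint ID, API key, and input shape are placeholders for whatever your worker expects - but the delayTime/executionTime fields in the serverless response (reported in milliseconds, as far as I can tell) are where I'm getting these numbers from:

```python
import os
import requests

# Placeholders for illustration only - swap in your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Say hello in one sentence.",
            "sampling_params": {"max_tokens": 32},
        }
    },
    timeout=120,
)
data = resp.json()

# delayTime covers queueing plus cold start (engine/model load),
# executionTime is the inference itself; both are in milliseconds.
print("delayTime (ms):    ", data.get("delayTime"))
print("executionTime (ms):", data.get("executionTime"))
```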
11 Replies
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
It's the official vLLM worker selected in the RunPod dashboard. I only added the model name and chose Ray; otherwise everything should be at the defaults.
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
Nope
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
FlashBoot just doesn't seem to work with the Ray distributed executor backend, as I can see now. That makes sense, I guess. Ray is overkill for single-node inference anyway, so I'll stick with MP, which works. Still, good to know. I'll try to discourage everyone from using Ray with my custom image.
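For anyone who finds this later, the backend I'm talking about is just an engine argument in vLLM. A minimal sketch of what I mean by sticking to MP (the exact argument name may differ between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Sketch only: on a single node the multiprocessing backend ("mp") is enough;
# "ray" is meant for multi-node setups and, in my tests, was the setup
# where FlashBoot never kicked in.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    distributed_executor_backend="mp",
)

outputs = llm.generate(
    ["How long does a cold start take?"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```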
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
Then it would behave the same, since FlashBoot has no effect with Ray. That's what I meant. vLLM has two possible distributed executor backends, Ray and multiprocessing (MP), and you need one of them if you want to use vLLM's continuous batching and the RunPod worker's concurrency together.
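Roughly the shape I mean, as a simplified sketch rather than my actual image - the vLLM and RunPod SDK call signatures may differ between versions:

```python
import runpod
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# One shared async engine, so vLLM's continuous batching can pack
# concurrent requests together.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        distributed_executor_backend="mp",
    )
)

async def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=128)
    final = None
    # Stream results from the engine; keep only the last (complete) output.
    async for output in engine.generate(prompt, params, request_id=job["id"]):
        final = output
    return final.outputs[0].text

# concurrency_modifier lets one worker accept several jobs at once,
# which is what makes continuous batching worthwhile on serverless.
runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": lambda current: 8,
})
```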
Unknown User · 12mo ago
Message Not Public
Poddy · 12mo ago
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Unknown User · 12mo ago
Message Not Public
