Runpod · 12mo ago
3WaD

How long does it normally take to get a response from your vLLM endpoints on RunPod?

Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod vLLM image, but the job takes 30+ seconds each time. 99% of that is loading the engine and the model (counted as delay time), while the execution itself is under 1 s. FlashBoot is on. Is this normal, or is there a setting or something else I should check to make FlashBoot kick in? How long do your models and endpoints normally take to return a response?
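For reference, this is roughly how I'm calling the endpoint and reading the timings. It's just a sketch - the endpoint ID, API key, and input shape are placeholders for whatever your worker expects - but the delayTime/executionTime fields in the serverless response (reported in milliseconds, as far as I can tell) are where I'm getting these numbers from:

```python
import os
import requests

# Placeholders for illustration only - swap in your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Say hello in one sentence.",
            "sampling_params": {"max_tokens": 32},
        }
    },
    timeout=120,
)
data = resp.json()

# delayTime covers queueing plus cold start (engine/model load),
# executionTime is the inference itself; both are in milliseconds.
print("delayTime (ms):    ", data.get("delayTime"))
print("executionTime (ms):", data.get("executionTime"))
```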
11 Replies
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
It's the official vLLM worker selected in the RunPod dashboard. I only added the model name and chose Ray; otherwise everything should be at the defaults.
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
Nope
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
FlashBoot just doesn't seem to work with the Ray distributed executor backend, as I can see now. That makes sense, I guess. Ray is overkill for single-node inference anyway, so I'll stick with MP, which works. Still, good to know. I'll try to discourage everyone from using Ray with my custom image.
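For anyone who finds this later, the backend I'm talking about is just an engine argument in vLLM. A minimal sketch of what I mean by sticking to MP (the exact argument name may differ between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Sketch only: on a single node the multiprocessing backend ("mp") is enough;
# "ray" is meant for multi-node setups and, in my tests, was the setup
# where FlashBoot never kicked in.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    distributed_executor_backend="mp",
)

outputs = llm.generate(
    ["How long does a cold start take?"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```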
Unknown User · 12mo ago
Message Not Public
3WaD (OP) · 12mo ago
Then it would behave the same, since FlashBoot has no effect with Ray. That's what I meant. vLLM has two possible distributed executor backends, Ray and multiprocessing (MP), and you need one of them if you want to use vLLM's continuous batching and the RunPod worker's concurrency together.
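Roughly the shape I mean, as a simplified sketch rather than my actual image - the vLLM and RunPod SDK call signatures may differ between versions:

```python
import runpod
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# One shared async engine, so vLLM's continuous batching can pack
# concurrent requests together.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        distributed_executor_backend="mp",
    )
)

async def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=128)
    final = None
    # Stream results from the engine; keep only the last (complete) output.
    async for output in engine.generate(prompt, params, request_id=job["id"]):
        final = output
    return final.outputs[0].text

# concurrency_modifier lets one worker accept several jobs at once,
# which is what makes continuous batching worthwhile on serverless.
runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": lambda current: 8,
})
```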
Unknown User · 12mo ago
Message Not Public
Poddy · 12mo ago
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Unknown User · 12mo ago
Message Not Public
