Runpod • 8mo ago
morrow

How Low-Latency Is the vLLM Worker (OpenAI-Compatible API)?

Hey team! I'm looking into using Runpod's vLLM worker via a serverless endpoint for real-time voice interactions. For this use case, minimizing time-to-first-token during streaming is critical.

Does the OpenAI-compatible API layer introduce any noticeable latency, or is it optimized for low-latency responses?
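For context, here's a minimal sketch of how I'd be calling the serverless endpoint through the OpenAI-compatible layer, assuming the standard OpenAI Python client and the `/openai/v1` base-URL pattern (the endpoint ID and model name below are placeholders, not my actual setup):

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at the serverless endpoint's
# OpenAI-compatible route. The endpoint ID and model are placeholders.
client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
    max_tokens=32,
)

# Print tokens as they stream back.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```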

Using Llama 3, I've seen latencies of around 70 ms when running a vLLM server on a dedicated pod (see the measurement sketch below). Is similar performance achievable with the serverless setup, or is there any infrastructure-induced latency? If there is, could you point me toward a way to achieve my goal? Runpod's autoscaling would be great for this project, as it will need to handle large volumes of inference requests.
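For reference, this is roughly how I measured the ~70 ms on the dedicated pod: timing the gap between sending a streaming request and receiving the first non-empty token. The local base URL and model name are just what my pod happens to use; adjust them for your deployment:

```python
import time

from openai import OpenAI

# vLLM's OpenAI-compatible server running on the dedicated pod.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

t0 = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    max_tokens=32,
)

# Time-to-first-token: elapsed time until the first non-empty content delta.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft_ms = (time.perf_counter() - t0) * 1000
        print(f"time to first token: {ttft_ms:.1f} ms")
        break
```

I'd run the same script against the serverless OpenAI-compatible URL to compare, which is why I'm asking whether the extra layer adds measurable overhead.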

Note: Assume the instance is already warm and running. 👌