How Low-Latency Is the VLLM Worker (OpenAI-Compatible API)?
Hey team! I'm looking into using Runpod's VLLM worker via the serverless endpoint for real-time voice interactions. For this use case, minimizing time-to-first-token during streaming is critical.
Does the OpenAI-compatible API layer introduce any noticeable latency, or is it optimized for low-latency responses?
Using llama3, I've seen ~70ms latencies when running a VLLM server on a dedicated pod. Is similar performance achievable with the serverless setup, or is there any infrastructure-induced latency? If there is, could you point me toward a way to achieve my goal? Runpod autoscaling would be amazing for this project, as it will handle large volumes of inference requests.
Note: Assume the instance is already warm and running. 👌
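For reference, this is roughly how I measure time-to-first-token today, and how I'd point it at the serverless endpoint. Endpoint ID, API key, and model name are placeholders, and the base_url pattern is my guess at how the OpenAI-compatible route is exposed, so treat it as a sketch rather than a confirmed setup:

```python
# Rough TTFT measurement sketch against the OpenAI-compatible streaming route.
# Placeholders: YOUR_RUNPOD_API_KEY, YOUR_ENDPOINT_ID, and the model name.
# The base_url pattern is an assumption about how the serverless vLLM worker is exposed.
import time
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",  # placeholder
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",  # assumed route
)

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. the final usage chunk) may carry no choices, so guard first.
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()
        print(f"TTFT: {(first_token_at - start) * 1000:.1f} ms")
    # Keep consuming so total generation time can be compared as well.

print(f"Total: {(time.perf_counter() - start) * 1000:.1f} ms")
```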
3 Replies
Not sure, but it might add a little bit of latency, imo.
Didn't see the warm-instance note, so I edited my reply.
Don't take my word for it fully, though; the best way is to just test it with your model and active workers.
The OpenAI API layer may add a tiny bit of latency, since we actually rewrite the request and resubmit it for you. But our default method of request scaling should be capable of growing with you as well.
There will be a bit of extra latency with serverless: all requests go to a queue first and are then picked up by a worker, whereas with a pod you communicate with it directly.
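If you want to see how much of the end-to-end time is queue wait versus model execution, something like the sketch below can help. The /runsync route and the delayTime/executionTime fields are what I recall from the serverless job response, and the input payload shape depends on your worker's handler, so double-check against your endpoint's actual output:

```python
# Rough sketch for splitting queue wait from execution time on a serverless endpoint.
# Assumptions: the /runsync route, the delayTime/executionTime response fields (ms),
# and the input payload shape; verify all of these against your endpoint's response.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Say hello in one short sentence."}},  # payload shape depends on the worker
    timeout=60,
)
job = resp.json()

# delayTime ~ time the request sat in the queue before a worker picked it up,
# executionTime ~ time the worker spent running the request.
print("queue delay (ms):", job.get("delayTime"))
print("execution time (ms):", job.get("executionTime"))
```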