How Low-Latency Is the vLLM Worker (OpenAI-Compatible API)?
Hey team! I'm looking into using Runpod's vLLM worker via a serverless endpoint for real-time voice interactions. For this use case, minimizing time-to-first-token (TTFT) during streaming is critical.
Does the OpenAI-compatible API layer introduce any noticeable latency, or is it optimized for low-latency responses?
Using Llama 3, I've seen latencies of roughly 70 ms when running a vLLM server on a dedicated pod. Is similar performance achievable with the serverless setup, or does the infrastructure introduce additional latency? If it does, could you point me toward a way to achieve my goal? Runpod's autoscaling would be great for this project, since it will need to handle a large volume of inference requests.
Note: Assume the instance is already warm and running.
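For reference, here's roughly how I'm measuring TTFT on my end. This is a minimal sketch using the `openai` Python client against the serverless endpoint's OpenAI-compatible route; the base URL pattern, environment variable names, and model name are assumptions, so substitute whatever your endpoint actually serves.

```python
# Minimal TTFT measurement sketch against a Runpod serverless vLLM worker's
# OpenAI-compatible API. URL pattern, env vars, and model name are assumptions.
import os
import time

from openai import OpenAI

client = OpenAI(
    # Assumed URL pattern for the serverless OpenAI-compatible route.
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
    max_tokens=32,
)

ttft = None
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta and ttft is None:
        # First chunk containing generated text marks time-to-first-token.
        ttft = time.perf_counter() - start

if ttft is not None:
    print(f"TTFT: {ttft * 1000:.1f} ms")
else:
    print("No content received from the stream.")
```

Running this a few times against a warm worker (vs. the same measurement against the dedicated pod) is how I'd like to compare the two setups.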