morrow · 5mo ago

New load balancer serverless endpoint type questions

Hey team!

In the past, I've tried to use Runpod's queue-based serverless for my voice AI project, but the added job queue latency made it impossible. Voice AI requires sub-200 ms inference latency, and the queueing overhead was both large and unpredictable. That's fine for long-running jobs, but not for high-frequency / low-latency workloads.

This new load balancer serverless endpoint type looks amazing and seems to solve a real feature gap among GPU providers.

However, I'm missing some information:
  • Scaling algorithm: how does the autoscaler decide it's time to boot up a new pod? In my case I'd like it to scale on either the number of sessions per worker or the average time to first token.
  • How does the load balancer actually balance? Is there any way to implement sticky sessions, for instance? Especially in the vLLM example, it's better if the same conversation stays on the same worker 🙏
None of this appears to be documented, and I think these are pretty important parameters for a load balancer (a rough sketch of the behavior I'm after is below).
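To make that concrete, here's a minimal sketch of the routing and scaling behavior I have in mind. Everything in it is hypothetical: the worker names, the `conversation_id`, and the session ceiling are my own placeholders, not anything from Runpod's API.

```python
import hashlib

# Hypothetical worker pool; in practice this would be the endpoint's live workers.
WORKERS = ["worker-a", "worker-b", "worker-c"]

def pick_worker(conversation_id: str, workers: list[str]) -> str:
    """Sticky routing: the same conversation_id always maps to the same worker
    (while the worker set is unchanged), so a vLLM worker keeps that
    conversation's KV cache warm."""
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

def should_scale_up(sessions_per_worker: dict[str, int], max_sessions: int = 4) -> bool:
    """Session-count-based autoscaling trigger: boot a new pod once every
    worker is at or above the session ceiling."""
    return all(count >= max_sessions for count in sessions_per_worker.values())

if __name__ == "__main__":
    print(pick_worker("conv-1234", WORKERS))  # always the same worker for conv-1234
    print(should_scale_up({"worker-a": 4, "worker-b": 5, "worker-c": 4}))  # True -> scale up
```

The gist: route by hashing a conversation ID so the same worker handles the whole conversation, and scale up once every worker hits a session ceiling (or, alternatively, when average time to first token degrades).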

Waiting for some guidance on this, as it's the only thing preventing us from migrating our infra to it 🙂