New load balancer serverless endpoint type questions
Hey team!
In the past, I've tried to use RunPod's queue-based serverless for my voice AI project, but the added job-queue latency made it unworkable. Voice AI requires sub-200ms inference latency, and the queue overhead was both large and unpredictable. That model is fine for long-running jobs, but not for high-frequency, low-latency workloads.
This new load balancer serverless endpoint type looks amazing and seems to fill a real gap in the GPU provider space.
However, I'm missing some information:
- Scaling algorithm: how does the autoscaler decide it's time to boot a new pod? In my case I'd like to scale on either the number of sessions per worker or the average time to first token.
- How does the load balancer actually balance? Is there any way to implement sticky sessions, for instance? Especially in the vLLM example, it's better if the same conversation stays on the same worker (rough sketch of what I mean below).
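To make the sticky-session ask concrete, here's a minimal client-side sketch of what I mean; the worker URLs and how you'd discover them are made up on my side (not a real RunPod API), and ideally the load balancer itself would do this:

```python
import hashlib

def pick_worker(session_id: str, worker_urls: list[str]) -> str:
    """Deterministically map a conversation id to one worker URL."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return worker_urls[int(digest, 16) % len(worker_urls)]

# Hypothetical worker URLs, just for illustration.
workers = [
    "https://worker-a.example.net",
    "https://worker-b.example.net",
]

# Every request for conversation "conv-123" lands on the same worker,
# so that worker's vLLM KV cache for the conversation stays warm.
print(pick_worker("conv-123", workers))
print(pick_worker("conv-123", workers))  # same URL as above
```

(Simple modulo hashing like this breaks affinity whenever the worker set changes, which is exactly why I'd rather the load balancer handle it.)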
None of this appears to be documented, and these seem like pretty important parameters for a load balancer.
Waiting for some guidance on this, as it's the only thing preventing us from migrating our infra to it.
8 Replies
Unknown Userβ’2w ago
I couldn't see any of this in the "load balancer" endpoint creation form. I do remember these settings are available for regular "job queue" serverless endpoints though.
Unknown Userβ’2w ago
Ah ok, I need to deploy first and then edit.
Unknown Userβ’2w ago
Do you think I could use the API to programmatically start or stop workers based on my own metrics? Something along the lines of the sketch below.
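Roughly what I have in mind, purely as a sketch on my side; the route, auth header, and field name below are guesses, not something I found in the docs:

```python
import os
import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "my-lb-endpoint-id"  # placeholder

def set_min_workers(min_workers: int) -> None:
    """Bump the endpoint's minimum worker count via some endpoint-update API (assumed route/fields)."""
    resp = requests.patch(
        f"https://rest.runpod.io/v1/endpoints/{ENDPOINT_ID}",   # assumed route
        headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},  # assumed auth scheme
        json={"workersMin": min_workers},                       # assumed field name
        timeout=10,
    )
    resp.raise_for_status()

def average_ttft_ms() -> float:
    """Placeholder for my own metric collection (e.g. scraped from my sessions)."""
    return 180.0

# Scale up before time-to-first-token crosses our 200ms budget.
if average_ttft_ms() > 200:
    set_min_workers(3)
```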
Unknown Userβ’2w ago
- so far, no sticky session support
- no support for programmatically starting or stopping workers; this is interesting to explore, but we still want to avoid any added latency, and any external metric introduces higher latency