How to configure auto scaling for load balancing endpoints?
From the documentation: "The method used to scale up workers on the created Serverless endpoint. If QUEUE_DELAY, workers are scaled based on a periodic check to see if any requests have been in queue for too long. If REQUEST_COUNT, the desired number of workers is periodically calculated based on the number of requests in the endpoint's queue. Use QUEUE_DELAY if you need to ensure requests take no longer than a maximum latency, and use REQUEST_COUNT if you need to scale based on the number of requests."
From what I understand, load balancing endpoints don't have a queue? How do I configure auto scaling to work with these serverless endpoints?
@emilwallner hey, what do you mean by that? When no worker is available, your requests are automatically queued until a worker becomes available
The reason they give you those two options is:
- You have a fast operation (so you want to base scaling on QUEUE_DELAY)
- You have a long operation (so you want to use REQUEST_COUNT)
So load balancing endpoints also have a queue? With several endpoints like /generate /search /enhance, is the REQUEST_COUNT based on all the requests the server receives?
For example, if you are using FastAPI?
Ooooh, no
I thought you meant a serverless endpoint, which is what they call each "serverless project"
You can duplicate your project, and in your code trigger a different "serverless endpoint" based on the URL
that way you will be able to better track each operation's duration
otherwise all your data will be merged, and your metrics will be impacted
That's why I recommend splitting each operation into its own "serverless endpoint" (rough sketch below)
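Something like this on the client side, untested sketch: one serverless endpoint per operation, picked by the operation name. The endpoint IDs, the `ENDPOINTS` mapping, and the `run_operation` helper are all placeholders I made up, and the payload shape depends on your handler; only the `runpod` SDK calls themselves are from the SDK docs.

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # better: load from an env var

# hypothetical endpoint IDs, one duplicated project per operation
ENDPOINTS = {
    "generate": "abc123generate",
    "search": "def456search",
    "enhance": "ghi789enhance",
}

def run_operation(operation: str, payload: dict) -> dict:
    """Send the payload to the serverless endpoint dedicated to this operation."""
    endpoint = runpod.Endpoint(ENDPOINTS[operation])
    # run_sync blocks until the worker returns a result (or the timeout hits);
    # check the current SDK docs for the exact signature
    return endpoint.run_sync({"input": payload}, timeout=60)

result = run_operation("generate", {"prompt": "a red bicycle"})
print(result)
```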
Why not load balancing endpoints with several routes? I have several models on one server, and splitting them would add a lot of cost
When you talk about endpoints, do you mean your FastAPI endpoints?
And what do you mean by "on one server"?
Yeah, FastAPI, several endpoints, e.g. /generate /search /enhance, in one Docker image
Ok, so here's why I don't recommend this:
(And thanks for the precise description)
When you want to generate something, you'll load a model onto a GPU, and it will be different from the one used for search or enhance, am I correct?
All on the same GPU
For example /generate (model.pth), /search (model_search.pth and model.pth), and /enhance (model_enhance.pth)
These models take time to load, and you pay for that time. Once a model is loaded on your GPU, it won't have to be loaded into VRAM again each time you trigger a worker.
If you put multiple models on the same GPU, you can hit the max VRAM capacity, so some models get unloaded; then each operation has to reload its model (so it's slower and costs more)
They all fit in the VRAM
All of them AT THE SAME TIME?
Yeah
Okok, then it's fine for this.
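For reference, a rough sketch of that single-image setup: load every model once at startup so they stay resident in VRAM for the life of the worker, instead of being reloaded per request. This assumes PyTorch plus FastAPI; the file names are just the placeholders from this thread, and the handler bodies are stubs.

```python
import torch
from fastapi import FastAPI

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load every model once at import time so they stay in VRAM across requests
model = torch.load("model.pth", map_location=device)
model_search = torch.load("model_search.pth", map_location=device)
model_enhance = torch.load("model_enhance.pth", map_location=device)

@app.post("/generate")
def generate(payload: dict):
    # placeholder: run your real /generate pipeline with `model` here
    return {"output": "..."}

@app.post("/search")
def search(payload: dict):
    # placeholder: uses model_search and model per the setup described above
    return {"output": "..."}

@app.post("/enhance")
def enhance(payload: dict):
    # placeholder: uses model_enhance
    return {"output": "..."}
```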
Now, if you have different operations, some longer than others, the longer ones might block the shorter ones, since they'll end up queued behind them
let's say you have a 120s operation and a 2s one; if you put them in the same Docker image, the 2s one might be queued for 120s
and the third reason: if you split your operations, you can use cheaper GPUs for the lighter ones
All are between 10-100ms
Ok, so in the end you don't need a per-endpoint queue?
the RunPod queuing system will be enough
I need it to scale once it hits, say, 30 requests per second
by setting a higher max worker count and tuning REQUEST_COUNT or QUEUE_DELAY, RunPod will automatically boot new workers to handle your requests
and load balance them
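You can set those fields in the console, or script it. Here's a hedged sketch using `runpod.create_endpoint` from the Python SDK: the parameter names are my best recollection of the SDK and may differ, so double-check the current docs; the endpoint name, template ID, GPU pool, and scaler value are all example values.

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# verify parameter names against the current runpod-python SDK / docs
new_endpoint = runpod.create_endpoint(
    name="generate-search-enhance",   # hypothetical endpoint name
    template_id="YOUR_TEMPLATE_ID",   # template pointing at your Docker image
    gpu_ids="AMPERE_24",              # example GPU pool
    workers_min=0,
    workers_max=10,                   # higher ceiling so scaling has room
    scaler_type="REQUEST_COUNT",      # or "QUEUE_DELAY" to cap request latency
    scaler_value=4,                   # e.g. target ~4 queued requests per worker
)
print(new_endpoint)
```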
nice!
you can try it yourself: set 2 max workers and send 4 requests; in the "request" panel you should see that they're triggered on different workers (GPUs)
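A quick way to reproduce that test is to fire 4 requests in parallel at the endpoint's /run URL and then look at the request panel. The endpoint ID, API key, and payload shape below are placeholders for your setup; the URL pattern and bearer auth are from the RunPod serverless API.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def send(i: int) -> dict:
    # /run enqueues the job asynchronously and returns a job id right away
    resp = requests.post(URL, headers=HEADERS, json={"input": {"request": i}}, timeout=30)
    return resp.json()

# send 4 requests concurrently so they land on different workers
with ThreadPoolExecutor(max_workers=4) as pool:
    for job in pool.map(send, range(4)):
        print(job)
```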
cool!