How to configure auto scaling for load balancing endpoints?
From the documentation: "The method used to scale up workers on the created Serverless endpoint. If QUEUE_DELAY, workers are scaled based on a periodic check to see if any requests have been in queue for too long. If REQUEST_COUNT, the desired number of workers is periodically calculated based on the number of requests in the endpoint's queue. Use QUEUE_DELAY if you need to ensure requests take no longer than a maximum latency, and use REQUEST_COUNT if you need to scale based on the number of requests."
From what I understand, load balancing endpoints don't have a queue? How do I configure auto scaling to work with these serverless endpoints?
@emilwallner hey, what do you mean by that? When no worker is available, your requests are automatically queued until a worker becomes available
The reason they give you those two options is:
- You have a fast operation (so you want to base scaling on QUEUE_DELAY)
- You have a long operation (so you want to use REQUEST_COUNT)
So load balancing endpoints also have a queue? With several endpoints like /generate /search /enhance, is the REQUEST_COUNT based on all the requests the server receives?
For example, if you are using FastAPI?
Ooooh, no
I thought you meant a serverless endpoint, which is what they call each "serverless project"
You can duplicate your project, and in your code trigger a different "serverless endpoint" based on the URL
that way you will be able to better track each operation's duration
otherwise all your data will be merged, and your metrics will be impacted
That's why I recommend splitting each operation into its own "serverless endpoint" (rough sketch below)
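Something like this on the client side, untested sketch: one serverless endpoint per operation, picked by the operation name. The endpoint IDs, the `ENDPOINTS` mapping, and the `run_operation` helper are all placeholders I made up, and the payload shape depends on your handler; only the `runpod` SDK calls themselves are from the SDK docs.

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # better: load from an env var

# hypothetical endpoint IDs, one duplicated project per operation
ENDPOINTS = {
    "generate": "abc123generate",
    "search": "def456search",
    "enhance": "ghi789enhance",
}

def run_operation(operation: str, payload: dict) -> dict:
    """Send the payload to the serverless endpoint dedicated to this operation."""
    endpoint = runpod.Endpoint(ENDPOINTS[operation])
    # run_sync blocks until the worker returns a result (or the timeout hits);
    # check the current SDK docs for the exact signature
    return endpoint.run_sync({"input": payload}, timeout=60)

result = run_operation("generate", {"prompt": "a red bicycle"})
print(result)
```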
Why not load balancing endpoints with several routes? I have several models on one server, and splitting them would add a lot of cost
When you talk about endpoints, do you mean your FastAPI endpoints?
And what do you mean by "on one server"?
Yeah, FastAPI, several endpoints, e.g. /generate /search /enhance, in one Docker image
Ok, so here's why I don't recommend this:
(And thanks for the precise description)
When you want to generate something, you'll load a model onto a GPU, and it will be different from the one used for search or enhance, am I correct?
All on the same GPU
For example /generate (model.pth), /search (model_search.pth and model.pth), and /enhance (model_enhance.pth)
These models take time to load, and you pay for that time. Once a model is loaded on your GPU, it won't have to be loaded into VRAM again each time you trigger a worker.
If you put multiple models on the same GPU, you can hit the max VRAM capacity, so some models get unloaded; then each operation has to reload its model (so it's slower and costs more)
They all fit in the VRAM
All of them AT THE SAME TIME?
Yeah
Okok, then it's fine for this.
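For reference, a rough sketch of that single-image setup: load every model once at startup so they stay resident in VRAM for the life of the worker, instead of being reloaded per request. This assumes PyTorch plus FastAPI; the file names are just the placeholders from this thread, and the handler bodies are stubs.

```python
import torch
from fastapi import FastAPI

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load every model once at import time so they stay in VRAM across requests
model = torch.load("model.pth", map_location=device)
model_search = torch.load("model_search.pth", map_location=device)
model_enhance = torch.load("model_enhance.pth", map_location=device)

@app.post("/generate")
def generate(payload: dict):
    # placeholder: run your real /generate pipeline with `model` here
    return {"output": "..."}

@app.post("/search")
def search(payload: dict):
    # placeholder: uses model_search and model per the setup described above
    return {"output": "..."}

@app.post("/enhance")
def enhance(payload: dict):
    # placeholder: uses model_enhance
    return {"output": "..."}
```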
Now, if you have different operations, some longer than others, the longer ones might block the shorter ones, since they'll end up queued behind them
let's say you have a 120s operation and a 2s one; if you put them in the same Docker image, the 2s one might be queued for 120s
and the third reason: if you split your operations, you can use cheaper GPUs for the lighter ones
All are between 10-100ms
Ok, so in the end you don't need a per-endpoint queue?
the RunPod queuing system will be enough
I need it to scale once it hits, say, 30 requests per second
by setting a higher max worker count and tuning REQUEST_COUNT or QUEUE_DELAY, RunPod will automatically boot new workers to handle your requests
and load balance them
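You can set those fields in the console, or script it. Here's a hedged sketch using `runpod.create_endpoint` from the Python SDK: the parameter names are my best recollection of the SDK and may differ, so double-check the current docs; the endpoint name, template ID, GPU pool, and scaler value are all example values.

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# verify parameter names against the current runpod-python SDK / docs
new_endpoint = runpod.create_endpoint(
    name="generate-search-enhance",   # hypothetical endpoint name
    template_id="YOUR_TEMPLATE_ID",   # template pointing at your Docker image
    gpu_ids="AMPERE_24",              # example GPU pool
    workers_min=0,
    workers_max=10,                   # higher ceiling so scaling has room
    scaler_type="REQUEST_COUNT",      # or "QUEUE_DELAY" to cap request latency
    scaler_value=4,                   # e.g. target ~4 queued requests per worker
)
print(new_endpoint)
```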
nice!
you can try it yourself: set 2 max workers and send 4 requests; in the "request" panel you should see that they're triggered on different workers (GPUs)
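A quick way to reproduce that test is to fire 4 requests in parallel at the endpoint's /run URL and then look at the request panel. The endpoint ID, API key, and payload shape below are placeholders for your setup; the URL pattern and bearer auth are from the RunPod serverless API.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def send(i: int) -> dict:
    # /run enqueues the job asynchronously and returns a job id right away
    resp = requests.post(URL, headers=HEADERS, json={"input": {"request": i}}, timeout=30)
    return resp.json()

# send 4 requests concurrently so they land on different workers
with ThreadPoolExecutor(max_workers=4) as pool:
    for job in pool.map(send, range(4)):
        print(job)
```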
cool!