RunPod•15mo ago

Serverless calculating capacity & ideal request count vs. queue delay values

How do you calculate whether a serverless worker is reaching its capacity, and what value should I set for request count? One of my production serverless workers, which runs regular Oobabooga (not vLLM, so no concurrency), handled 110k requests yesterday without starting a new worker. From what I observe, my context length is usually about 1000 input tokens and 10-70 output tokens, and each request usually takes 2-5 s. Even at 1 s of execution time per request, it should only have been able to handle 86,400 requests per day. How can it handle more without increasing the worker count, especially when each request takes 2-5 s?
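The arithmetic behind that question can be sketched in a few lines of Python. This is only a back-of-the-envelope check, assuming each worker handles strictly one request at a time and that load is spread evenly across the day; the function names are hypothetical and nothing here calls a RunPod API.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def max_requests_per_day(seconds_per_request: float, workers: int = 1) -> float:
    """Upper bound on daily requests if every worker is busy all day, one request at a time."""
    return workers * SECONDS_PER_DAY / seconds_per_request

def min_workers_needed(requests_per_day: int, seconds_per_request: float) -> int:
    """Smallest number of sequential workers whose combined busy time covers the load."""
    busy_seconds = requests_per_day * seconds_per_request
    return math.ceil(busy_seconds / SECONDS_PER_DAY)

if __name__ == "__main__":
    observed = 110_000  # requests per day, from the question
    for secs in (1, 2, 5):
        print(
            f"{secs}s/request: one worker caps at {max_requests_per_day(secs):,.0f}/day, "
            f"{observed:,} requests need at least {min_workers_needed(observed, secs)} worker(s)"
        )
```

Under these assumptions, 110k requests per day only fits on one sequential worker if the average execution time is below roughly 0.79 s (86,400 / 110,000), so the observed count does not match a single worker taking 2-5 s per request.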

2 Replies

octopusOP•15mo ago

@flash-singh any idea?

flash-singh•15mo ago

if your max worker count is low, a good metric to look out for is delay time, which shows how long a request waits in the queue before a worker picks it up
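As a rough illustration of watching that metric, the sketch below summarizes a list of per-request queue delays. It assumes you have already collected the delay values in milliseconds from wherever you read them; the function names, the 1000 ms threshold, and the sample numbers are all hypothetical, and no RunPod API is called.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize_delays(delays_ms, threshold_ms=1000):
    """Summarize queue delays; threshold_ms is an arbitrary example cutoff, not a recommended value."""
    over = sum(1 for d in delays_ms if d > threshold_ms)
    return {
        "p50_ms": percentile(delays_ms, 50),
        "p95_ms": percentile(delays_ms, 95),
        "share_over_threshold": over / len(delays_ms),
    }

if __name__ == "__main__":
    # Made-up sample delays in milliseconds, for illustration only.
    sample = [100, 200, 300, 400, 5000]
    print(summarize_delays(sample))
```

If the high percentiles or the share over the threshold keep growing while the worker count stays flat, requests are likely waiting in the queue for a free worker.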
