Runpod · 16mo ago
pxmwxd

Serverless doesn't scale

Endpoint id: cilhdgrs7rbzya. I have some requests that require workers with 4x RTX 4090s. The endpoint's "Max Workers" is 150 and the "Request Count" in Scale Type is 1. When I sent 78 requests concurrently, only ~20% of them started within 10s; the P80 wait time was ~600s. Is this because there aren't enough GPUs? When the stock status shows "Availability: High", how many workers can I expect to scale up in the meantime?
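For reference, a minimal sketch of this kind of concurrency test, assuming the runpod Python SDK (`pip install runpod`); the API key, the payload shape, and the polling loop are illustrative, not taken from the original post:

```python
# Sketch: send 78 concurrent requests and measure per-request queue delay.
# Assumes the runpod Python SDK; payload shape below is hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder
endpoint = runpod.Endpoint("cilhdgrs7rbzya")

def timed_request(i):
    start = time.time()
    job = endpoint.run({"input": {"request_id": i}})  # hypothetical payload
    # Poll until the job leaves the queue (i.e. a worker picked it up).
    while job.status() == "IN_QUEUE":
        time.sleep(1)
    return time.time() - start

with ThreadPoolExecutor(max_workers=78) as pool:
    delays = sorted(pool.map(timed_request, range(78)))

p80 = delays[int(0.8 * len(delays)) - 1]
print(f"started within 10s: {sum(d <= 10 for d in delays)}/78")
print(f"P80 queue delay: {p80:.0f}s")
```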
10 Replies
Unknown User · 16mo ago
(message not public)
yhlong00000 · 16mo ago
I think using request count is great for handling a steady or predictable increase in request volume. Setting the count to 1 will immediately add workers, which I agree should work. However, for burst traffic, queue delay might work better: you define the maximum time a job may wait in the queue, ensuring jobs don't sit longer than that before they get processed.
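For context, a sketch of switching an endpoint to queue-delay scaling, assuming the runpod Python SDK's `create_endpoint` helper; the parameter names (`scaler_type`, `scaler_value`, `workers_max`) and the GPU pool id are from the SDK at the time of writing and may differ in newer versions:

```python
# Sketch: create an endpoint that scales on queue delay instead of request count.
# Parameter names and the gpu_ids value are assumptions based on the SDK docs.
import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder

endpoint = runpod.create_endpoint(
    name="burst-endpoint",
    template_id="YOUR_TEMPLATE_ID",  # placeholder
    gpu_ids="ADA_24",            # 24GB Ada pool (e.g. 4090); value is an assumption
    scaler_type="QUEUE_DELAY",   # scale on how long jobs wait in the queue
    scaler_value=10,             # target: no job waits more than ~10s
    workers_min=0,
    workers_max=150,
)
print(endpoint)
```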
flash-singh · 16mo ago
are you asking for 4x 4090s in 1 worker?
Unknown User · 16mo ago
(message not public)
pxmwxd (OP) · 16mo ago
Not cold start time. Delay time is high; it can even reach ~600s.
Unknown User · 16mo ago
(message not public)
flash-singh · 16mo ago
@pxmwxd can I ask why you need 4x 4090s in one worker? That will impact scale: even if we have plenty of 4090s, wanting 4x per worker limits which machines can host you, since most are 2x or 4x and the 8x ones are rare. What's likely happening during scale-up is that you're getting throttled. PM me the endpoint id and I can check to make sure this is the case.

2x A6000 will give you easier scale. The higher you increase GPU count per worker, the more likely you are to see higher delay times. I can also see if we can optimize this for you.

I've resolved the issue. For future reference, for anyone else scaling this big: you will hit a $40/hr spending limit even on serverless, and the only way to raise it is to reach out to us so you can scale beyond that. This also means we need to do a better job of surfacing that, possibly in the logs.
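A quick worked example of why the spending cap, not GPU stock, can be the bottleneck; the per-GPU rate below is a placeholder, not Runpod's actual 4090 price:

```python
# Sketch: how a $40/hr spending cap bounds concurrent workers.
# GPU_RATE_PER_HR is hypothetical; substitute the real serverless rate.
SPEND_CAP_PER_HR = 40.00
GPU_RATE_PER_HR = 0.69   # placeholder $/hr per RTX 4090
GPUS_PER_WORKER = 4

worker_rate = GPUS_PER_WORKER * GPU_RATE_PER_HR
max_workers = int(SPEND_CAP_PER_HR // worker_rate)
print(f"max concurrent workers under the cap: {max_workers} "
      f"(~${worker_rate:.2f}/hr each)")  # far below the 150 'Max Workers' setting
```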
marcchen955 · 16mo ago
Is there any doc link about the $40/hr limitation? I'm researching modal.com as a replacement for Runpod, and the first-priority concern is the GPU concurrency limit (which on modal.com is 30 for Pro users).
Unknown User · 16mo ago
(message not public)
flash-singh · 16mo ago
We can increase that if needed; reach out to support.
