Runpod · 16mo ago
pxmwxd

Serverless doesn't scale

Endpoint id: cilhdgrs7rbzya. I have some requests that require workers with 4x RTX 4090s. The endpoint's "Max Workers" is 150 and the "Request Count" in Scale Type is 1. When I sent 78 requests concurrently, only ~20% of them started within 10s; the P80 wait time was ~600s. Is this because there aren't enough GPUs? When the stock status shows "Availability: High", how many workers can I expect to scale up in the meantime?
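For reference, a minimal sketch of this kind of concurrency test, assuming the runpod Python SDK (`pip install runpod`); the API key, the payload shape, and the polling loop are illustrative, not taken from the original post:

```python
# Sketch: send 78 concurrent requests and measure per-request queue delay.
# Assumes the runpod Python SDK; payload shape below is hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder
endpoint = runpod.Endpoint("cilhdgrs7rbzya")

def timed_request(i):
    start = time.time()
    job = endpoint.run({"input": {"request_id": i}})  # hypothetical payload
    # Poll until the job leaves the queue (i.e. a worker picked it up).
    while job.status() == "IN_QUEUE":
        time.sleep(1)
    return time.time() - start

with ThreadPoolExecutor(max_workers=78) as pool:
    delays = sorted(pool.map(timed_request, range(78)))

p80 = delays[int(0.8 * len(delays)) - 1]
print(f"started within 10s: {sum(d <= 10 for d in delays)}/78")
print(f"P80 queue delay: {p80:.0f}s")
```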
10 Replies
Unknown User · 16mo ago
(message not public)
yhlong00000 · 16mo ago
I think using request count is great for handling a steady or predictable increase in request volume. Setting the count to 1 will immediately add workers, which I agree should work. However, for burst traffic, queue delay might work better: you define the maximum time a job may wait in the queue, ensuring jobs don't sit longer than that before they get processed.
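For context, a sketch of switching an endpoint to queue-delay scaling, assuming the runpod Python SDK's `create_endpoint` helper; the parameter names (`scaler_type`, `scaler_value`, `workers_max`) and the GPU pool id are from the SDK at the time of writing and may differ in newer versions:

```python
# Sketch: create an endpoint that scales on queue delay instead of request count.
# Parameter names and the gpu_ids value are assumptions based on the SDK docs.
import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder

endpoint = runpod.create_endpoint(
    name="burst-endpoint",
    template_id="YOUR_TEMPLATE_ID",  # placeholder
    gpu_ids="ADA_24",            # 24GB Ada pool (e.g. 4090); value is an assumption
    scaler_type="QUEUE_DELAY",   # scale on how long jobs wait in the queue
    scaler_value=10,             # target: no job waits more than ~10s
    workers_min=0,
    workers_max=150,
)
print(endpoint)
```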
flash-singh · 16mo ago
are you asking for 4x 4090s in 1 worker?
Unknown User · 16mo ago
(message not public)
pxmwxd (OP) · 16mo ago
Not cold start time. Delay time is high; it can even reach ~600s.
Unknown User · 16mo ago
(message not public)
flash-singh · 16mo ago
@pxmwxd can I ask why you need 4x 4090s in one worker? That will impact scale: even if we have plenty of 4090s, wanting 4x per worker limits which machines can host you, since most are 2x or 4x and the 8x ones are rare. What's likely happening during scale-up is that you're getting throttled. PM me the endpoint id and I can check to make sure this is the case.

2x A6000 will give you easier scale. The higher you increase GPU count per worker, the more likely you are to see higher delay times. I can also see if we can optimize this for you.

I've resolved the issue. For future reference, for anyone else scaling this big: you will hit a $40/hr spending limit even on serverless, and the only way to raise it is to reach out to us so you can scale beyond that. This also means we need to do a better job of surfacing that, possibly in the logs.
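A quick worked example of why the spending cap, not GPU stock, can be the bottleneck; the per-GPU rate below is a placeholder, not Runpod's actual 4090 price:

```python
# Sketch: how a $40/hr spending cap bounds concurrent workers.
# GPU_RATE_PER_HR is hypothetical; substitute the real serverless rate.
SPEND_CAP_PER_HR = 40.00
GPU_RATE_PER_HR = 0.69   # placeholder $/hr per RTX 4090
GPUS_PER_WORKER = 4

worker_rate = GPUS_PER_WORKER * GPU_RATE_PER_HR
max_workers = int(SPEND_CAP_PER_HR // worker_rate)
print(f"max concurrent workers under the cap: {max_workers} "
      f"(~${worker_rate:.2f}/hr each)")  # far below the 150 'Max Workers' setting
```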
marcchen955 · 16mo ago
Is there any doc link about the $40/hr limitation? I'm researching modal.com as a replacement for Runpod, and the first-priority concern is the GPU concurrency limit (which on modal.com is 30 for Pro users).
Unknown User · 16mo ago
(message not public)
flash-singh · 16mo ago
We can increase that if needed; reach out to support.
