RunPod4mo ago
AC_pill

Is there a programmatic way to activate servers on high demand / peak hours load?

We are testing serverless for a production deployment next month. I want to ensure we will have capacity during peak hours. We'll have some active servers, but we need to guarantee capacity for certain peak hours. Is there a way to programmatically activate the servers?
16 Replies
justin
justin4mo ago
There is, let me find it...
justin
justin4mo ago
https://github.com/ashleykleynhans/runpod-api/blob/main/serverless/update_min_workers.py I've never gotten around to automating it, but I have manually tested that setting minimum workers does seem to give you some sort of stronger priority in their system
justin
justin4mo ago
So my thought was always to potentially use this wrapper around their GraphQL endpoint to programmatically toggle minimum active workers. It isn't "instantly" on, but from my anecdotal testing it's still probably in the 1-2 minute range for workers to go from a throttled state to an Active state and stay there. Maybe faster from idle > active.
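For reference, a minimal sketch of that idea: bump minimum workers before a known peak window and drop them afterwards via RunPod's GraphQL API. The mutation name and argument shape below follow my recollection of what the linked update_min_workers.py script sends; verify them against that script or RunPod's GraphQL spec before relying on this. The endpoint ID and worker counts are placeholders.

```python
import os
import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={RUNPOD_API_KEY}"


def set_min_workers(endpoint_id: str, min_workers: int) -> dict:
    """Raise or lower an endpoint's minimum (always-on) worker count."""
    # Assumed mutation -- double-check against the linked script / GraphQL spec.
    query = """
    mutation UpdateMinWorkers($endpointId: String!, $workerCount: Int!) {
      updateEndpointWorkersMin(input: {endpointId: $endpointId, workerCount: $workerCount}) {
        id
        workersMin
        workersMax
      }
    }
    """
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": query, "variables": {"endpointId": endpoint_id, "workerCount": min_workers}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# Example: call from a cron job / scheduler around your known peak window.
# set_min_workers("your-endpoint-id", 5)   # before peak
# set_min_workers("your-endpoint-id", 1)   # after peak
```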
AC_pill
AC_pill4mo ago
Yes, I need to avoid any throttling because the demand will be huge, but the tasks are short, ~15s
justin
justin4mo ago
@AC_pill If you are under-utilizing the GPU on these short tasks you can add concurrency, if you're not already 🙂 btw https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py Here is an example from my OpenLLM worker; I have it set to 1 by default, but you can play with 2-3 to see if you get memory bottlenecked.
https://docs.runpod.io/serverless/workers/handlers/handler-concurrency Here is their documentation on it. Honestly, it wasn't until 2-3 weeks ago that I really delved into concurrency on RunPod, because the docs didn't exist, but after pestering the staff xD they were able to help me out on it~ and they got the doc pushed. https://discord.com/channels/912829806415085598/1200525738449846342 Here is the original thread on that if curious haha.
But if you don't have concurrency already, it would allow a single worker to handle multiple jobs at a time. That means if you're not fully utilizing the GPU you can increase the concurrency, or maybe even stay on a baseline GPU if you know you can safely run some number of parallel jobs 🙂
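Roughly, the concurrency setup from the linked doc looks like this: an async handler plus a concurrency_modifier that tells the SDK how many jobs one worker may run at once. The fixed value of 4 and the run_inference stub are placeholders to tune while watching GPU memory.

```python
import asyncio
import runpod


async def run_inference(prompt: str) -> str:
    # Stand-in for the real ~15s GPU task.
    await asyncio.sleep(1)
    return f"result for {prompt!r}"


async def handler(job):
    prompt = job["input"].get("prompt", "")
    return {"output": await run_inference(prompt)}


def concurrency_modifier(current_concurrency: int) -> int:
    # How many jobs a single worker may run at once; start at 1 and raise it
    # (e.g. to 3-4) while checking for GPU out-of-memory errors.
    return 4


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```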
AC_pill
AC_pill4mo ago
Thanks for the advice, I'll need to check. It's TurboXL, so I need to check the memory usage
justin
justin4mo ago
dang this is great to learn this exists, haha, sounds good 🙂
AC_pill
AC_pill4mo ago
Yeah, but this is a heavy GPU consumer with the new models; I'm pretty sure there will be memory leaks, but that can be a second line of research. @justin [Not Staff] do you know if we can pull tasks from the Serverless queue?
justin
justin4mo ago
No, not possible as far as I can tell, unless you want to write your own circumvention logic or something. You could potentially hijack a worker at the end of a job, before it returns, to check some circumvention queue / cache, complete that job, and write the result back out.
AC_pill
AC_pill4mo ago
So I'll probably need to wrap tasks together in the same run (say, 4 tasks) for 1 queue item
justin
justin4mo ago
Ah yeah, or do the concurrency stuff and set it to 4, unless those jobs are specifically grouped together for other logical reasons. Yeah: batching jobs together, concurrency, or a circumvention infrastructure. (Rough batching sketch below.)
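If batching is the route, the handler side can be as simple as looping over a list of tasks packed into one request. The request shape and run_single_task are hypothetical placeholders for the real pipeline.

```python
import runpod


def run_single_task(task: dict) -> dict:
    # Stand-in for one ~15s unit of work.
    return {"task_id": task.get("id"), "status": "done"}


def handler(job):
    # Hypothetical payload: {"input": {"tasks": [{...}, {...}, {...}, {...}]}}
    # so one queue item covers several short tasks.
    tasks = job["input"].get("tasks", [])
    return {"results": [run_single_task(t) for t in tasks]}


runpod.serverless.start({"handler": handler})
```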
AC_pill
AC_pill4mo ago
I saw that handler script; the issue is my workflow network is complex and changes a lot, so it would be hard to let the handler do the work. If it were the opposite and we could pull the JSON tasks, an async task handler would perform best. Thanks for the reply, that might help in the future.
justin
justin4mo ago
Yeah, the complexity is higher, but what I tested before was to send empty requests to RunPod just to spin up workers, and have the worker find my own distributed queue on Upstash to actually do the job-pulling logic, then write the answer to PlanetScale lol. It's a bit of a crazy workaround, only worth it if you need such fine-grained control + you want to host your own stuff.
To be honest, that can give you really fine-grained control, because then you could return an arbitrary value, ending the process and controlling when you want the worker to "terminate"; but all the surrounding infrastructure is hosted by you.
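A rough sketch of that pattern, assuming an Upstash Redis queue reachable through the standard redis client: the RunPod request body is just a wake-up call, and the worker drains your own queue. The queue key, result store, and process function are hypothetical.

```python
import json
import os

import redis  # Upstash exposes a Redis-compatible connection URL
import runpod

r = redis.from_url(os.environ["UPSTASH_REDIS_URL"])


def process(task: dict) -> dict:
    # Stand-in for the real GPU work.
    return {"task_id": task.get("id"), "status": "done"}


def handler(job):
    # The RunPod payload is ignored; the request only exists to spin this worker up.
    done = 0
    while True:
        raw = r.lpop("pending_tasks")  # hypothetical queue key
        if raw is None:
            break  # queue drained -> return and let the worker go idle
        task = json.loads(raw)
        r.hset("results", task["id"], json.dumps(process(task)))  # hypothetical result store
        done += 1
    return {"processed": done}


runpod.serverless.start({"handler": handler})
```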
AC_pill
AC_pill4mo ago
Yeah, I need to be pragmatic with the complexity here; the team is small and mostly devoted to frontend, so backend maintenance will lag if it goes through the roof.
justin
justin4mo ago
Makes sense~ gl! 🙂 👁️, u sound like u got a cool project / business ongoing
AC_pill
AC_pill4mo ago
Yes it is, very AI-driven like 99% of apps now 🙂 but it's cool. I'll post news on how it's moving; if it works it could be a good case study for RunPod.
@justin [Not Staff] yeap, not there yet: memory leaks using ONNX models with concurrency:
[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running FusedConv node. Name:'Conv_455' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=c67b8afabaf8 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
And memory is only 70% full with 3 instances. In case you are using it too.