RunPod4mo ago
AC_pill

Is there a programmatic way to activate servers on high demand / peak hours load?

We are testing serverless for a production deployment next month. I want to ensure we will have capacity during peak hours. We'll have some active servers, but we need to guarantee capacity for certain peak hours. Is there a way to programmatically activate the servers?
16 Replies
justin
justin4mo ago
There is, let me find it...
justin
justin4mo ago
https://github.com/ashleykleynhans/runpod-api/blob/main/serverless/update_min_workers.py I've never gotten around to automating it, but I have manually tested that setting minimum workers does seem to give you some sort of stronger priority in their system
justin
justin4mo ago
So my thought was always to potentially use this wrapper around their GraphQL endpoint to programmatically toggle minimum active workers. It isn't "instantly" on, but from my anecdotal testing it's still probably in the 1-2 minute range for workers to go from a throttled state to an Active state and stay there. Maybe faster from idle > active.
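For reference, a minimal sketch of that idea: bump minimum workers before a known peak window and drop them afterwards via RunPod's GraphQL API. The mutation name and argument shape below follow my recollection of what the linked update_min_workers.py script sends; verify them against that script or RunPod's GraphQL spec before relying on this. The endpoint ID and worker counts are placeholders.

```python
import os
import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={RUNPOD_API_KEY}"


def set_min_workers(endpoint_id: str, min_workers: int) -> dict:
    """Raise or lower an endpoint's minimum (always-on) worker count."""
    # Assumed mutation -- double-check against the linked script / GraphQL spec.
    query = """
    mutation UpdateMinWorkers($endpointId: String!, $workerCount: Int!) {
      updateEndpointWorkersMin(input: {endpointId: $endpointId, workerCount: $workerCount}) {
        id
        workersMin
        workersMax
      }
    }
    """
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": query, "variables": {"endpointId": endpoint_id, "workerCount": min_workers}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# Example: call from a cron job / scheduler around your known peak window.
# set_min_workers("your-endpoint-id", 5)   # before peak
# set_min_workers("your-endpoint-id", 1)   # after peak
```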
AC_pill
AC_pill4mo ago
Yes, I need to avoid any throttling because the demand will be huge, but the tasks are short, ~15s
justin
justin4mo ago
@AC_pill If you are under-utilizing the GPU on these short tasks you can add concurrency, if you're not already 🙂 btw https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py Here is an example from my OpenLLM worker; I have it set to 1 by default, but you can play with 2-3 to see if you get memory bottlenecked.
https://docs.runpod.io/serverless/workers/handlers/handler-concurrency Here is their documentation on it. Honestly, it wasn't until 2-3 weeks ago that I really delved into concurrency on RunPod, because the docs didn't exist, but after pestering the staff xD they were able to help me out on it~ and they got the doc pushed. https://discord.com/channels/912829806415085598/1200525738449846342 Here is the original thread on that if curious haha.
But if you don't have concurrency already, it would allow a single worker to handle multiple jobs at a time. That means if you're not fully utilizing the GPU you can increase the concurrency, or maybe even stay on a baseline GPU if you know you can safely run some number of parallel jobs 🙂
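Roughly, the concurrency setup from the linked doc looks like this: an async handler plus a concurrency_modifier that tells the SDK how many jobs one worker may run at once. The fixed value of 4 and the run_inference stub are placeholders to tune while watching GPU memory.

```python
import asyncio
import runpod


async def run_inference(prompt: str) -> str:
    # Stand-in for the real ~15s GPU task.
    await asyncio.sleep(1)
    return f"result for {prompt!r}"


async def handler(job):
    prompt = job["input"].get("prompt", "")
    return {"output": await run_inference(prompt)}


def concurrency_modifier(current_concurrency: int) -> int:
    # How many jobs a single worker may run at once; start at 1 and raise it
    # (e.g. to 3-4) while checking for GPU out-of-memory errors.
    return 4


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```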
AC_pill
AC_pill4mo ago
Thanks for the advice, I'll need to check. It's TurboXL, so I need to check the memory usage
justin
justin4mo ago
dang this is great to learn this exists, haha, sounds good 🙂
AC_pill
AC_pill4mo ago
Yeah, but this is a heavy GPU consumer with the new models; I'm pretty sure there will be memory leaks, but that can be a second line of research. @justin [Not Staff] do you know if we can pull tasks from the Serverless queue?
justin
justin4mo ago
No, not possible as far as I can tell, unless you want to write your own circumvention logic or something. You could potentially hijack a worker at the end of a job, before it returns, to check some circumvention queue / cache, complete that job, and write the result back out.
AC_pill
AC_pill4mo ago
So I'll probably need to wrap tasks together in the same run (say, 4 tasks) for 1 queue item
justin
justin4mo ago
Ah yeah, or do the concurrency stuff and set it to 4, unless those jobs are specifically grouped together for other logical reasons. Yeah: batching jobs together, concurrency, or a circumvention infrastructure. (Rough batching sketch below.)
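If batching is the route, the handler side can be as simple as looping over a list of tasks packed into one request. The request shape and run_single_task are hypothetical placeholders for the real pipeline.

```python
import runpod


def run_single_task(task: dict) -> dict:
    # Stand-in for one ~15s unit of work.
    return {"task_id": task.get("id"), "status": "done"}


def handler(job):
    # Hypothetical payload: {"input": {"tasks": [{...}, {...}, {...}, {...}]}}
    # so one queue item covers several short tasks.
    tasks = job["input"].get("tasks", [])
    return {"results": [run_single_task(t) for t in tasks]}


runpod.serverless.start({"handler": handler})
```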
AC_pill
AC_pill4mo ago
I saw that handler script; the issue is my workflow network is complex and changes a lot, so it would be hard to let the handler do the work. If it were the opposite and we could pull the JSON tasks, an async task handler would perform best. Thanks for the reply, that might help in the future.
justin
justin4mo ago
Yeah, the complexity is higher, but what I tested before was to send empty requests to RunPod just to spin up workers, and have the worker find my own distributed queue on Upstash to actually do the job-pulling logic, then write the answer to PlanetScale lol. It's a bit of a crazy workaround, only worth it if you need such fine-grained control + you want to host your own stuff.
To be honest, that can give you really fine-grained control, because then you could return an arbitrary value, ending the process and controlling when you want the worker to "terminate"; but all the surrounding infrastructure is hosted by you.
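A rough sketch of that pattern, assuming an Upstash Redis queue reachable through the standard redis client: the RunPod request body is just a wake-up call, and the worker drains your own queue. The queue key, result store, and process function are hypothetical.

```python
import json
import os

import redis  # Upstash exposes a Redis-compatible connection URL
import runpod

r = redis.from_url(os.environ["UPSTASH_REDIS_URL"])


def process(task: dict) -> dict:
    # Stand-in for the real GPU work.
    return {"task_id": task.get("id"), "status": "done"}


def handler(job):
    # The RunPod payload is ignored; the request only exists to spin this worker up.
    done = 0
    while True:
        raw = r.lpop("pending_tasks")  # hypothetical queue key
        if raw is None:
            break  # queue drained -> return and let the worker go idle
        task = json.loads(raw)
        r.hset("results", task["id"], json.dumps(process(task)))  # hypothetical result store
        done += 1
    return {"processed": done}


runpod.serverless.start({"handler": handler})
```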
AC_pill
AC_pill4mo ago
Yeah, I need to be pragmatic with the complexity here; the team is small and mostly devoted to frontend, so backend maintenance will lag if it goes through the roof.
justin
justin4mo ago
Makes sense~ gl! 🙂 👁️, u sound like u got a cool project / business ongoing
AC_pill
AC_pill4mo ago
Yes it is, very AI-driven like 99% of apps now 🙂 but it's cool. I'll post news on how it's moving; if it works it could be a good case study for RunPod.
@justin [Not Staff] yeap, not there yet: memory leaks using ONNX models with concurrency:
[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running FusedConv node. Name:'Conv_455' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=c67b8afabaf8 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
And memory is only 70% full with 3 instances. In case you are using it too.