Hello RunPod support/team,
We're on a serverless setup with flex workers (up to our account limit), but no active/always-on workers configured yet.
We run small, quick GPU inference jobs (short TTS-style workloads) and want to minimize cold starts + queuing during bursts while preserving good concurrency.
From the docs:
- Concurrent handlers let one worker process multiple requests simultaneously (our current handler setup is sketched below).
- Active workers stay "always on" to eliminate cold starts, at the discounted active-worker rate.
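For context, here's a minimal sketch of how we enable concurrency in the handler, following the concurrency_modifier pattern from the serverless docs; the handler body and the fixed concurrency value are simplified placeholders, not our production code:

```python
import asyncio

import runpod


async def handler(job):
    # Short TTS-style job; the real model call is replaced with a placeholder.
    text = job["input"].get("text", "")
    await asyncio.sleep(0.1)  # stand-in for the actual inference step
    return {"chars_synthesized": len(text)}


def concurrency_modifier(current_concurrency):
    # Fixed concurrency so a single warm worker can take several requests at once.
    return 8  # illustrative value, not tuned


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```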
A few questions to help us tune this optimally:
1. If I keep one always-on worker with higher concurrency, will jobs reliably go to it first, or can RunPod still cold-start new workers unnecessarily?
2. Is there a way to make jobs prefer an already active worker (e.g., max_workers=1 + high concurrency; a sketch of what we have in mind follows this list) to avoid unexpected scaling?
3. Can one always-on worker handle requests from multiple endpoints, or does each endpoint need its own dedicated worker?
4. How aggressive is autoscaling in an active + concurrent setup? Does it wait until active workers are actually saturated (accounting for available concurrency) before scaling out, or can it scale early on short bursts?
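To make question 2 concrete, this is the direction we're considering if a single warm worker can be preferred: max workers pinned to 1 in the endpoint settings, plus a concurrency_modifier that ramps concurrency up while the worker is busy. This is only a sketch; the in_flight gauge and the thresholds are our own illustration, not something from the SDK or docs:

```python
import asyncio

import runpod

in_flight = 0  # in-process count of jobs this worker is handling right now


async def handler(job):
    global in_flight
    in_flight += 1
    try:
        await asyncio.sleep(0.1)  # placeholder for the short TTS inference call
        return {"ok": True}
    finally:
        in_flight -= 1


def concurrency_modifier(current_concurrency):
    # Ramp concurrency up while this worker is saturated and back down when idle,
    # so bursts are absorbed by the warm worker (max workers = 1) rather than
    # triggering new cold starts.
    if in_flight >= current_concurrency and current_concurrency < 16:
        return current_concurrency + 2
    if in_flight < current_concurrency // 2 and current_concurrency > 4:
        return current_concurrency - 1
    return current_concurrency


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

If the scheduler already prefers warm workers with spare concurrency, we'd rather rely on that than on hand-rolled ramping, so please let us know what the recommended approach is.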
Goal: reliable low-latency "warm + concurrent" behavior with minimal unpredictable scaling/cold starts. We see occasional queuing when bursts hit the current worker limit.
Thanks for any details or best-practice tips — happy to share more about our handler setup if it helps!