Issue with unresponsive workers
We launched our model to production a few days ago, and this problem has already happened to us twice.
Problem: Unresponsive workers. Most of them are "ready" but stay "idle" despite requests queuing up for MINUTES.
Expected Behavior: Idle workers should pick up a request as soon as one is waiting in the queue.
Actual Behavior: Workers stay idle, the queue does not get processed, and requests are delayed for minutes.
New / Existing Problem: In our two days in production, this has happened twice.
Steps to Reproduce: It's down to chance. It happens when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
Other Workers: RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6, RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h
Attempted Solutions:
- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as RTX A5000s
- Booting off some unresponsive workers (did nothing)
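
In case it helps anyone confirm the same symptom, here is a minimal sketch of how the stall could be measured from periodic status snapshots. The `QueueSnapshot` fields and the `stalled_seconds` helper are hypothetical names made up for illustration, not part of RunPod's API, and how the snapshots get collected is left open.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """One observation of the endpoint (hypothetical shape, not RunPod's API)."""
    timestamp: float    # seconds since some reference point
    jobs_in_queue: int  # requests waiting to be picked up
    idle_workers: int   # workers reported as ready but idle


def stalled_seconds(snapshots):
    """Return the longest continuous span, in seconds, during which
    requests were queued while at least one worker sat idle."""
    longest = 0.0
    start = None  # timestamp at which the current stall began
    for snap in snapshots:
        if snap.jobs_in_queue > 0 and snap.idle_workers > 0:
            if start is None:
                start = snap.timestamp
            longest = max(longest, snap.timestamp - start)
        else:
            # Queue drained or no idle workers left: the stall is over.
            start = None
    return longest
```

Feeding it snapshots taken every few seconds would give one number (the longest stall) that could be attached to a report like this one instead of "MINUTES".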
