We launched our model to production a few days ago, and we've already hit this problem twice.
Problem: Workers become unresponsive. Most of them are "ready" but stay "idle" while requests queue up for MINUTES.
Expected Behavior: An idle worker should pick up a request as soon as one is waiting in the queue.
Actual Behavior: Workers stay idle, and the queue is not processed for minutes.
New / Existing Problem: In our two days in production, this has happened twice.
Steps to Reproduce: It seems to happen by chance when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID:
1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker:
RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}
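
To make the stall measurable, here is a minimal sketch of how the condition described above (idle workers while jobs sit in the queue) could be detected from periodic samples. It does not call any RunPod API: the `Snapshot` type, its field names, and the 60-second threshold are all assumptions for illustration, and the samples would need to come from whatever monitoring we already have.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Snapshot:
    """Hypothetical sample of endpoint state (names are assumptions)."""
    timestamp: float   # seconds, e.g. from time.time()
    idle_workers: int  # workers reported as ready but idle
    queued_jobs: int   # requests waiting in the queue


def stalled_seconds(snapshots: Iterable[Snapshot]) -> float:
    """Length of the current unbroken run where idle workers coexist with queued jobs."""
    stall_start = None
    duration = 0.0
    for snap in snapshots:
        if snap.idle_workers > 0 and snap.queued_jobs > 0:
            # Stall condition holds: start or extend the current run.
            if stall_start is None:
                stall_start = snap.timestamp
            duration = snap.timestamp - stall_start
        else:
            # Condition broken: reset the run.
            stall_start = None
            duration = 0.0
    return duration


def is_stalled(snapshots: Iterable[Snapshot], threshold_seconds: float = 60.0) -> bool:
    """True if the current stall has lasted at least threshold_seconds (assumed value)."""
    return stalled_seconds(snapshots) >= threshold_seconds
```

Feeding it samples taken every 30 seconds or so would let us log exactly when a stall begins and how long it lasts, which should make the next report easier to correlate with the "Failed to get job, status code: 500" errors.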