marshall · Runpod · 3y ago · 23 replies

Issue with unresponsive workers

We launched our model to production a few days ago, and this problem has already happened to us twice.

Problem: Unresponsive workers. Most of them show as "ready" but sit "idle" while requests queue up for MINUTES.
Expected Behavior: Idle workers should pick up requests as soon as they appear in the queue.
Actual Behavior: Workers stay idle, the queue does not get processed, and requests are delayed for minutes.
New / Existing Problem: In our two days of production experience, this has happened twice.
Steps to Reproduce: It seems to come down to chance, occurring when most RunPod GPUs are under heavy load and all 3090s are "throttled".

Relevant Logs:


Request ID:
1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1

Worker:
RTX A5000 - p5y3srv0gsjtjk

Latest Worker Log:
2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}


Other Workers:

RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6

2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network


RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h

2023-12-24T21:20:21Z worker is ready


Attempted Solutions:

- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as the RTX A5000
- Booting off some unresponsive workers (no effect)
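Since none of the above fixed it, we've been detecting the condition ourselves by polling the serverless health endpoint (`GET https://api.runpod.ai/v2/{endpoint_id}/health`). This is a rough sketch, not an official tool; the payload shape (`jobs.inQueue`, `workers.idle`) and the threshold values are assumptions based on what that endpoint returns for us:

```python
import json
import os
import time
import urllib.request

# Hypothetical watchdog: flags the "idle workers + growing queue"
# condition described above by polling the endpoint health URL.
HEALTH_URL = "https://api.runpod.ai/v2/{endpoint_id}/health"


def queue_is_stuck(health: dict, min_queued: int = 1) -> bool:
    """Return True when jobs are queued while workers sit idle.

    Assumes the health payload looks like:
    {"jobs": {"inQueue": ...}, "workers": {"idle": ..., "running": ...}}
    """
    queued = health.get("jobs", {}).get("inQueue", 0)
    idle = health.get("workers", {}).get("idle", 0)
    return queued >= min_queued and idle > 0


def poll(endpoint_id: str, api_key: str, interval_s: int = 30, strikes: int = 4) -> None:
    """Alert when the stuck condition persists for `strikes` consecutive polls."""
    url = HEALTH_URL.format(endpoint_id=endpoint_id)
    hits = 0
    while True:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            health = json.load(resp)
        hits = hits + 1 if queue_is_stuck(health) else 0
        if hits >= strikes:
            print(f"ALERT: queue stuck for ~{hits * interval_s}s on {endpoint_id}")
            hits = 0
        time.sleep(interval_s)


if __name__ == "__main__":
    poll(os.environ["RUNPOD_ENDPOINT_ID"], os.environ["RUNPOD_API_KEY"])
```

Requiring a few consecutive strikes avoids alerting on a momentary blip where a worker is just about to pick up a job.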