Issue with unresponsive workers

We launched our model to production a few days ago, and this problem has already hit us twice.

Problem: Unresponsive workers. Most of them are "ready" but sit "idle" while requests queue up for MINUTES.
Expected Behavior: An idle worker should pick up a request as soon as one is waiting in the queue.
Actual Behavior: Workers stay idle and the queue goes unprocessed, delaying requests for minutes.
New / Existing Problem: In our two days of experience, this has happened twice.
Steps to Reproduce: It seems to come down to chance, occurring when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}
Other Workers: RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6
2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network
RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h
2023-12-24T21:20:21Z worker is ready
Attempted Solutions:
- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as RTX A5000s
- Removing some of the unresponsive workers (did nothing)
14 Replies
marshall (OP) · 2y ago
:/ A request just got processed, but this failed job is still stuck... this seems like an issue in RunPod's job distribution system.
🐧 · 2y ago
Are you using request count or queue delay? I had a similar issue when using request count, and so did a few others. It was advised to use queue delay.
marshall (OP) · 2y ago
We're using request count. But in that case I'll try queue delay... What settings were recommended?
🐧 · 2y ago
That depends on your use case. I use LLMs, so I just set it to 10s.
marshall (OP) · 2y ago
Never mind, it was already set to queue delay.
🐧 · 2y ago
Ah, so I guess the issue is indeed with their job system. Have you tried with >= 1 active worker? Then send several requests to test scaling.
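A minimal sketch of that kind of scaling test, in Python, assuming placeholder ENDPOINT_ID and API_KEY values and RunPod's public /run and /status HTTP endpoints:

# Sketch: queue a burst of jobs on a serverless endpoint, then watch them drain.
# ENDPOINT_ID, API_KEY, and the payload are placeholders, not values from this thread.
import time
import requests

ENDPOINT_ID = "your_endpoint_id"
API_KEY = "your_runpod_api_key"
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(payload):
    # /run queues the job asynchronously and returns its id
    r = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["id"]

pending = {submit({"prompt": f"scaling test {i}"}) for i in range(10)}
while pending:
    for job_id in list(pending):
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
        if status["status"] not in ("IN_QUEUE", "IN_PROGRESS"):
            print(job_id, status["status"])  # e.g. COMPLETED, FAILED, CANCELLED
            pending.discard(job_id)
    time.sleep(5)

If scaling works, the queue should drain as idle workers pick jobs up; jobs sitting in IN_QUEUE for minutes reproduce the problem described above.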
marshall (OP) · 2y ago
Haven't stress-tested it that much, but here are our current settings:
I might just write some code to automatically cancel jobs that take more than 1 minute; it's a necessary fail-safe anyways.
Also, I cancelled the stuck job via the API (curl) and it magically finished? wuttt. It had a result and everything.
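For reference, a minimal sketch of that kind of fail-safe, in Python, assuming placeholder ENDPOINT_ID and API_KEY values and RunPod's /status and /cancel endpoints (an illustration, not the thread author's actual code):

# Sketch of a one-minute fail-safe: cancel any job still running past the deadline.
import time
import requests

ENDPOINT_ID = "your_endpoint_id"   # placeholder
API_KEY = "your_runpod_api_key"    # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
TIMEOUT_SECONDS = 60

def wait_or_cancel(job_id):
    deadline = time.time() + TIMEOUT_SECONDS
    while time.time() < deadline:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
        if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        time.sleep(2)
    # Deadline passed: ask RunPod to cancel the job.
    return requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS, timeout=30).json()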
flash-singh · 2y ago
What's the endpoint ID?
marshall (OP) · 2y ago
Endpoint ID: isme01qeaw1yd4

Another case:
Endpoint ID: isme01qeaw1yd4
Request ID: dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1
Worker ID: vj8i7gy9eujei6
Worker Logs:
2023-12-27T04:56:23.704924482Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-stream/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:23.704980406Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished", "level": "INFO"}
2023-12-27T04:56:25.707261692Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-done/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:25.707312002Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished.", "level": "INFO"}
Job Results (STATUS):
{
"delayTime": 1774,
"executionTime": 58332,
"id": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1",
"status": "CANCELLED"
}
(automatically cancelled after 1 minute)
flash-singh · 2y ago
did you cancel it?
marshall (OP) · 2y ago
Uh, yeah, our systems now cancel jobs that take more than 1 minute (as a fail-safe).
flash-singh · 2y ago
got it, thanks
J. · 2y ago
Random: https://docs.runpod.io/docs/serverless-usage#--execution-policy. If you aren't already using it, there is an execution policy that can be added to request payloads; I added it to mine because this support ticket made me aware I should, haha.
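For anyone reading along, a rough Python sketch of attaching such a policy to a /run request; the policy field name used here (executionTimeout, in milliseconds) should be double-checked against the docs linked above, and the IDs, key, and input are placeholders:

# Sketch: let RunPod time the job out server-side via an execution policy,
# instead of (or in addition to) a client-side cancel loop.
import requests

ENDPOINT_ID = "your_endpoint_id"   # placeholder
API_KEY = "your_runpod_api_key"    # placeholder

payload = {
    "input": {"prompt": "test"},            # placeholder input
    "policy": {"executionTimeout": 60000},  # assumed field name; value in ms
}

r = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
print(r.json())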
marshall (OP) · 2y ago
that's actually great to know, we might try that xd
