Issue with unresponsive workers

We launched our model to production a few days ago, and this problem has already hit us twice.

Problem: Unresponsive workers. Most of them are "ready" but sit "idle" while requests queue up for MINUTES.
Expected Behavior: An idle worker should pick up a request as soon as one is waiting in the queue.
Actual Behavior: Workers stay idle and the queue goes unprocessed, delaying requests for minutes.
New / Existing Problem: In our two days of experience, this has happened twice.
Steps to Reproduce: It seems to come down to chance, occurring when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}
Other Workers: RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6
2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network
RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h
2023-12-24T21:20:21Z worker is ready
Attempted Solutions:
- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as RTX A5000s
- Removing some of the unresponsive workers (did nothing)
14 Replies
marshall (OP) · 2y ago
:/ A request just got processed, but this failed job is still stuck... this seems like an issue in RunPod's job distribution system.
🐧 · 2y ago
Are you using request count or queue delay? I had a similar issue when using request count, and so did a few others. It was advised to use queue delay.
marshall (OP) · 2y ago
We're using request count. But in that case I'll try queue delay... What settings were recommended?
🐧 · 2y ago
That depends on your use case. I use LLMs, so I just set it to 10s.
marshall (OP) · 2y ago
Never mind, it was already set to queue delay.
🐧 · 2y ago
Ah, so I guess the issue is indeed with their job system. Have you tried with >= 1 active worker? Then send several requests to test scaling.
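A minimal sketch of that kind of scaling test, in Python, assuming placeholder ENDPOINT_ID and API_KEY values and RunPod's public /run and /status HTTP endpoints:

# Sketch: queue a burst of jobs on a serverless endpoint, then watch them drain.
# ENDPOINT_ID, API_KEY, and the payload are placeholders, not values from this thread.
import time
import requests

ENDPOINT_ID = "your_endpoint_id"
API_KEY = "your_runpod_api_key"
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(payload):
    # /run queues the job asynchronously and returns its id
    r = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["id"]

pending = {submit({"prompt": f"scaling test {i}"}) for i in range(10)}
while pending:
    for job_id in list(pending):
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
        if status["status"] not in ("IN_QUEUE", "IN_PROGRESS"):
            print(job_id, status["status"])  # e.g. COMPLETED, FAILED, CANCELLED
            pending.discard(job_id)
    time.sleep(5)

If scaling works, the queue should drain as idle workers pick jobs up; jobs sitting in IN_QUEUE for minutes reproduce the problem described above.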
marshall (OP) · 2y ago
Haven't stress-tested it that much, but here are our current settings:
I might just write some code to automatically cancel jobs that take more than 1 minute; it's a necessary fail-safe anyways.
Also, I cancelled the stuck job via the API (curl) and it magically finished? wuttt. It had a result and everything.
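For reference, a minimal sketch of that kind of fail-safe, in Python, assuming placeholder ENDPOINT_ID and API_KEY values and RunPod's /status and /cancel endpoints (an illustration, not the thread author's actual code):

# Sketch of a one-minute fail-safe: cancel any job still running past the deadline.
import time
import requests

ENDPOINT_ID = "your_endpoint_id"   # placeholder
API_KEY = "your_runpod_api_key"    # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
TIMEOUT_SECONDS = 60

def wait_or_cancel(job_id):
    deadline = time.time() + TIMEOUT_SECONDS
    while time.time() < deadline:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
        if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        time.sleep(2)
    # Deadline passed: ask RunPod to cancel the job.
    return requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS, timeout=30).json()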
flash-singh · 2y ago
What's the endpoint ID?
marshall (OP) · 2y ago
Endpoint ID: isme01qeaw1yd4

Another case:
Endpoint ID: isme01qeaw1yd4
Request ID: dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1
Worker ID: vj8i7gy9eujei6
Worker Logs:
2023-12-27T04:56:23.704924482Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-stream/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:23.704980406Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished", "level": "INFO"}
2023-12-27T04:56:25.707261692Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-done/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:25.707312002Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished.", "level": "INFO"}
Job Results (STATUS):
{
"delayTime": 1774,
"executionTime": 58332,
"id": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1",
"status": "CANCELLED"
}
(automatically cancelled after 1 minute)
flash-singh · 2y ago
did you cancel it?
marshall (OP) · 2y ago
Uh, yeah, our systems now cancel jobs that take more than 1 minute (as a fail-safe).
flash-singh · 2y ago
got it, thanks
J. · 2y ago
Random: https://docs.runpod.io/docs/serverless-usage#--execution-policy. If you aren't already using it, there is an execution policy that can be added to request payloads; I added it to mine because this support ticket made me aware I should, haha.
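For anyone reading along, a rough Python sketch of attaching such a policy to a /run request; the policy field name used here (executionTimeout, in milliseconds) should be double-checked against the docs linked above, and the IDs, key, and input are placeholders:

# Sketch: let RunPod time the job out server-side via an execution policy,
# instead of (or in addition to) a client-side cancel loop.
import requests

ENDPOINT_ID = "your_endpoint_id"   # placeholder
API_KEY = "your_runpod_api_key"    # placeholder

payload = {
    "input": {"prompt": "test"},            # placeholder input
    "policy": {"executionTimeout": 60000},  # assumed field name; value in ms
}

r = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
print(r.json())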
marshall (OP) · 2y ago
that's actually great to know, we might try that xd
