Runpod•3y ago•

23 replies

Issue with unresponsive workers

We've just launched our model to production a few days ago... and we've had this problem happen to us two times.

Problem: Unresponsive workers, most of them are "ready" but are "idle" despite requests queuing up for MINUTES.
Expected Behavior: Idle workers should respond as soon as a request is not yet taken from the queue.
Actual Behavior: Workers stay idle, queue does not get processed and delayed for minutes.
New / Existing Problem: On our two day experience, this has happened twice.
Steps to Reproduce: It's up for chance when most RunPod GPUs are under heavy load, where all 3090s are "throttled".

Relevant Logs:

Request ID:

1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1

1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1

Worker:

RTX A5000 - p5y3srv0gsjtjk

RTX A5000 - p5y3srv0gsjtjk

Latest Worker Log:

2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}

2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}

Other Workers:

RTX A5000 - 217s1y508zuj48

RTX A5000 - 217s1y508zuj48

RTX A5000 - vj8i7gy9eujei6

RTX A5000 - vj8i7gy9eujei6

2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network

2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network

2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network

2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network

RTX A5000 - 1ij40acwnngaxc

RTX A5000 - 1ij40acwnngaxc

RTX A5000 - 3ysqauzbfjwd7h

RTX A5000 - 3ysqauzbfjwd7h

2023-12-24T21:20:21Z worker is ready

2023-12-24T21:20:21Z worker is ready

Attempted Solutions:

- Maxxing out the worker limit to

(as suggested by support staff)
- Using less in-demand GPUs such as

RTX A5000

RTX A5000

s
- Booting off some unresponsive workers (did nothing)

marshallOP•12/24/23, 9:33 PM

marshallOP•12/24/23, 9:37 PM

a request just got processed, but this failed job is still stuck... this seems like an issue on RunPod's job distribution system

�

🐧•12/24/23, 9:38 PM

are you using request count or queue delay ? I had similar issue when using request count and so did a few others. It was advised to use queue delay

marshallOP•12/24/23, 9:39 PM

We're using request count. But in that case I'll try queue delay... What settings were recommended?

�

🐧•12/24/23, 9:40 PM

that depends on your use case. I use LLMs so I just set it to 10s.

marshallOP•12/24/23, 9:40 PM

nevermind it was already in queue delay

�

🐧•12/24/23, 9:40 PM

Ah. so I guess the issue is indeed with their job system

�

🐧•12/24/23, 9:41 PM

Have you tried with >= 1 active worker ? And then send several requests to test scaling.

marshallOP•12/24/23, 9:43 PM

Haven't stress-tested it that much, but here's our current settings:

marshallOP•12/24/23, 9:43 PM

I might just write some code to automatically cancel jobs that take more than 1 minute

marshallOP•12/24/23, 9:44 PM

it's a necessary fail-safe anyways

marshallOP•12/24/23, 9:49 PM

I cancelled the job via API (

curl

curl

) and it magically finished? wuttt

marshallOP•12/24/23, 9:50 PM

It had a result and everything

flash-singh•12/24/23, 10:43 PM

whats the endpoint id?

Fflash-singh whats the endpoint id?

marshallOP•12/25/23, 7:48 AM

Endpoint ID:

isme01qeaw1yd4

isme01qeaw1yd4

marshallOP•12/27/23, 5:08 AM

Another case:

Endpoint ID:

isme01qeaw1yd4

isme01qeaw1yd4

Request ID:

dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1

dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1

Worker ID:

vj8i7gy9eujei6

vj8i7gy9eujei6

Worker Logs:

2023-12-27T04:56:23.704924482Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-stream/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:23.704980406Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished", "level": "INFO"}
2023-12-27T04:56:25.707261692Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-done/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:25.707312002Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished.", "level": "INFO"}

2023-12-27T04:56:23.704924482Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-stream/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:23.704980406Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished", "level": "INFO"}
2023-12-27T04:56:25.707261692Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-done/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:25.707312002Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished.", "level": "INFO"}

Job Results (STATUS):

{
  "delayTime": 1774,
  "executionTime": 58332,
  "id": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1",
  "status": "CANCELLED"
}

{
  "delayTime": 1774,
  "executionTime": 58332,
  "id": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1",
  "status": "CANCELLED"
}

marshallOP•12/27/23, 5:08 AM

(automatically cancelled after 1 minute)

flash-singh•12/27/23, 4:27 PM

did you cancel it?

Fflash-singh did you cancel it?

marshallOP•12/27/23, 4:52 PM

uh, yeah our systems now cancel jobs that take more than 1 minute (as a fail-safe)

flash-singh•12/27/23, 4:58 PM

got it, thanks

Mmarshall uh, yeah our systems now cancel jobs that take more than 1 minute (as a fail-saf...

J.•12/27/23, 5:47 PM

random:

https://docs.runpod.io/docs/serverless-usage#--execution-policy

If not already there is an execution policy it seems that i added to my request payloads bc this support ticket made me aware i should do it haha

RunPod

🖇️ | Using Your Endpoint

The method in which jobs are submitted and returned.

JJ.random: https://docs.runpod.io/docs/serverless-usage#--execution-policy If not...

marshallOP•12/27/23, 5:49 PM

that's actually great to know, we might try that xd

Issue with unresponsive workers

Relevant Logs:

Attempted Solutions:

Another case:

Similar Threads

Similar Threads

Another case:

Similar Threads