Unhealthy worker state in serverless endpoint: remote error: tls: bad record MAC

I'm using a RunPod serverless endpoint with a worker limit of 6. The endpoint performs well, except for one error: sometimes a worker becomes "unhealthy" and HTTP requests fail with:

request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": remote error: tls: bad record MAC

or

request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": write tcp [2001:1c02:2c09:9100:7bab:2fba:21cc:6df1]:53732->[2606:4700::6812:9dd]:443: use of closed network connection

Observations:
- The request fails at the network level, not with a 500 status.
- A retry usually succeeds (load test: 10 concurrent requests, 5 workers online).
- Failures continue for some time, then the success rate returns to 100%.
- This has been happening for months in production; we handle it via retries + fallback endpoints (a sketch of that pattern is below).
- We're scaling up and want to consolidate our three fallback endpoints to one or two for better worker efficiency.

Questions:
- Does anyone recognize this pattern? Any solutions or workarounds?
- Can I identify which worker served a given request, so I can programmatically kill/restart it? RunPod eventually fixes this itself (internal /ping?), but it takes too long, especially off-hours with few workers.
- How does RunPod queueing work? Since the HTTP request fails at the network level, is traffic actually redirected to the worker infrastructure, rather than an API proxy returning a 500?
- To the RunPod team: is this
  - a load balancing / health management bug fixable on your end?
  - an infrastructure limitation that requires retries?
  - a misconfiguration in my endpoint/worker image?
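For reference, here is a minimal sketch of the retry + fallback pattern described above, assuming a Go client (the error strings above match Go's net/http formatting). The endpoint IDs, env var name, auth header, timeout, and retry counts are placeholders, not our exact configuration:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// runSync posts a job to one endpoint's /runsync route and returns the body.
func runSync(client *http.Client, endpointID string, payload []byte) ([]byte, error) {
	url := fmt.Sprintf("https://api.runpod.ai/v2/%s/runsync", endpointID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RUNPOD_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		// Network-level failures (tls: bad record MAC, use of closed network
		// connection) surface here as transport errors, not HTTP status codes.
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// runWithFallback retries each endpoint a few times before moving on to the
// next fallback endpoint, mirroring the "retries + fallback endpoints" setup.
func runWithFallback(endpointIDs []string, payload []byte) ([]byte, error) {
	client := &http.Client{Timeout: 2 * time.Minute}
	var lastErr error
	for _, id := range endpointIDs {
		for attempt := 0; attempt < 3; attempt++ {
			body, err := runSync(client, id, payload)
			if err == nil {
				return body, nil
			}
			lastErr = err
			time.Sleep(time.Duration(attempt+1) * time.Second) // simple linear backoff
		}
	}
	return nil, fmt.Errorf("all endpoints failed, last error: %w", lastErr)
}

func main() {
	endpoints := []string{"PRIMARY_ENDPOINT_ID", "FALLBACK_ENDPOINT_ID"} // placeholders
	out, err := runWithFallback(endpoints, []byte(`{"input": {}}`))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

Retrying the same endpoint a few times before failing over keeps warm workers in use; only persistent network errors push traffic to a fallback endpoint.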
Tom Huibers (OP) · 2w ago
Additional information from a debug attempt:
- Changed the request-queue parameter to 10 requests to decrease the number of workers and increase the chance of failure.
- Ran 50 requests with a parallelism of 20, so 2 workers were online.
- Received network errors at the beginning of the run; they stopped towards the end.
- Polled the worker state using https://rest.runpod.io/v1/endpoints/$ENDPOINT_ID?includeWorkers=true, then retrieved each worker's health ping URL and called it (a rough sketch of this polling loop is below).

According to the endpoints API, the workers were never reported unhealthy, yet the error still went away on its own. This suggests the error occurs not in the workers themselves, but in the network routing in between. I've included the logs for the load test, the health check monitoring output, and the health check monitoring script in this comment.
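For reference, a rough sketch of that polling loop, again assuming Go. The REST URL is the one quoted above; I haven't reproduced the response schema here (including where each worker's health ping URL lives in it), so the sketch just logs the raw JSON and leaves the per-worker ping call as a comment. Env var names and the poll interval are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// pollEndpoint repeatedly fetches the endpoint state (with worker details)
// and logs the raw response so unhealthy transitions can be correlated with
// the network errors seen during the load test.
func pollEndpoint(endpointID, apiKey string) {
	url := fmt.Sprintf("https://rest.runpod.io/v1/endpoints/%s?includeWorkers=true", endpointID)
	for {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		req.Header.Set("Authorization", "Bearer "+apiKey)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s poll failed: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// Log the raw JSON; a real monitor would parse the worker entries
			// out of this response and call each worker's health/ping URL.
			fmt.Printf("%s %d %s\n", time.Now().Format(time.RFC3339), resp.StatusCode, body)
		}
		time.Sleep(5 * time.Second) // poll interval is arbitrary
	}
}

func main() {
	pollEndpoint(os.Getenv("RUNPOD_ENDPOINT_ID"), os.Getenv("RUNPOD_API_KEY"))
}
```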
