Unhealthy worker state in serverless endpoint: remote error: tls: bad record MAC
I'm using a RunPod serverless endpoint with a worker limit of 6. The endpoint performs well, except for one recurring error: sometimes a worker becomes "unhealthy" and HTTP requests fail with either

- request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": remote error: tls: bad record MAC
- request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": write tcp [2001:1c02:2c09:9100:7bab:2fba:21cc:6df1]:53732->[2606:4700::6812:9dd]:443: use of closed network connection
Observations:
- Requests fail at the network level, not with a 500 status.
- A retry usually succeeds (load test: 10 concurrent requests, 5 workers online).
- Failures continue for some time, then the success rate returns to 100%.
- This has been happening for months in production; we handle it with retries + fallback endpoints (sketch below).
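For context, here's a minimal sketch of the retry + fallback pattern we use today against /runsync. The second endpoint ID and the payload are placeholders:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
	"time"
)

// runSync posts one job to a single endpoint. Transport failures (like the
// "bad record MAC" error above) come back as a non-nil error before any
// HTTP status exists.
func runSync(endpointID string, payload []byte) (*http.Response, error) {
	url := fmt.Sprintf("https://api.runpod.ai/v2/%s/runsync", endpointID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RUNPOD_API_KEY"))
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}

// tryEndpoints retries each endpoint with a short backoff before falling
// through to the next fallback endpoint.
func tryEndpoints(endpoints []string, payload []byte) (*http.Response, error) {
	var lastErr error
	for _, ep := range endpoints {
		for attempt := 1; attempt <= 3; attempt++ {
			resp, err := runSync(ep, payload)
			if err == nil {
				return resp, nil // got an HTTP response at all
			}
			lastErr = err
			time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
		}
	}
	return nil, lastErr
}

func main() {
	resp, err := tryEndpoints(
		[]string{"s3bxj20mra4dvp", "fallback-endpoint-id"}, // second ID is a placeholder
		[]byte(`{"input":{}}`),
	)
	if err != nil {
		fmt.Println("all endpoints failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

This works, but every fallback endpoint has to keep its own warm workers, which is the efficiency cost we want to eliminate.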
We're scaling up and want to consolidate our three fallback endpoints into one or two for better worker efficiency.
Questions: Does anyone recognize this pattern? Any solutions or workarounds?
Can I identify which worker served a given request, so I can programmatically kill or restart it? RunPod eventually recovers on its own (internal /ping health checks?), but that takes too long, especially off-hours when few workers are online.
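For illustration, here's what I had in mind. I've seen a workerId field in /v2/{endpointID}/status/{jobID} responses, but I haven't found it formally documented, so treat that field as an assumption. It also only helps for requests that got far enough to return a job ID; the failing /runsync calls die at the transport layer before that, and I don't know of a public API to terminate a single serverless worker either:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// statusResponse models only the fields I care about from
// GET https://api.runpod.ai/v2/{endpointID}/status/{jobID}.
// workerId is an assumption based on responses I've seen, not
// something I can point to in the docs.
type statusResponse struct {
	ID       string `json:"id"`
	Status   string `json:"status"`
	WorkerID string `json:"workerId"`
}

// jobWorker looks up which worker handled a completed or in-flight job.
func jobWorker(endpointID, jobID string) (string, error) {
	url := fmt.Sprintf("https://api.runpod.ai/v2/%s/status/%s", endpointID, jobID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RUNPOD_API_KEY"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var s statusResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return "", err
	}
	return s.WorkerID, nil
}

func main() {
	worker, err := jobWorker("s3bxj20mra4dvp", "some-job-id") // job ID is a placeholder
	fmt.Println(worker, err)
}
```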
How does RunPod's queueing work? Since the HTTP request fails at the network level, is traffic actually redirected to the worker infrastructure, as opposed to an API proxy that would return a 500?
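For reference, this is how I tell the two failure modes apart on the client. Go's net/http wraps transport failures in *url.Error, so a TLS or TCP error is clearly distinguishable from a 5xx returned by a proxy:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
)

// classify distinguishes a transport-layer failure (TLS/TCP, no HTTP
// response ever arrived) from an HTTP error status returned by the API.
func classify(resp *http.Response, err error) string {
	if err != nil {
		var urlErr *url.Error
		if errors.As(err, &urlErr) {
			// Covers both "remote error: tls: bad record MAC" and
			// "use of closed network connection".
			return "network-level failure: " + urlErr.Err.Error()
		}
		return "other client error: " + err.Error()
	}
	if resp.StatusCode >= 500 {
		return "HTTP 5xx from the API: " + resp.Status
	}
	return "ok: " + resp.Status
}

func main() {
	resp, err := http.Post("https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync", "application/json", nil)
	if resp != nil {
		defer resp.Body.Close()
	}
	fmt.Println(classify(resp, err))
}
```

In our logs it's always the first branch, never the 5xx one, which is what makes me think the request is being handed off past the proxy.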
To the RunPod team: is this
- a load balancing / health management bug fixable on your end?
- an infrastructure limitation requiring retries?
- a misconfiguration in my endpoint or worker image?