RunPod•4mo ago
ssssteven

Failed to get job. | Error Type: ClientConnectorError

Hey all, I'm starting to receive this kind of error:

2024-02-26T21:49:02.442274586Z connectionpool.py :872 2024-02-26 21:49:02,441 Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd718d52aa0>: Failed to resolve 'api.runpod.ai' ([Errno -3] Temporary failure in name resolution)")': /v2/d7n1ceeuq4swlp/ping/xkqvldjqlccihw?gpu=NVIDIA+A40&runpod_version=1.6.0
2024-02-26T21:49:12.459986454Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientConnectorError | Error Message: Cannot connect to host api.runpod.ai:443 ssl:default [Temporary failure in name resolution]", "level": "ERROR"}

It seems like the system keeps retrying to get the job for 40s, and this interval is counted in the serverless billing time. What is going on? Thanks!

request id: 0e0314f9-3a78-46bc-b708-969d86ec5b84-u1
worker id: xkqvldjqlccihw
8 Replies
ashleyk•4mo ago
Seems to be a DNS issue where it could not resolve api.runpod.ai. I had some of these errors on my endpoint as well.
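To check whether a worker is hitting the same name-resolution failure, a small probe along these lines could be run inside the container. This is only a sketch: the helper name `can_resolve`, its parameters, and the backoff values are illustrative assumptions, not part of the RunPod SDK. It relies on `socket.getaddrinfo`, which raises `socket.gaierror` for errors such as "[Errno -3] Temporary failure in name resolution".

```python
import socket
import time


def can_resolve(host, port=443, attempts=3, base_delay=0.5,
                resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True as soon as `host` resolves, False if every attempt fails.

    Hypothetical helper for illustration. `resolver` and `sleep` are
    injectable so the logic can be exercised without real DNS or waiting.
    """
    for attempt in range(attempts):
        try:
            resolver(host, port)
            return True
        except socket.gaierror:
            # Wait 0.5s, 1s, 2s, ... between attempts, but not after the last.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False


if __name__ == "__main__":
    print("api.runpod.ai resolves:", can_resolve("api.runpod.ai"))
```

If this prints `False` while other hostnames resolve fine, the problem is more likely on the resolver side than in the handler code.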
ssssteven•4mo ago
this is happening more and more often. it can last more than a few mins and that time gets added to our bill 😦 request_id: 7a86e856-c03b-4dd7-adeb-24deaebf5de4-u1 worker_id: xkqvldjqlccihw @flash-singh is this a known issue? Thank you
flash-singh•4mo ago
i saw that one was done in 20s
ssssteven•4mo ago
thanks. is the DNS error normal in the log? it took about 40s to resolve the task id and then start the job
flash-singh•4mo ago
that's not normal, it's something we are looking to improve and catch faster
n8tzto•4mo ago
I have also encountered these errors. In recent days, there have been network connection issues within the serverless workers. I have noticed that endpoints occasionally encounter network connection problems. This impacts several processes within running jobs, such as downloading files from URLs, uploading files to S3, and sending HTTP update requests, causing them to fail or become extremely slow.
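Until the underlying network issue is fixed, one way to make downloads, uploads, and update requests inside a handler more tolerant of short outages is to wrap them in a retry with exponential backoff and jitter. The following is a minimal sketch, not a RunPod feature: `retry_with_backoff` and all of its parameters are hypothetical names chosen for illustration. The default `retry_on=(OSError,)` is an assumption that covers built-in errors such as `socket.gaierror`, `ConnectionError`, and `TimeoutError`; pass your HTTP or S3 client's own exception types if they differ.

```python
import random
import time


def retry_with_backoff(fn, attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call `fn()` and retry on transient errors with exponential backoff.

    Hypothetical helper for illustration. The last failure is re-raised
    so the job still fails visibly instead of hanging.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Base delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% jitter so many workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A call site could look like `retry_with_backoff(lambda: download(url))`, where `download` stands in for whatever function your handler already uses. Keep `attempts` and `max_delay` small, since time spent waiting still counts as worker run time.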
ashleyk•4mo ago
Yeah, my workers are also getting DNS issues and connection timeouts to the API.
ssssteven•4mo ago
+1 on task timeout... can we get an ETA on this? Thanks!