RunPod•15mo ago

Failed to get job. | Error Type: ClientConnectorError

Hey all, I'm starting to receive this kind of error:

2024-02-26T21:49:02.442274586Z connectionpool.py :872 2024-02-26 21:49:02,441 Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd718d52aa0>: Failed to resolve 'api.runpod.ai' ([Errno -3] Temporary failure in name resolution)")': /v2/d7n1ceeuq4swlp/ping/xkqvldjqlccihw?gpu=NVIDIA+A40&runpod_version=1.6.0

2024-02-26T21:49:12.459986454Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientConnectorError | Error Message: Cannot connect to host api.runpod.ai:443 ssl:default [Temporary failure in name resolution]", "level": "ERROR"}

It seems the system keeps retrying to get the job for 40s, and this interval is counted toward the serverless billing time. What is going on? Thanks!

request id: 0e0314f9-3a78-46bc-b708-969d86ec5b84-u1
worker id: xkqvldjqlccihw

8 Replies

ashleyk•15mo ago

Seems to be a DNS issue where it could not resolve api.runpod.ai. I had some of these errors on my endpoint as well.
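For anyone who wants to confirm from inside a worker that name resolution is the failing step, here is a minimal sketch using only Python's standard library. The function name `can_resolve` and its parameters are hypothetical, made up for illustration; this is not part of the RunPod SDK, and it only tells you whether the hostname resolves, not why it fails.

```python
import socket
import time


def can_resolve(host, port=443, attempts=3, delay=1.0,
                resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True if `host` resolves within `attempts` tries, else False.

    `resolver` and `sleep` are injectable so the check can be exercised
    without real network access (hypothetical helper, not a RunPod API).
    """
    for attempt in range(attempts):
        try:
            resolver(host, port)
            return True
        except socket.gaierror:
            # "[Errno -3] Temporary failure in name resolution" surfaces
            # as socket.gaierror; wait briefly before the next attempt.
            if attempt < attempts - 1:
                sleep(delay)
    return False


if __name__ == "__main__":
    print("api.runpod.ai resolves:", can_resolve("api.runpod.ai"))
```

Logging the result of a check like this at handler start could help tell a DNS failure apart from other connection problems when reporting the issue.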

sssstevenOP•15mo ago

this is happening more and more often. it can last more than a few mins and gets added to our bill 😦 request_id: 7a86e856-c03b-4dd7-adeb-24deaebf5de4-u1 worker_id: xkqvldjqlccihw @flash-singh is this a known issue? Thank you

flash-singh•15mo ago

i saw that one was done in 20s

sssstevenOP•15mo ago

thanks. is the DNS error normal in the log? it took about 40s to resolve the task id and then start the job

flash-singh•15mo ago

that's not normal, it's something we are looking to improve and catch faster

n8tzto•15mo ago

I have also encountered these errors. In recent days, there have been network connection issues within the serverless workers. I have noticed that endpoints occasionally encounter network connection problems. This impacts several processes within running jobs, such as downloading files from URLs, uploading files to S3, and sending HTTP update requests, causing them to fail or become extremely slow.
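Until the underlying network issue is resolved, one common mitigation for the in-job steps mentioned above (downloads, S3 uploads, HTTP update requests) is to wrap them in a retry with exponential backoff. The sketch below is a generic illustration, not a RunPod feature: `retry_with_backoff` and all of its parameters are hypothetical names, and it assumes the wrapped operation raises an `OSError` subclass (such as `ConnectionError` or `socket.gaierror`) on transient network failures.

```python
import time


def retry_with_backoff(operation, attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call `operation()` and retry on transient errors with growing delays.

    Delays follow base_delay * 2**attempt, capped at max_delay. The last
    failure is re-raised so the caller still sees the real error.
    (Hypothetical helper for illustration only.)
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_upload():
        # Stand-in for a download/upload/HTTP call that fails twice.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("simulated network failure")
        return "uploaded"

    print(retry_with_backoff(flaky_upload, sleep=lambda s: None))
```

Note that retries keep the worker running, so they trade billed time for a higher chance of the job completing; capping `attempts` and `max_delay` keeps that cost bounded.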

ashleyk•15mo ago

Yeah my workers are also getting DNS issues and connection timed out to the API.

sssstevenOP•15mo ago

+1 on task timeout... can we get an ETA on this? Thanks!
