Container image stuck in a loading loop when launching workers
I have workers that never actually launch after pulling containers, and I have no idea how to debug this. I deleted and recreated the endpoint and get the same behavior. Any thoughts on how to resolve this? It's extra aggravating because I had to spin this up due to the EU-SE-1 performance degradation, and now I'm getting hit with this issue.
endpoint id: d9b5s5qpbl0sfb
=== Snip from the logs. This just loops repeatedly. I have the credentials set for Docker Hub, the image is published and available, etc. ===
2025-05-07T19:34:02Z loading container image from cache
2025-05-07T19:34:50Z Loaded image: docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:34:51Z 0.1.24-dev0 Pulling from docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:34:51Z Digest: sha256:b066e7235b92701dca45b26a3da6437e1fdc3ca96f751fd5bd614cdb40f532bb
2025-05-07T19:34:51Z Status: Image is up to date for docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:34:51Z worker is ready
2025-05-07T19:37:23Z create container docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:23Z loading container image from cache
2025-05-07T19:37:31Z create container: still fetching image docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:32Z create container docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:32Z create container: still fetching image docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:33Z Loaded image: docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:34Z 0.1.24-dev0 Pulling from docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:34Z Digest: sha256:b066e7235b92701dca45b26a3da6437e1fdc3ca96f751fd5bd614cdb40f532bb
2025-05-07T19:37:34Z Status: Image is up to date for docker.io/REDACTED_NAMESPACE/REDACTED_IMAGE:REDACTED_TAG
2025-05-07T19:37:34Z worker is ready
Yep same here
We're also experiencing problems
Workers are running and we're paying for them, but they just sit in the queue and don't get processed...
Same here... I've stopped all of my workers, otherwise it keeps charging me the whole time
Same for us.
I'm facing the same issue. Do you know how we can get this escalated?
I'll tag @Dj to see if they can provide any insight into what is going on. Sounds like it is impacting quite a few of us.
Great
Okay one sec
I thought this was isolated to like 2-3 people
Working on this, let me make a lot more noise
Can I ask, are these public or private images?
in my case, this is/was happening with a private image
In my case public.
GHCR or Docker Hub?
DockerHub for me.
DockerHub
DockerHub for me with a private image.
I'm working on getting this escalated. I can definitely see the pattern; we're just isolating the problem.
Same
Small update, this is being looked into.
Can you check the Container Registry Token you provided to RunPod to ensure it's valid?
You can check on Docker Hub directly to avoid messing with anything.
https://app.docker.com/settings/personal-access-tokens
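If you'd rather check from code than the dashboard, here's a rough sketch (assuming Python with `requests` installed; the username, token, and image names below are placeholders, not values from this thread) that asks Docker's public registry auth service for a pull token using the same credentials you gave RunPod:
```python
# Sketch: verify a Docker Hub personal access token can still authorize pulls
# for a private repository, via Docker's registry token endpoint.
import requests

DOCKERHUB_USER = "your-namespace"            # placeholder: your Docker Hub username
DOCKERHUB_TOKEN = "dckr_pat_..."             # placeholder: the PAT you gave RunPod
IMAGE = "your-namespace/your-image"          # placeholder: namespace/repository

resp = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io", "scope": f"repository:{IMAGE}:pull"},
    auth=(DOCKERHUB_USER, DOCKERHUB_TOKEN),
    timeout=10,
)

# A 200 response containing a "token" field means the credentials can
# authenticate for pulls; a 401 means the PAT is invalid, expired, or
# lacks read access to that repository.
print(resp.status_code)
print("pull token granted" if resp.ok and "token" in resp.json() else "auth failed")
```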
We released a new endpoint with the same token. Nothing else changed, and it is working.
Yes, valid.
It works for me now.
Private
Been down for about 2 hours
Same image as the last 3 days; the container is trying to load from the RunPod cache
Away from my keyboard right now but will check soon.
For me, all of the workers have now gone away and are starting all over; 40 were assigned and they all started over again.
The issue has to do with the cache
I'd imagine
https://uptime.runpod.io/ is reporting no issues for serverless. Can we get an update so we can relay it to our customers?
Yeah, it's trying to pull the image from the cache again and again; I see it even says success.
We've identified the issue. I'm not sure why we won't fire an incident for it, but it's being treated like one internally.
FWIW: I'm still having the same issues. I nuked the endpoint and recreated it, and the problem persists.
This is pretty frustrating; there have been frequent issues with RunPod serverless. These 3 hours have cost us once again.
I totally understand. The people fixing this issue are actively using this thread to help move things forward. I'll join their call and listen for progress.
I hear you and share your frustration.
Thanks. Please do keep the updates coming.
Can you DM me your newest pod ID?
@noahpantsparty Are you still pending a release?
Found it, it looks like you turned it off
I cancelled all the requests. I can start another if helpful
Should I create another release?
Did a new release on endpoint trky2t3f5nehnx and still seeing the same behavior.
We're deploying what should be a fix to production now
It's just a slow process
TY
I'm confident this issue should be fixed for all affected users.
Our solution is a band-aid and over the coming days we'll fix the issue permanently.
Looks like it is working for me now. Thanks @Dj for helping to resolve the issue.