Workers stuck forever in initializing state with "image ready, model not found"
Today, with no changes to my worker setup, most workers have been stuck at initializing. There seem to be at least 3 types of this failure, with the following logs:
- "image ready, model not found". This is the most common. Example id: vie3xhgrxqf77d
- "image ready, downloading model files", but nothing happens. Workers can be stuck like this for 1+ hour, whereas healthy workers download the model in minutes. Example id: aakzfbq19zcop6
- Stuck at downloading the worker image, stopping at a random spot like "e1ed8d486eca Waiting". Example id: tyewgxka0yoscl
In addition, around 30% of workers have been in a throttled state today.
For extra context: I am using cached models, with a fallback to downloading the model manually in the worker in case the cached image cannot be found. I am using EU regions and RTX 5090 GPUs.
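A minimal sketch of the kind of cache-then-download fallback described above, for anyone trying to reproduce the setup. Everything in it is an assumption for illustration: `resolve_model_dir`, the two directory arguments, and the `download` callable are hypothetical names, not Runpod APIs and not the actual worker code.

```python
from pathlib import Path
from typing import Callable


def resolve_model_dir(
    cache_dir: Path,
    fallback_dir: Path,
    download: Callable[[Path], None],
) -> Path:
    """Return the directory the worker should load the model from.

    cache_dir:    hypothetical location where a cached model would appear.
    fallback_dir: hypothetical local directory for a manual download.
    download:     hypothetical callable that fetches model files into a directory.
    """
    # Use the cached copy only if the directory exists and is not empty.
    if cache_dir.is_dir() and any(cache_dir.iterdir()):
        return cache_dir

    # Otherwise fall back to downloading the model inside the worker.
    fallback_dir.mkdir(parents=True, exist_ok=True)
    download(fallback_dir)
    return fallback_dir
```

Note that a fallback like this only runs once the worker process has started, so it would not help with workers that never leave the initializing state.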
What is the root cause of this forever-initializing worker state? Can I mitigate it from my side, or is it just a problem with Runpod reliability? Most days it's not a problem, but every once in a while this problem pops up and makes Runpod unusable and, as a result, not a viable option for running a business.