connect ECONNREFUSED 172.23.0.6:3003 - ML not starting
It seems like the machine_learning container is not starting. Here is my docker-compose block for that service (it worked before updating):
Here are the logs from the container:
Those logs aren't super helpful. Is there a way to get more verbose logs, and has anyone run into this? Previously I had an issue with the container not being allowed to access the internet, so I had to bind-mount the models into it.
Setting the container to development mode did not give any helpful logs here. It seems to be a FastAPI app, but I'm not sure what the correct way to debug/triage is here.
It seems like some models may have changed today, but in general the models haven't changed for a while: https://github.com/immich-app/immich/commit/165b91b068193db53b07cc4f265d11326530be3c
That PR hasn't made it into a release yet
The server won't be open to requests until you see `Application startup complete.` If there are no errors, it might just be loading the models. Do you see any IO or CPU activity in the container?

No CPU or IO usage after a few seconds. It's on a pretty large ZFS array.
Can you share the model cache directory, either from a bind mount on the host or from inside a running container? I think the default in the container is now /cache. It might be that there is a bug in the loader, for at least bison_l, where it does not raise an exception on missing files.
The cache directory is set with MACHINE_LEARNING_CACHE_FOLDER and defaults to /cache. Are you setting this env?
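For reference, resolving that setting would look roughly like this (the env var name is from the message above; the `/cache` fallback is the default mentioned in this thread, not verified against the source):

```python
import os

# MACHINE_LEARNING_CACHE_FOLDER overrides the cache location;
# per this thread, the in-container default is /cache.
cache_folder = os.environ.get("MACHINE_LEARNING_CACHE_FOLDER", "/cache")
print(cache_folder)
```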
> there is a bug in the loader for at least bison_l where it does not raise an exception with missing files

If there's an error while loading a model, the app deletes the folder associated with that model and tries again. It only errors if this second attempt also fails. In general I don't recommend bind-mounting the model folders directly like this, since the cache structure can change and, as mentioned, the app can delete the contents of these folders on failure.
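A minimal sketch of that delete-and-retry behavior (not Immich's actual code; `load_with_retry` and `load_fn` are hypothetical names):

```python
import shutil
from pathlib import Path


def load_with_retry(model_dir: Path, load_fn):
    """Sketch of the described behavior: if the first load fails, wipe the
    model's cache folder and try once more; only the second failure raises."""
    try:
        return load_fn(model_dir)
    except Exception:
        # First failure: delete the (possibly partial) cached files so the
        # retry can re-download them. This is why bind-mounted model folders
        # can lose their contents.
        shutil.rmtree(model_dir, ignore_errors=True)
        model_dir.mkdir(parents=True, exist_ok=True)
        return load_fn(model_dir)  # a second failure propagates to the logs
```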
Ok, I need to run the containers offline and cannot download the models on startup. Bind-mounting them directly, as in the docker-compose above, worked. How can they be downloaded offline and mounted into the container?
And do exceptions not propagate to logs in the container? That's not totally clear to me
Exceptions definitely do show up in the logs
The model cache isn't terribly well documented at the moment, but if you can match the folder structure here then it should work. https://github.com/immich-app/immich/blob/60729a091ab0da18ec67d4a8d0ca8448715a91b6/machine-learning/app/config.py#L28C1-L29C69
(This cache structure will change in the upcoming release as a heads up)
Where `model_type.value` is `image-classification`, `clip`, or `facial-recognition`.
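Based on that, the on-disk layout should look roughly like this (a sketch; the exact join logic in config.py may differ, and the model name below is a placeholder):

```python
from pathlib import Path


def model_cache_path(cache_folder: str, model_type: str, model_name: str) -> Path:
    # Assumed layout from this thread: <cache>/<model_type.value>/<model_name>,
    # where model_type.value is one of image-classification, clip,
    # or facial-recognition.
    return Path(cache_folder) / model_type / model_name


print(model_cache_path("/cache", "clip", "some-model"))
```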
Previously the huggingface library was used, I think? It did expect some subdirectories, and would first try to serialize relative to the main script before loading from a configured folder.
Interesting behavior. I'll give that approach a try. It's also odd that there weren't any exceptions after ~6 hours.
The first model that gets loaded is the HF image classification model. This model always outputs a log about configuring the feature extractor. Since you don't have that log, it's probably stuck trying to download it. But I'm not sure why it isn't erroring out
The feature extractor preprocesses the image for the model and is downloaded separately, so that might be causing you issues
clip, right?
Ah, CLIP also has two preprocessors: a tokenizer for the text model and a feature extractor for the vision model. But I was talking about the image classification model
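On the earlier offline question: one approach is to pre-download the model and its preprocessor with `transformers` on a machine that has internet access, then copy the folder into the container's cache. This is a sketch under assumptions: the model id in the comment is a placeholder, and `looks_complete` only checks for the files `save_pretrained` normally writes.

```python
from pathlib import Path


def looks_complete(model_dir: Path) -> bool:
    # save_pretrained writes the model config as config.json and the feature
    # extractor's settings as preprocessor_config.json; a folder missing the
    # latter would explain a model sitting without its preprocessor.
    return (model_dir / "config.json").exists() and (
        model_dir / "preprocessor_config.json"
    ).exists()


def predownload(model_id: str, out_dir: Path) -> None:
    # Run this where internet access is available, then copy out_dir into
    # the container's cache folder for the matching model type.
    from transformers import AutoFeatureExtractor, AutoModelForImageClassification

    AutoModelForImageClassification.from_pretrained(model_id).save_pretrained(out_dir)
    AutoFeatureExtractor.from_pretrained(model_id).save_pretrained(out_dir)


# Example usage (placeholder model id, requires network):
#   predownload("microsoft/resnet-50", Path("cache/image-classification/resnet-50"))
```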
Oh OK, cool this gives me a lot to work with. I'll debug my setup tomorrow and keep a better eye on the model structures
thanks!
np!