connect ECONNREFUSED 172.23.0.6:3003 - ML not starting

It seems like the machine_learning container is not starting. Here is my docker-compose block for that service (it worked before I updated):
immich-machine-learning:
  image: ghcr.io/immich-app/immich-machine-learning:release
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
    # - ./model-cache:/cache
    - ./model-cache/yolos-tiny:/usr/src/app/hustvl/yolos-tiny
    - ./model-cache/resnet-50:/usr/src/app/microsoft/resnet-50
    - ./model-cache/clip-ViT-B-32:/usr/src/app/clip-ViT-B-32
  env_file:
    - .env
  environment:
    - NODE_ENV=production
  depends_on:
    - database
  restart: always
Here are the logs from the container:
docker logs immich-immich-machine-learning
INFO: Started server process [7]
INFO: Waiting for application startup.
Those logs aren't super helpful. Is there a way to get more verbose logs, and has anyone run into this? Previously I had an issue where the container wasn't allowed to access the internet, so I had to bind mount the models into it.
moniker (OP) · 2y ago
Setting the container to development mode did not give any helpful logs here. It seems like it may be a FastAPI app, but I'm unsure of the correct way to debug/triage it.
moniker (OP) · 2y ago
It seems like some models may have changed today, but in general the models haven't changed for a while: https://github.com/immich-app/immich/commit/165b91b068193db53b07cc4f265d11326530be3c
sogan · 2y ago
That PR hasn't made it into a release yet. The server won't accept requests until you see "Application startup complete." in the logs. If there are no errors, it might just be loading the models. Do you see any IO or CPU activity in the container?
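A quick way to check that with stock Docker commands; the container name below is taken from the docker logs command earlier in the thread and may differ depending on your compose project name:

docker stats immich-immich-machine-learning   # live CPU, memory, and block/network IO for the container
docker top immich-immich-machine-learning     # processes currently running inside the container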
moniker (OP) · 2y ago
No CPU or IO usage after a few seconds. It's on a pretty large ZFS array. Can you share the model cache directory layout, either from a bind mount or from inside a running container? I think the default in the container might be /cache now? It might be that there is a bug in the loader for at least buffalo_l where it does not raise an exception on missing files.
sogan · 2y ago
The cache directory is set with MACHINE_LEARNING_CACHE_FOLDER and defaults to /cache. Are you setting this env?
> there is a bug in the loader for at least buffalo_l where it does not raise an exception on missing files
If there's an error while loading a model, the app will delete the folder associated with that model and try again. It will only error if this second attempt also fails. In general I don't recommend binding the model folders directly like this since there can be changes in the cache structure, and as mentioned, it can delete the contents of these folders on failure.
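For reference, a minimal sketch of that recommendation, reusing the ./model-cache path from the compose snippet at the top of the thread (the commented-out line) and trimmed to the relevant parts; since it mounts to the default /cache, MACHINE_LEARNING_CACHE_FOLDER doesn't need to be set:

immich-machine-learning:
  image: ghcr.io/immich-app/immich-machine-learning:release
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
    - ./model-cache:/cache
  env_file:
    - .env
  restart: always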
moniker (OP) · 2y ago
Ok, I need to run the containers offline and can't download the models on startup. Bind mounting them directly, as in the docker compose above, worked previously. How can they be downloaded ahead of time and mounted into the container? And do exceptions not propagate to the container logs? That's not totally clear to me.
sogan · 2y ago
Exceptions definitely do show up in the logs
sogan · 2y ago
The model cache isn't terribly well documented at the moment, but if you can match the folder structure here then it should work. https://github.com/immich-app/immich/blob/60729a091ab0da18ec67d4a8d0ca8448715a91b6/machine-learning/app/config.py#L28C1-L29C69 (This cache structure will change in the upcoming release as a heads up)
sogan · 2y ago
Where model_type.value is image-classification, clip or facial-recognition
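Putting that together with the linked config.py, the cache folder would be expected to look roughly like the sketch below; this is a reading of the code rather than documented behaviour, and the model names are placeholders for whatever your instance is configured to use:

/cache/
  image-classification/
    <image classification model name>/
  clip/
    <clip model name>/
  facial-recognition/
    <facial recognition model name>/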
moniker (OP) · 2y ago
Previously the huggingface library was used, I think? It expected some subdirectories, and would first try to serialize relative to the main script before loading from a configured folder. Kind of interesting behavior. I'll give that approach a try. It's also interesting that there aren't any exceptions after 6-ish hours.
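A sketch of one way to pre-populate the cache from a machine that does have internet access, assuming these are plain Hugging Face repos (the repo names come from the compose file above); whether the cloned contents match exactly what the loader expects still needs to be verified against the layout in config.py:

# on a machine with internet access; git-lfs is needed to pull the weight files
git lfs install
git clone https://huggingface.co/hustvl/yolos-tiny
git clone https://huggingface.co/microsoft/resnet-50
# then copy the resulting folders into ./model-cache on the offline host,
# following the layout from config.py shown above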
sogan · 2y ago
The first model that gets loaded is the HF image classification model. This model always outputs a log about configuring the feature extractor. Since you don't have that log, it's probably stuck trying to download it, but I'm not sure why it isn't erroring out. The feature extractor preprocesses the image for the model and is downloaded separately, so that might be what's causing you issues.
moniker (OP) · 2y ago
clip, right?
sogan · 2y ago
Ah, CLIP also has two preprocessors: a tokenizer for the text model and a feature extractor for the vision model. But I was talking about the image classification model
moniker (OP) · 2y ago
Oh OK, cool, this gives me a lot to work with. I'll debug my setup tomorrow and keep a better eye on the model structure. Thanks!
sogan · 2y ago
np!
