connect ECONNREFUSED 172.23.0.6:3003 - ML not starting
It seems like the machine_learning container is not starting. Here is my docker-compose block for that service (it worked before updating):
Here are the logs from the container:
Those logs aren't super helpful. Is there a way to get more verbose logs, and has anyone run into this? Previously I had an issue with the container not being allowed to access the internet, so I had to bind-mount the models into it.
Setting the container to development mode did not give any helpful logs here. It seems to be a FastAPI app, but I'm not sure what the correct way to debug/triage is here.
It seems like some models may have changed today, but in general the models haven't changed for a while: https://github.com/immich-app/immich/commit/165b91b068193db53b07cc4f265d11326530be3c
That PR hasn't made it into a release yet
The server won't be open to requests until you see `Application startup complete.` If there are no errors, it might just be loading the models. Do you see any IO or CPU activity in the container?

No CPU or IO usage after a few seconds. It's on a pretty large ZFS array.
Can you share the model cache directory, either from a bind mount on the host or from inside a running container? I think the default in the container is now /cache. It might be that there is a bug in the loader, for at least bison_l, where it does not raise an exception on missing files.
The cache directory is set with MACHINE_LEARNING_CACHE_FOLDER and defaults to /cache. Are you setting this env?
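For reference, resolving that setting would look roughly like this (the env var name is from the message above; the `/cache` fallback is the default mentioned in this thread, not verified against the source):

```python
import os

# MACHINE_LEARNING_CACHE_FOLDER overrides the cache location;
# per this thread, the in-container default is /cache.
cache_folder = os.environ.get("MACHINE_LEARNING_CACHE_FOLDER", "/cache")
print(cache_folder)
```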
> there is a bug in the loader for at least bison_l where it does not raise an exception with missing files

If there's an error while loading a model, the app deletes the folder associated with that model and tries again. It only errors if this second attempt also fails. In general I don't recommend bind-mounting the model folders directly like this, since the cache structure can change and, as mentioned, the app can delete the contents of these folders on failure.
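A minimal sketch of that delete-and-retry behavior (not Immich's actual code; `load_with_retry` and `load_fn` are hypothetical names):

```python
import shutil
from pathlib import Path


def load_with_retry(model_dir: Path, load_fn):
    """Sketch of the described behavior: if the first load fails, wipe the
    model's cache folder and try once more; only the second failure raises."""
    try:
        return load_fn(model_dir)
    except Exception:
        # First failure: delete the (possibly partial) cached files so the
        # retry can re-download them. This is why bind-mounted model folders
        # can lose their contents.
        shutil.rmtree(model_dir, ignore_errors=True)
        model_dir.mkdir(parents=True, exist_ok=True)
        return load_fn(model_dir)  # a second failure propagates to the logs
```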
Ok, I need to run the containers offline and cannot download the models on startup. Bind-mounting them directly, as in the docker-compose above, worked. How can they be downloaded offline and mounted into the container?
And do exceptions not propagate to logs in the container? That's not totally clear to me
Exceptions definitely do show up in the logs
The model cache isn't terribly well documented at the moment, but if you can match the folder structure here then it should work. https://github.com/immich-app/immich/blob/60729a091ab0da18ec67d4a8d0ca8448715a91b6/machine-learning/app/config.py#L28C1-L29C69
(This cache structure will change in the upcoming release as a heads up)
Where `model_type.value` is `image-classification`, `clip`, or `facial-recognition`.
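Based on that, the on-disk layout should look roughly like this (a sketch; the exact join logic in config.py may differ, and the model name below is a placeholder):

```python
from pathlib import Path


def model_cache_path(cache_folder: str, model_type: str, model_name: str) -> Path:
    # Assumed layout from this thread: <cache>/<model_type.value>/<model_name>,
    # where model_type.value is one of image-classification, clip,
    # or facial-recognition.
    return Path(cache_folder) / model_type / model_name


print(model_cache_path("/cache", "clip", "some-model"))
```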
Previously the huggingface library was used, I think? It did expect some subdirectories, and would first try to serialize relative to the main script before loading from a configured folder.
Interesting behavior. I'll give that approach a try. It's also odd that there weren't any exceptions after ~6 hours.
The first model that gets loaded is the HF image classification model. This model always outputs a log about configuring the feature extractor. Since you don't have that log, it's probably stuck trying to download it. But I'm not sure why it isn't erroring out
The feature extractor preprocesses the image for the model and is downloaded separately, so that might be causing you issues
clip, right?
Ah, CLIP also has two preprocessors: a tokenizer for the text model and a feature extractor for the vision model. But I was talking about the image classification model
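On the earlier offline question: one approach is to pre-download the model and its preprocessor with `transformers` on a machine that has internet access, then copy the folder into the container's cache. This is a sketch under assumptions: the model id in the comment is a placeholder, and `looks_complete` only checks for the files `save_pretrained` normally writes.

```python
from pathlib import Path


def looks_complete(model_dir: Path) -> bool:
    # save_pretrained writes the model config as config.json and the feature
    # extractor's settings as preprocessor_config.json; a folder missing the
    # latter would explain a model sitting without its preprocessor.
    return (model_dir / "config.json").exists() and (
        model_dir / "preprocessor_config.json"
    ).exists()


def predownload(model_id: str, out_dir: Path) -> None:
    # Run this where internet access is available, then copy out_dir into
    # the container's cache folder for the matching model type.
    from transformers import AutoFeatureExtractor, AutoModelForImageClassification

    AutoModelForImageClassification.from_pretrained(model_id).save_pretrained(out_dir)
    AutoFeatureExtractor.from_pretrained(model_id).save_pretrained(out_dir)


# Example usage (placeholder model id, requires network):
#   predownload("microsoft/resnet-50", Path("cache/image-classification/resnet-50"))
```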
Oh OK, cool this gives me a lot to work with. I'll debug my setup tomorrow and keep a better eye on the model structures
thanks!
np!