Machine learning jobs exiting/crashing
i keep getting the errors below within a couple of minutes of starting my object detection or clip embeddings jobs
[Nest] 1 - 05/22/2023, 5:53:01 PM ERROR [SmartInfoService] Unable run object detection pipeline: 13c097e3-201c-4808-b841-e45dc2d5cea1
Error: connect ECONNREFUSED 172.25.0.8:3003
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16)
[Nest] 1 - 05/22/2023, 5:53:23 PM ERROR [SmartInfoService] Unable to run image tagging pipeline: 13c24775-f2e8-4b9c-94d0-c220b543f6cb
Error: connect ECONNREFUSED 172.25.0.8:3003
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16)
[Nest] 1 - 05/22/2023, 5:53:44 PM ERROR [SmartInfoService] Unable run object detection pipeline: 13c24775-f2e8-4b9c-94d0-c220b543f6cb
Error: connect ECONNREFUSED 172.25.0.8:3003
what could be the cause for this?
my face detection job almost completed (31k / 33k) before i started seeing the above errors
i am running on a WD NAS with a 4-core Intel CPU and 4GB RAM
checked the machine learning logs - at the point where i see these errors in the microservices container, machine learning is returning OK
INFO: 172.25.0.6:60130 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60118 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60170 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
INFO: 172.25.0.6:60172 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60192 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
INFO: 172.25.0.6:60214 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60222 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
Can you attach to your microservice container and run apk add curl, then curl immich_machine_learning:3003?
/usr/src/app # apk add curl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/community/x86_64/APKINDEX.tar.gz
(1/1) Installing curl (8.1.0-r2)
Executing busybox-1.35.0-r29.trigger
OK: 180 MiB in 127 packages
/usr/src/app # curl immich_machine_learning:3003
curl: (7) Failed to connect to immich_machine_learning port 3003 after 1 ms: Couldn't connect to server
Try restarting the stack
i restarted each container individually multiple times
but as soon as i start any ML job it starts giving the above errors after some time
on Saturday, my face detection stopped working after about 5k
then i restarted and it completed 31k and started giving errors
so i started it one more time and it again ended after a few mins
see full logs attached
important logs, where socket resets or disconnects
{"log":"Error: socket hang up\n","stream":"stderr","time":"2023-05-22T18:36:10.246829512Z"} {"log":" at connResetException (node:internal/errors:717:14)\n","stream":"stderr","time":"2023-05-22T18:36:10.246854212Z"} {"log":" at Socket.socketOnEnd (node:_http_client:526:23)\n","stream":"stderr","time":"2023-05-22T18:36:10.246874737Z"} {"log":" at Socket.emit (node:events:525:35)\n","stream":"stderr","time":"2023-05-22T18:36:10.24689625Z"} {"log":" at endReadableNT (node:internal/streams/readable:1359:12)\n","stream":"stderr","time":"2023-05-22T18:36:10.246916062Z"} {"log":" at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n","stream":"stderr","time":"2023-05-22T18:36:10.24693635Z"} {"log":"\u001b[31m[Nest] 1 - \u001b[39m05/22/2023, 6:36:10 PM \u001b[31m ERROR\u001b[39m \u001b[38;5;3m[SmartInfoService] \u001b[39m\u001b[31mUnable run object d{"log":"Error: socket hang up\n","stream":"stderr","time":"2023-05-22T18:36:10.251021587Z"} {"log":" at connResetException (node:internal/errors:717:14)\n","stream":"stderr","time":"2023-05-22T18:36:10.2510351Z"} {"log":" at Socket.socketOnEnd (node:_http_client:526:23)\n","stream":"stderr","time":"2023-05-22T18:36:10.251047612Z"} {"log":" at Socket.emit (node:events:525:35)\n","stream":"stderr","time":"2023-05-22T18:36:10.251059812Z"} {"log":" at endReadableNT (node:internal/streams/readable:1359:12)\n","stream":"stderr","time":"2023-05-22T18:36:10.251071962Z"}
Hmm which installation method do you use?
docker compose
this is the corresponding log from the immich server
[Nest] 1 - 05/22/2023, 6:21:53 PM LOG [NestApplication] Nest application successfully started +13932ms
[Nest] 1 - 05/22/2023, 6:22:01 PM LOG [ImmichServer] Running Immich Server in PRODUCTION environment - version 1.56.2 - Listening on port: 3001
[Nest] 1 - 05/22/2023, 6:22:07 PM WARN [ImmichServer] Machine learning is enabled
[Nest] 1 - 05/22/2023, 6:22:07 PM WARN [ImmichServer] Search is enabled
[Nest] 1 - 05/22/2023, 6:23:51 PM ERROR [ExceptionsHandler] Connection terminated due to connection timeout
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
at Socket.emit (node:events:513:28)
at TCP.<anonymous> (node:net:322:12)
Seems like it's not just ML having problems but also your database
It could be that the whole system is being overloaded
What spec do you use for the server?
hmm i have restricted ML and microservices to 3 cores and 3.5gb
ML will need at least 4GB 🤔
server is 4 core 4gb ram
i was monitoring ML when running object detection; it was not going over 2gb
same for facial recognition
that is strange
for the first few tries i did not restrict the ML and microservices containers at all
but since it was crashing i thought it might be due to overload
so now if you perform
docker compose down
and docker compose up
does it run into an error state right away?
so restricted to 3 cores and 3.5gb
no, if you see the logs, i start it around 6.35pm utc and it starts crashing around 6.37
Error: Connection terminated due to connection timeout
I have only seen this when the CPU is at 100% for too long and an attempt to update the database fails
should i restrict cpu to 2 cores?
If the CPU is maxed out, some things will be stalled long enough that they hit a timeout and are cancelled
You could definitely try that.
all containers? or only ML
Just ML should be fine
will try that
Object detection is the only job running and you get this error?
yes that is the only one running
it happened with facial recognition as well a couple of days back, after completing about 95% of the images
So close lol
and yes, i only run one job at a time
Do you have a more powerful computer besides this one?
not currently
What I've found success with is running ML on my desktop for some of the big jobs
can i then carry the data over to my NAS?
i can try my work computer lol
Lol.
how would the directory structure be maintained if i run ML on some other computer?
You need to mount the same volume to that computer
I run something like this:
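A rough sketch of what that kind of setup could look like (illustrative only, not the exact file from this thread - the MACHINE_LEARNING_URL variable name, the NFS mount path, and the IP are assumptions, so check the docs for your Immich version):

# docker-compose.yml on the desktop: run only the ML container,
# with the NAS upload folder mounted at the same path the server uses
version: "3.8"
services:
  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:release
    volumes:
      - /mnt/nas/immich-upload:/usr/src/app/upload   # NFS/SMB mount of the NAS share (example path)
    ports:
      - "3003:3003"
    restart: unless-stopped

# then on the NAS, point the server/microservices at the desktop, e.g.
#   environment:
#     - MACHINE_LEARNING_URL=http://192.168.1.50:3003   # placeholder IP; the variable name may differ by version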
Related - we've fixed some bugs with memory management for large jobs and lowered the default concurrency settings. Those should be in the next release, so that would probably help here as well.
ok, for now i will try out different resource limitations on the containers and report back; if that doesn't help, i will try out the new release
worst case will try out on different computer for ML
Yup, sounds like a good plan.
If you prevent the ML container from using all the CPU I think you'll avoid those timeout/connection problems and hopefully you can finish processing everything. You should be able to just run "missing" on these subsequent attempts as well.
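For example, capping the ML service in the compose file would look something like this (a minimal sketch - 2 of your 4 cores is just an example value, and keep the rest of the service definition from your existing file):

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:release
    # leave at least one core free for the database and microservices;
    # "2" out of 4 cores is only an illustrative value
    cpus: 2
    restart: unless-stopped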
any limitations on RAM that i should try?
You need a minimum amount of ram for stuff to work, but I don't think it needs to be limited anywhere for your situation
[Nest] 1 - 05/22/2023, 9:21:15 PM LOG [InstanceLoader] MicroservicesModule dependencies initialized +1ms
[Nest] 1 - 05/22/2023, 9:21:16 PM LOG [NestApplication] Nest application successfully started +208ms
[Nest] 1 - 05/22/2023, 9:21:16 PM LOG [ImmichMicroservice] Running Immich Microservices in PRODUCTION environment - version 1.56.2 - Listening on port: 3002
[Nest] 1 - 05/22/2023, 9:21:17 PM LOG [MetadataExtractionProcessor] Reverse Geocoding Initialized
[Nest] 1 - 05/22/2023, 9:25:09 PM ERROR [SmartInfoService] Unable run clip encoding pipeline: 0000aa2c-be9f-48b7-86f7-31ac7245ad05
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
at Socket.emit (node:events:513:28)
at TCP.<anonymous> (node:net:322:12)
[Nest] 1 - 05/22/2023, 9:25:56 PM ERROR [SmartInfoService] Unable run clip encoding pipeline: 00026c2d-5a75-4690-bfd0-a259c8ba6530
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
at Socket.emit (node:events:513:28)
at TCP.<anonymous> (node:net:322:12)
[screenshots: CPU and RAM usage of the db, ML, and microservices containers]
@jrasm91 @Alex
so after trying out different limitations, it doesn't look like it's CPU or memory bound when i hit this disconnection error
see the above log: it disconnects at 9.25, and at that point none of these containers have high CPU or RAM usage
This is the timeout with the database; maybe the CPU maxing out is causing the handshake with the database to get dropped?
CPU isn't maxed out
it's 4 cores, so it should go up to 400%
i have posted the screenshot for db, ML and microservice containers CPU and RAM utilization
so, i was able to get CLIP completed after multiple restarts
it would fail within 2 to 3 min without CPU or memory being fully loaded, sometimes immediately without processing a single image
i have been unable to get tag objects to finish even 5% after 5 days
damn
this is rough
Hey - maybe not exactly the same, but I'm experiencing the same thing for Object Tagging. Granted, I'm tagging over 350K objects, but my redis container blew up to over 4GB and my ML CPU was at least 180% usage before my entire VM got stuck and i had to do a hard reboot.
so it's not just CPU but memory management too
I've got 3 CPUs and 8GB assigned to the VM for Immich and it seems to be holding up - all in all the whole Immich stack is using about 6.8 GB.
Added this to my Microservice and ML docker compose:
cpu_count: 2
cpu_percent: 90
cpus: 0.90
cpu_shares: 90
cpuset: 0,1 - for the ms
cpuset: 1,2 - for the ml
And added this to my redis:
mem_limit: 4GB
good news for me - it's been running for 2 days now without crashing
i could increase it but this one seems stable enough...
350K+ objects to tag ~ ETA at 8 days if I'm counting correctly
Do you have a faster PC?
unfortunately nope
Sad day hah
as long as it works. it's a 1-time chore so i'll live
Honestly, having CLIP and facial recognition done is really nice for those associated features (search & people). Object detection and classification isn't quite as cool. The models we use aren't super great and the added functionality (a list of "things" on the explore page) isn't too useful IMO.
Maybe read through this discussion?
https://github.com/immich-app/immich/discussions/2524
It'd be real lame for you to finally finish and then 1-2 weeks later we've revamped those models or something lol
lol! ouch. hmm. I guess I can run CLIP and Facial first then
A GPU would go a long way. But I need to get my containers out of Hyper-V to do that. Right now I have my reasons for Windows, but I might have to buy a new mini-pc just for Immich + Proxmox
Sounds like a plan. It sounds like cuda support could come sooner than later.
tho.. i read there's a way for Intel GPUs to use Nvidia CUDA, not sure what hacks I'll run into for that. Sorry, but most mini-pcs are using Intel iGPUs.
don't plan on getting an nvidia card soon
does that go in the docker compose file?
didn't help me, still seeing the same issue
will just skip on object tagging
Yep that goes into the docker compose file:
Example for the immich-microservices:
immich-microservices:
  container_name: immich_microservices
  image: ghcr.io/immich-app/immich-server:release
  entrypoint: [ "/bin/sh", "./start-microservices.sh" ]
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
  env_file:
    - stack.env
  environment:
    - NODE_ENV=production
  cpu_count: 2
  cpu_percent: 80
  cpus: 0.80
  cpu_shares: 80
  cpuset: 1,2
  restart: unless-stopped
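And the same idea applied to the ML and redis services, using the values mentioned above (a sketch only - keep your own image tags, volumes, and env_file; the redis image tag here is a placeholder):

immich-machine-learning:
  container_name: immich_machine_learning
  image: ghcr.io/immich-app/immich-machine-learning:release
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
  env_file:
    - stack.env
  # pin ML to two cores so the database and microservices keep some headroom
  cpu_count: 2
  cpus: 0.90
  cpuset: "1,2"
  restart: unless-stopped

redis:
  container_name: immich_redis
  image: redis:6.2   # placeholder tag - use whatever your compose file already has
  # cap redis so a huge job queue can't eat all of the VM's memory
  mem_limit: 4GB
  restart: unless-stopped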