Machine learning jobs exiting/crashing
i keep getting the errors below within a couple of minutes of starting my object detection or clip embeddings jobs
[Nest] 1 - 05/22/2023, 5:53:01 PM ERROR [SmartInfoService] Unable run object detection pipeline: 13c097e3-201c-4808-b841-e45dc2d5cea1
Error: connect ECONNREFUSED 172.25.0.8:3003
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16)
[Nest] 1 - 05/22/2023, 5:53:23 PM ERROR [SmartInfoService] Unable to run image tagging pipeline: 13c24775-f2e8-4b9c-94d0-c220b543f6cb
Error: connect ECONNREFUSED 172.25.0.8:3003
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16)
[Nest] 1 - 05/22/2023, 5:53:44 PM ERROR [SmartInfoService] Unable run object detection pipeline: 13c24775-f2e8-4b9c-94d0-c220b543f6cb
Error: connect ECONNREFUSED 172.25.0.8:3003
what could be the cause for this?
my face detection job almost completed (31k / 33k) before i started seeing the above errors
i am running on a WD NAS with a 4-core Intel CPU and 4GB RAM
checked the machine learning logs - at the point where i see these errors in the microservices container, machine learning is returning OK
INFO: 172.25.0.6:60130 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60118 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60170 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
INFO: 172.25.0.6:60172 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60192 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
INFO: 172.25.0.6:60214 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60222 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
Can you attach to your microservice container and run apk add curl, then curl immich_machine_learning:3003?
/usr/src/app # apk add curl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/community/x86_64/APKINDEX.tar.gz
(1/1) Installing curl (8.1.0-r2)
Executing busybox-1.35.0-r29.trigger
OK: 180 MiB in 127 packages
/usr/src/app # curl immich_machine_learning:3003
curl: (7) Failed to connect to immich_machine_learning port 3003 after 1 ms: Couldn't connect to server
Try restarting the stack
i restarted each container individually multiple times
but as soon as i start any ML job it starts giving the above errors after some time
on Saturday, my face detection stopped working after about 5k
then i restarted and it completed 31k and started giving errors
so i started it one more time and it again ended after a few mins
see full logs attached
important logs, where socket resets or disconnects
{"log":"Error: socket hang up\n","stream":"stderr","time":"2023-05-22T18:36:10.246829512Z"} {"log":" at connResetException (node:internal/errors:717:14)\n","stream":"stderr","time":"2023-05-22T18:36:10.246854212Z"} {"log":" at Socket.socketOnEnd (node:_http_client:526:23)\n","stream":"stderr","time":"2023-05-22T18:36:10.246874737Z"} {"log":" at Socket.emit (node:events:525:35)\n","stream":"stderr","time":"2023-05-22T18:36:10.24689625Z"} {"log":" at endReadableNT (node:internal/streams/readable:1359:12)\n","stream":"stderr","time":"2023-05-22T18:36:10.246916062Z"} {"log":" at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n","stream":"stderr","time":"2023-05-22T18:36:10.24693635Z"} {"log":"\u001b[31m[Nest] 1 - \u001b[39m05/22/2023, 6:36:10 PM \u001b[31m ERROR\u001b[39m \u001b[38;5;3m[SmartInfoService] \u001b[39m\u001b[31mUnable run object d{"log":"Error: socket hang up\n","stream":"stderr","time":"2023-05-22T18:36:10.251021587Z"} {"log":" at connResetException (node:internal/errors:717:14)\n","stream":"stderr","time":"2023-05-22T18:36:10.2510351Z"} {"log":" at Socket.socketOnEnd (node:_http_client:526:23)\n","stream":"stderr","time":"2023-05-22T18:36:10.251047612Z"} {"log":" at Socket.emit (node:events:525:35)\n","stream":"stderr","time":"2023-05-22T18:36:10.251059812Z"} {"log":" at endReadableNT (node:internal/streams/readable:1359:12)\n","stream":"stderr","time":"2023-05-22T18:36:10.251071962Z"}
Hmm which installation method do you use?
docker compose
this is the corresponding log from the immich server
[Nest] 1 - 05/22/2023, 6:21:53 PM LOG [NestApplication] Nest application successfully started +13932ms
[Nest] 1 - 05/22/2023, 6:22:01 PM LOG [ImmichServer] Running Immich Server in PRODUCTION environment - version 1.56.2 - Listening on port: 3001
[Nest] 1 - 05/22/2023, 6:22:07 PM WARN [ImmichServer] Machine learning is enabled
[Nest] 1 - 05/22/2023, 6:22:07 PM WARN [ImmichServer] Search is enabled
[Nest] 1 - 05/22/2023, 6:23:51 PM ERROR [ExceptionsHandler] Connection terminated due to connection timeout
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
at Socket.emit (node:events:513:28)
at TCP.<anonymous> (node:net:322:12)
Seems like it's not just ML having problems but also your database
It could be that the whole system is being overloaded
What spec do you use for the server?
hmm i have restricted ML and microservices to 3 cores and 3.5gb
ML will need at least 4GB 🤔
server is 4 core 4gb ram
i was monitoring ML when running object detection; it was not going over 2gb
same for facial recognition
that is strange
for the first few tries i did not restrict the ML and microservices containers at all
but since it was crashing i thought it might be due to overload
so now if you perform
docker compose down
and docker compose up
does it run into an error state right away?
so restricted to 3 cores and 3.5gb
no, if you see the logs, i start it around 6.35pm utc and it starts crashing around 6.37
Error: Connection terminated due to connection timeout
I have only seen this when the CPU is at 100% for too long and an attempt to update the database fails
should i restrict cpu to 2 cores?
If the CPU is maxed out, some things will be stalled long enough that they hit a timeout and are cancelled
You could definitely try that.
all containers? or only ML
Just ML should be fine
will try that
Object detection is the only job running and you get this error?
yes that is the only one running
it happened with facial recognition as well a couple of days back, after completing about 95% of the images
So close lol
and yes, i only run one job at a time
Do you have a more powerful computer besides this one?
not currently
What I've found success with is running ML on my desktop for some of the big jobs
can i then carry the data over to my NAS?
i can try my work computer lol
Lol.
how would the directory structure be maintained if i run ML on some other computer?
You need to mount the same volume to that computer
I run something like this:
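A rough sketch of what that kind of setup could look like (illustrative only, not the exact file from this thread - the MACHINE_LEARNING_URL variable name, the NFS mount path, and the IP are assumptions, so check the docs for your Immich version):

# docker-compose.yml on the desktop: run only the ML container,
# with the NAS upload folder mounted at the same path the server uses
version: "3.8"
services:
  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:release
    volumes:
      - /mnt/nas/immich-upload:/usr/src/app/upload   # NFS/SMB mount of the NAS share (example path)
    ports:
      - "3003:3003"
    restart: unless-stopped

# then on the NAS, point the server/microservices at the desktop, e.g.
#   environment:
#     - MACHINE_LEARNING_URL=http://192.168.1.50:3003   # placeholder IP; the variable name may differ by version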
Related - we've fixed some bugs with memory management for large jobs and lowered the default concurrency settings. Those should be in the next release, so that would probably help here as well.
ok, for now i will try out different resource limitations on the containers and report back; if that doesn't help, i will try out the new release
worst case will try out on different computer for ML
Yup, sounds like a good plan.
If you prevent the ML container from using all the CPU I think you'll avoid those timeout/connection problems and hopefully you can finish processing everything. You should be able to just run "missing" on these subsequent attempts as well.
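For example, capping the ML service in the compose file would look something like this (a minimal sketch - 2 of your 4 cores is just an example value, and keep the rest of the service definition from your existing file):

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:release
    # leave at least one core free for the database and microservices;
    # "2" out of 4 cores is only an illustrative value
    cpus: 2
    restart: unless-stopped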
any limitations on RAM that i should try?
You need a minimum amount of ram for stuff to work, but I don't think it needs to be limited anywhere for your situation
[Nest] 1 - 05/22/2023, 9:21:15 PM LOG [InstanceLoader] MicroservicesModule dependencies initialized +1ms
[Nest] 1 - 05/22/2023, 9:21:16 PM LOG [NestApplication] Nest application successfully started +208ms
[Nest] 1 - 05/22/2023, 9:21:16 PM LOG [ImmichMicroservice] Running Immich Microservices in PRODUCTION environment - version 1.56.2 - Listening on port: 3002
[Nest] 1 - 05/22/2023, 9:21:17 PM LOG [MetadataExtractionProcessor] Reverse Geocoding Initialized
[Nest] 1 - 05/22/2023, 9:25:09 PM ERROR [SmartInfoService] Unable run clip encoding pipeline: 0000aa2c-be9f-48b7-86f7-31ac7245ad05
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
at Socket.emit (node:events:513:28)
at TCP.<anonymous> (node:net:322:12)
[Nest] 1 - 05/22/2023, 9:25:56 PM ERROR [SmartInfoService] Unable run clip encoding pipeline: 00026c2d-5a75-4690-bfd0-a259c8ba6530
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
at Socket.emit (node:events:513:28)
at TCP.<anonymous> (node:net:322:12)
[screenshots: CPU and RAM usage of the db, ML, and microservices containers]
@jrasm91 @Alex
so after trying out different limitations, it doesn't look like it's CPU or memory bound when i hit this disconnection error
see the above log: it disconnects at 9.25, and at that point none of these containers have high CPU or RAM usage
This is the timeout with the database; maybe the CPU maxing out is causing the handshake with the database to get dropped?
CPU isn't maxed out
it's 4 cores, so it should go up to 400%
i have posted the screenshot for db, ML and microservice containers CPU and RAM utilization
so, i was able to get CLIP completed after multiple restarts
it would fail within 2 to 3 min without CPU or memory being fully loaded, sometimes immediately without processing a single image
i have been unable to get tag objects to finish even 5% after 5 days
damn
this is rough
Hey - maybe not exactly the same, but I'm experiencing the same thing for Object Tagging. Granted, I'm tagging over 350K objects, but my redis container blew up to over 4GB and my ML CPU was at least 180% usage before my entire VM got stuck and i had to do a hard reboot.
so it's not just CPU but memory management too
I've got 3 CPUs and 8GB assigned to the VM for Immich and it seems to be holding up - all in all the whole Immich stack is using about 6.8 GB.
Added this to my Microservice and ML docker compose:
cpu_count: 2
cpu_percent: 90
cpus: 0.90
cpu_shares: 90
cpuset: 0,1 - for the ms
cpuset: 1,2 - for the ml
And added this to my redis:
mem_limit: 4GB
good news for me - it's been running for 2 days now without crashing
i could increase it but this one seems stable enough...
350K+ objects to tag ~ ETA at 8 days if I'm counting correctly
Do you have a faster PC?
unfortunately nope
Sad day hah
as long as it works. it's a 1-time chore so i'll live
Honestly, having CLIP and facial recognition done is really nice for those associated features (search & people). Object detection and classification isn't quite as cool. The models we use aren't super great and the added functionality (a list of "things" on the explore page) isn't too useful IMO.
Maybe read through this discussion?
https://github.com/immich-app/immich/discussions/2524
It'd be real lame for you to finally finish and then 1-2 weeks later we've revamped those models or something lol
lol! ouch. hmm. I guess I can run CLIP and Facial first then
A GPU would go a long way. But I need to get my containers out of Hyper-V to do that. Right now I have my reasons for Windows, but I might have to buy a new mini-pc just for Immich + Proxmox
Sounds like a plan. It sounds like cuda support could come sooner than later.
tho.. i read there's a way for Intel GPUs to use Nvidia CUDA, not sure what hacks I'll run into for that. Sorry, but most mini-pcs are using Intel iGPUs.
don't plan on getting an nvidia card soon
does that go in the docker compose file?
didn't help me, still seeing the same issue
will just skip on object tagging
Yep that goes into the docker compose file:
Example for the immich-microservices:
immich-microservices:
  container_name: immich_microservices
  image: ghcr.io/immich-app/immich-server:release
  entrypoint: [ "/bin/sh", "./start-microservices.sh" ]
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
  env_file:
    - stack.env
  environment:
    - NODE_ENV=production
  cpu_count: 2
  cpu_percent: 80
  cpus: 0.80
  cpu_shares: 80
  cpuset: 1,2
  restart: unless-stopped
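And the same idea applied to the ML and redis services, using the values mentioned above (a sketch only - keep your own image tags, volumes, and env_file; the redis image tag here is a placeholder):

immich-machine-learning:
  container_name: immich_machine_learning
  image: ghcr.io/immich-app/immich-machine-learning:release
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
  env_file:
    - stack.env
  # pin ML to two cores so the database and microservices keep some headroom
  cpu_count: 2
  cpus: 0.90
  cpuset: "1,2"
  restart: unless-stopped

redis:
  container_name: immich_redis
  image: redis:6.2   # placeholder tag - use whatever your compose file already has
  # cap redis so a huge job queue can't eat all of the VM's memory
  mem_limit: 4GB
  restart: unless-stopped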