Immich•2y ago
RonnieP

Machine learning jobs exiting/crashing

I keep getting the errors below within a couple of minutes of starting my object detection or CLIP embeddings jobs:

```
[Nest] 1 - 05/22/2023, 5:53:01 PM ERROR [SmartInfoService] Unable run object detection pipeline: 13c097e3-201c-4808-b841-e45dc2d5cea1
Error: connect ECONNREFUSED 172.25.0.8:3003
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16)
[Nest] 1 - 05/22/2023, 5:53:23 PM ERROR [SmartInfoService] Unable to run image tagging pipeline: 13c24775-f2e8-4b9c-94d0-c220b543f6cb
Error: connect ECONNREFUSED 172.25.0.8:3003
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16)
[Nest] 1 - 05/22/2023, 5:53:44 PM ERROR [SmartInfoService] Unable run object detection pipeline: 13c24775-f2e8-4b9c-94d0-c220b543f6cb
Error: connect ECONNREFUSED 172.25.0.8:3003
```

What could be the cause of this? My face detection job almost completed (31k/33k) before I started seeing the above errors. I am running on a WD NAS with a 4-core Intel CPU and 4GB RAM.
57 Replies
RonnieP
RonniePOP•2y ago
I checked the machine learning logs. At the point where I see these errors in the microservices container, machine learning reports everything OK:

```
INFO: 172.25.0.6:60130 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60118 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60170 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
INFO: 172.25.0.6:60172 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60192 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
INFO: 172.25.0.6:60214 - "POST /object-detection/detect-object HTTP/1.1" 200 OK
INFO: 172.25.0.6:60222 - "POST /image-classifier/tag-image HTTP/1.1" 200 OK
```
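(For anyone following along: a simple way to tail these ML logs live, assuming the default container name from the standard compose file, is:)

```sh
docker logs -f immich_machine_learning
```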
Alex Tran
Alex Tran•2y ago
Can you attach to your microservice container and run
```sh
apk add curl
```
then
```sh
curl immich_machine_learning:3003
```
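(Putting the suggestion together as one sketch: the container name immich_microservices is taken from the compose example later in this thread, and apk works because the image is Alpine-based, as the output below confirms.)

```sh
# On the Docker host: open a shell inside the microservices container
docker exec -it immich_microservices sh

# Inside the container: install curl and test connectivity to the ML service
apk add curl
curl immich_machine_learning:3003
```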
RonnieP
RonniePOP•2y ago
```
/usr/src/app # apk add curl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/community/x86_64/APKINDEX.tar.gz
(1/1) Installing curl (8.1.0-r2)
Executing busybox-1.35.0-r29.trigger
OK: 180 MiB in 127 packages
/usr/src/app # curl immich_machine_learning:3003
curl: (7) Failed to connect to immich_machine_learning port 3003 after 1 ms: Couldn't connect to server
```
Alex Tran
Alex Tran•2y ago
Try restarting the stack
RonnieP
RonniePOP•2y ago
I restarted each container individually multiple times, but as soon as I start any ML job it starts giving the above errors after some time. On Saturday my face detection stopped working after about 5k; I restarted and it completed 31k, then started giving errors; so I started it one more time and it again stopped after a few minutes. See full logs attached.
RonnieP
RonniePOP•2y ago
The important logs, where the socket resets or disconnects:

```
{"log":"Error: socket hang up\n","stream":"stderr","time":"2023-05-22T18:36:10.246829512Z"}
{"log":"    at connResetException (node:internal/errors:717:14)\n","stream":"stderr","time":"2023-05-22T18:36:10.246854212Z"}
{"log":"    at Socket.socketOnEnd (node:_http_client:526:23)\n","stream":"stderr","time":"2023-05-22T18:36:10.246874737Z"}
{"log":"    at Socket.emit (node:events:525:35)\n","stream":"stderr","time":"2023-05-22T18:36:10.24689625Z"}
{"log":"    at endReadableNT (node:internal/streams/readable:1359:12)\n","stream":"stderr","time":"2023-05-22T18:36:10.246916062Z"}
{"log":"    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n","stream":"stderr","time":"2023-05-22T18:36:10.24693635Z"}
{"log":"\u001b[31m[Nest] 1 - \u001b[39m05/22/2023, 6:36:10 PM \u001b[31m ERROR\u001b[39m \u001b[38;5;3m[SmartInfoService] \u001b[39m\u001b[31mUnable run object d
{"log":"Error: socket hang up\n","stream":"stderr","time":"2023-05-22T18:36:10.251021587Z"}
{"log":"    at connResetException (node:internal/errors:717:14)\n","stream":"stderr","time":"2023-05-22T18:36:10.2510351Z"}
{"log":"    at Socket.socketOnEnd (node:_http_client:526:23)\n","stream":"stderr","time":"2023-05-22T18:36:10.251047612Z"}
{"log":"    at Socket.emit (node:events:525:35)\n","stream":"stderr","time":"2023-05-22T18:36:10.251059812Z"}
{"log":"    at endReadableNT (node:internal/streams/readable:1359:12)\n","stream":"stderr","time":"2023-05-22T18:36:10.251071962Z"}
```
Alex Tran
Alex Tran•2y ago
Hmm which installation method do you use?
RonnieP
RonniePOP•2y ago
Docker compose. This is the corresponding log from the Immich server:

```
[Nest] 1 - 05/22/2023, 6:21:53 PM LOG [NestApplication] Nest application successfully started +13932ms
[Nest] 1 - 05/22/2023, 6:22:01 PM LOG [ImmichServer] Running Immich Server in PRODUCTION environment - version 1.56.2 - Listening on port: 3001
[Nest] 1 - 05/22/2023, 6:22:07 PM WARN [ImmichServer] Machine learning is enabled
[Nest] 1 - 05/22/2023, 6:22:07 PM WARN [ImmichServer] Search is enabled
[Nest] 1 - 05/22/2023, 6:23:51 PM ERROR [ExceptionsHandler] Connection terminated due to connection timeout
Error: Connection terminated due to connection timeout
    at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:627:28)
    at Connection.emit (node:events:513:28)
    at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
    at Socket.emit (node:events:513:28)
    at TCP.<anonymous> (node:net:322:12)
```
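(Since this error comes from the Postgres client, one quick sanity check is whether the database container is still accepting connections; immich_postgres is an assumed default container name, not confirmed in this thread:)

```sh
docker exec -it immich_postgres pg_isready
```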
bo0tzz
bo0tzz•2y ago
Seems like it's not just ML having problems but also your database. It could be that the whole system is being overloaded.
Alex Tran
Alex Tran•2y ago
What specs do you use for the server?
RonnieP
RonniePOP•2y ago
Hmm, I have restricted ML and microservices to 3 cores and 3.5GB
Alex Tran
Alex Tran•2y ago
ML will need at least 4GB 🤔
RonnieP
RonniePOP•2y ago
The server is 4 cores / 4GB RAM. I was monitoring ML while running object detection and it was not going over 2GB; same for facial recognition.
Alex Tran
Alex Tran•2y ago
that is strange
RonnieP
RonniePOP•2y ago
For the first few tries I did not restrict the ML and microservices containers at all, but since it was crashing I thought it might be due to overload.
Alex Tran
Alex Tran•2y ago
So now, if you perform `docker compose down` and `docker compose up`, does it run into an error state right away?
RonnieP
RonniePOP•2y ago
So, currently restricted to 3 cores and 3.5GB. And no: if you see the logs, I start around 6:35 PM UTC and it starts crashing around 6:37.
jrasm91
jrasm91•2y ago
Error: Connection terminated due to connection timeout
I have only seen this when the CPU is at 100% for too long and an attempt to update the database fails.
RonnieP
RonniePOP•2y ago
Should I restrict the CPU to 2 cores?
jrasm91
jrasm91•2y ago
If the CPU is maxed out, some things will be stalled for so long that they hit a timeout and are cancelled. You could definitely try that.
RonnieP
RonniePOP•2y ago
All containers, or only ML?
jrasm91
jrasm91•2y ago
Just ML should be fine
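(A minimal compose-file sketch of such a limit. The service name is assumed to match the standard compose file; the image name appears later in this thread, and `cpus` is standard compose syntax.)

```yaml
immich-machine-learning:
  image: altran1502/immich-machine-learning:release
  cpus: 2.0   # cap the ML container at roughly 2 CPU cores
```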
RonnieP
RonniePOP•2y ago
will try that
jrasm91
jrasm91•2y ago
Object detection is the only job running and you get this error?
RonnieP
RonniePOP•2y ago
Yes, that is the only one running. It happened with facial recognition as well a couple of days back, after completing about 95% of the images.
jrasm91
jrasm91•2y ago
So close lol
RonnieP
RonniePOP•2y ago
And yes, I only run one job at a time.
jrasm91
jrasm91•2y ago
Do you have a more powerful computer besides this one?
RonnieP
RonniePOP•2y ago
not currently
jrasm91
jrasm91•2y ago
What I've found success with is running ML on my desktop for some of the big jobs
RonnieP
RonniePOP•2y ago
Can I then carry the data over to my NAS? I can try my work computer lol
jrasm91
jrasm91•2y ago
Lol.
RonnieP
RonniePOP•2y ago
How would the directory structure be maintained if I run ML on some other computer?
jrasm91
jrasm91•2y ago
You need to mount the same volume on that computer. I run something like this:
```sh
docker run -it --rm -p 3003:3003 --volume="/media/jrasm91/ImmichLibrary/Photos/Library:/usr/src/app/upload" --volume="model-cache:/cache" altran1502/immich-machine-learning:release
```
Related: we've fixed some bugs with memory management for large jobs, and also lowered the default concurrency settings. Those should be in the next release, so that would probably help here as well.
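(If ML runs on another machine like this, the Immich server on the NAS must be able to reach it over the network, and the server's machine-learning URL setting has to point at that machine; check your .env for the exact variable name. A reachability check mirroring the earlier curl test, with a placeholder LAN IP:)

```sh
# 192.168.1.50 is a hypothetical example; substitute the desktop's actual IP
curl http://192.168.1.50:3003
```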
RonnieP
RonniePOP•2y ago
OK, for now I will try out different resource limitations on the containers and report back; if that doesn't help, I'll try out the new release. Worst case, I'll try running ML on a different computer.
jrasm91
jrasm91•2y ago
Yup, sounds like a good plan. If you prevent the ML container from using all the CPU I think you'll avoid those timeout/connection problems and hopefully you can finish processing everything. You should be able to just run "missing" on these subsequent attempts as well.
RonnieP
RonniePOP•2y ago
Any limitations on RAM that I should try?
jrasm91
jrasm91•2y ago
You need a minimum amount of RAM for stuff to work, but I don't think it needs to be limited anywhere in your situation.
RonnieP
RonniePOP•2y ago
```
[Nest] 1 - 05/22/2023, 9:21:15 PM LOG [InstanceLoader] MicroservicesModule dependencies initialized +1ms
[Nest] 1 - 05/22/2023, 9:21:16 PM LOG [NestApplication] Nest application successfully started +208ms
[Nest] 1 - 05/22/2023, 9:21:16 PM LOG [ImmichMicroservice] Running Immich Microservices in PRODUCTION environment - version 1.56.2 - Listening on port: 3002
[Nest] 1 - 05/22/2023, 9:21:17 PM LOG [MetadataExtractionProcessor] Reverse Geocoding Initialized
[Nest] 1 - 05/22/2023, 9:25:09 PM ERROR [SmartInfoService] Unable run clip encoding pipeline: 0000aa2c-be9f-48b7-86f7-31ac7245ad05
Error: Connection terminated due to connection timeout
    at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:627:28)
    at Connection.emit (node:events:513:28)
    at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
    at Socket.emit (node:events:513:28)
    at TCP.<anonymous> (node:net:322:12)
[Nest] 1 - 05/22/2023, 9:25:56 PM ERROR [SmartInfoService] Unable run clip encoding pipeline: 00026c2d-5a75-4690-bfd0-a259c8ba6530
Error: Connection terminated due to connection timeout
    at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:627:28)
    at Connection.emit (node:events:513:28)
    at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:62:12)
    at Socket.emit (node:events:513:28)
    at TCP.<anonymous> (node:net:322:12)
```
RonnieP
RonniePOP•2y ago
(Four screenshots attached, showing CPU and RAM utilization for the db, ML, and microservices containers.)
RonnieP
RonniePOP•2y ago
@jrasm91 @Alex So, trying out different limitations: it doesn't look like it's CPU- or memory-bound when I hit this disconnection error. See the above log; it disconnects at 9:25, and at that point none of these containers have high CPU or RAM usage.
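(One way to watch this live while a job runs, using the standard Docker CLI; immich_postgres is again an assumed container name:)

```sh
docker stats immich_microservices immich_machine_learning immich_postgres
```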
Alex Tran
Alex Tran•2y ago
This is a timeout with the database. Maybe the CPU maxing out is causing the handshake with the database to get dropped?
RonnieP
RonniePOP•2y ago
CPU isn't maxed out; it's 4 cores, so it should be able to go up to 400%. I have posted the screenshots of CPU and RAM utilization for the db, ML, and microservices containers. I was able to get CLIP completed after multiple restarts: it would fail within 2 to 3 minutes without CPU or memory being fully loaded, sometimes immediately without processing a single image. I have been unable to get tag objects past even 5% after 5 days.
Alex Tran
Alex Tran•2y ago
damn this is rough
ahbeng
ahbeng•2y ago
Hey, maybe not exactly the same, but I'm experiencing this too with Object Tagging. Granted, I'm tagging over 350K objects, but my Redis container blew up to over 4GB and my ML CPU was at least 180% usage before my entire VM got stuck and I had to do a hard reboot. So it's not just CPU but also memory management.

I have 3 CPUs and 8GB assigned to the VM for Immich and it seems to be holding up; all in all, the whole Immich stack is using about 6.8GB. I added this to my microservices and ML services in docker compose:

```
cpu_count: 2
cpu_percent: 90
cpus: 0.90
cpu_shares: 90
cpuset: 0,1   # for the microservices
cpuset: 1,2   # for the ML
```

And added this to my Redis:

```
mem_limit: 4GB
```

Good news for me: it's been running for 2 days now without crashing. I could increase the limits, but this seems stable enough... 350K+ objects to tag, ETA about 8 days if I'm counting correctly.
jrasm91
jrasm91•2y ago
Do you have a faster PC?
ahbeng
ahbeng•2y ago
unfortunately nope
jrasm91
jrasm91•2y ago
Sad day hah
ahbeng
ahbeng•2y ago
As long as it works. It's a one-time chore so I'll live.
jrasm91
jrasm91•2y ago
Honestly, having CLIP and facial recognition done is really nice for the associated features (search & people). Object detection and classification isn't quite as cool. The models we use aren't super great, and the functionality they add (a list of "things" on the explore page) isn't too useful IMO. Maybe read through this discussion? https://github.com/immich-app/immich/discussions/2524 It'd be real lame for you to finally finish 1-2 weeks later and then we've revamped those models or something lol
ahbeng
ahbeng•2y ago
Lol! Ouch. Hmm, I guess I can run CLIP and facial recognition first, then. A GPU would go a long way, but I need to get my containers out of Hyper-V to do that. Right now I have my reasons for Windows, but I might have to buy a new mini PC just for Immich + Proxmox.
jrasm91
jrasm91•2y ago
Sounds like a plan. It sounds like CUDA support could come sooner rather than later.
ahbeng
ahbeng•2y ago
Though... I read there's a way for Intel GPUs to use Nvidia CUDA; not sure what hacks I'll run into for that. Sorry, but most mini PCs use Intel iGPUs, and I don't plan on getting an Nvidia card soon.
RonnieP
RonniePOP•2y ago
Does that go in the docker compose file? It didn't help me; still seeing the same issue. I'll just skip object tagging.
ahbeng
ahbeng•2y ago
Yep, that goes into the docker compose file. Example for the immich-microservices service:

```yaml
immich-microservices:
  container_name: immich_microservices
  image: ghcr.io/immich-app/immich-server:release
  entrypoint: ["/bin/sh", "./start-microservices.sh"]
  volumes:
    - ${UPLOAD_LOCATION}:/usr/src/app/upload
  env_file:
    - stack.env
  environment:
    - NODE_ENV=production
  cpu_count: 2
  cpu_percent: 80
  cpus: 0.80
  cpu_shares: 80
  cpuset: "1,2"
  restart: unless-stopped
```
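(A note on applying this: compose limits only take effect once the container is recreated, which `docker compose up -d` does for any service whose configuration changed. For example:)

```sh
docker compose up -d immich-microservices
```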
