Orange Pi 5 Plus Rockchip RK3588 Hardware ML Acceleration segfault help wanted
I have an Orange Pi 5 Plus that I'm trying to set up for running Immich: I want to enjoy hardware-accelerated machine learning. I've been having quite a bit of trouble getting the driver to work.
I am confident that my Immich setup is correct and that I have given the container proper access to the hardware. Why? Because when I run clinfo in the machine it can see the OpenCL hardware.
Furthermore https://immich.app/docs/features/ml-hardware-acceleration/ says
"In the case of ARM NN, the absence of a Could not load ANN shared libraries log entry means it loaded successfully."
And I do NOT get the load error.
I've tried the firmware from https://github.com/JeffyCN/mirrors/raw/libmali/firmware/g610/mali_csffw.binq and the equivalent from https://github.com/tsukumijima/libmali-rockchip/releases but it makes no difference whether I include that file or not.
Between each test I've completely reset the system, nuking all images and deleting storage.
I get a segfault when the machine learning algo runs. Relevant logs attached, and the part with the error highlighted below:
Image versions:
Container run command:
run --name=immich-machine-learning --replace --rm --network=systemd-immich-net --device=/dev/mali0:/dev/mali0 -v /usr/lib/aarch64-linux-gnu/libmali.so:/usr/lib/libmali.so:ro -v model-cache:/cache --label io.containers.autoupdate=registry --env-file /home/immich/.env --group-add keep-groups ghcr.io/immich-app/immich-machine-learning:release-armnn
My .env
To get clinfo I do this:
Enter container and:
My clinfo out is attached
Related, but not quite matching, GitHub issues:
- https://github.com/immich-app/immich/issues/11019
- https://github.com/immich-app/immich/discussions/10183
I only pass in the relevant files. I am up to date and I reset the cache volume each time. I get no errors related to downloading the models, the chip even says it loads the models.
OS is Armbian Ubuntu Noble Minimal / CLI
I think it is a driver issue, but I am not sure what to do about it.
Does the segfault still happen if you set concurrency to 1?
Also, the error is coming from skimage, not the armnn session itself. It might be that there’s something wrong with the output from armnn and trying to use it is what causes the segfault
I set MACHINE_LEARNING_REQUEST_THREADS=1. New log is attached. Still a segfault, but no info. How do I get info about where the segfault comes from?
The first time it segfaults in those logs, it loaded the CLIP model first without issue, then segfaulted when it loaded the detection model. The second time, it loaded the detection model fine and segfaulted when it loaded the CLIP model. Can you try only running one task?
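On the earlier question of how to find where a segfault comes from: one general option for any Python process is the standard library's faulthandler, which prints the Python-level stack when the interpreter receives SIGSEGV. The sketch below only demonstrates the mechanism on a deliberately crashing child process; it is not Immich code. Setting PYTHONFAULTHANDLER=1 in the container environment is the equivalent switch, but whether the Immich image picks it up is an assumption that has not been verified here. The dump shows Python frames only, so a crash deep inside a native driver will still need the kernel log.

```python
import subprocess
import sys


def crash_report(code: str) -> str:
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns whatever the child wrote to stderr, which includes the
    faulthandler dump if the child died from a fatal signal.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    return proc.stderr


# Deliberately read from address 0 to simulate a crash in native code.
CRASH = "import ctypes; ctypes.string_at(0)"

if __name__ == "__main__":
    print(crash_report(CRASH))
```

The report names the thread and the Python file and line that were executing when the signal arrived, which is usually enough to tell which library call triggered the crash.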
How to run just one task? Upload one picture and then ?
Also how would I know if it succeeded in doing something?
Thx for your help. I probably won't see your messages for a while. I'll get back to it as soon as I have time again 🙂
You can disable one of the tasks temporarily so it doesn’t run when the asset is uploaded
Generally, if there are no errors, the jobs finished running after the upload, and you see a model was loaded in the ML logs, everything is good. There are debug logs for face detection and recognition in the server as well.
I think running just a single job worked. I do not see segfault in my log, see attached log. I went to the settings and disabled machine learning face detection. I left smart search on and uploaded a single picture. I could not disable duplicate detection.
Of course everything was fresh. I reset everything between all tests to remove variables.
If you upload multiple assets, does it still work?
With the same settings?
I got a
[08/01/24 22:52:06] ERROR Worker (pid:5) was sent SIGINT!
But I'm not sure what that is about
This is the chunk
I'll try to upload more pictures and see what happens. Without resetting.
This is normal. The process kills itself after 5 minutes of idling to release RAM
Ok.
I uploaded 5 more and still no segfault or anything else that looks out of place
Just
Within a lot of check for inactivity
should I upload many pictures and see?
or?
So now if you enable facial recognition and run face detection on all assets through the job panel, does that work?
It looks like it works
At least I cannot spot anything wrong in the log and the GUI says it is done
Should i try the duplicate detection then or?
maybe upload more pictures with both of them on as they are now?
This would be the next thing to test
Uploaded one more image and it worked fine. I uploaded 5 and got sigsegv. I have removed check for inactivity lines. Here is the log output.
The first is the one picture that worked fine, and then the error later
But the error does not repeat like it did before when i started this ticket
I uploaded many pictures though so maybe if I upload many pictures it will go wrong again
This time, can you disable one of the tasks again, but set the concurrency higher? I think there might be a race condition in the ANN session, but not sure if it only applies to separate models or if it also happens for a single model.
Oh, and upload multiple assets at once for this
Should I reset or simply restart with another concurrency setting? What would be preferred?
or can it be changed in the web gui?
You need to comment out the request threads env if it’s still there. If you haven’t changed the job concurrency in the admin settings, you don’t need to touch it
After that, the ML service just needs to be restarted
MACHINE_LEARNING_REQUEST_THREADS
<--- this one is the "request threads" ?
I'll assume it is and do it.
Ok. I uploaded 8 pictures. Only smart search active. Again I filtered the check-for-inactivity lines out. Log output attached.
Oh, that’s interesting. So it seems like it can’t handle concurrency at all
How about if you upload just one image, then upload multiple at once maybe a minute later?
Right.
Looks like that works. Got no errors
Uploaded one, then waited a minute and uploaded 4 more.
With just smart search active and without the MACHINE_LEARNING_REQUEST_THREADS env variable.
That narrows it down quite a bit. @zody Do you have an idea for what might be causing this? The segfault happens if multiple requests come in at once when the session loads, but if there’s only one initial request, it can handle concurrency after that.
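The pattern described above (a crash only when several requests hit a session that is still loading) is what an unguarded lazy load tends to look like. The sketch below is a generic illustration of serializing the first load with a lock, under the assumption that the session is created on first use. LazySession, its methods and run_burst are hypothetical names; this is not Immich's or Arm NN's actual code, and it does not show that this is the cause of the segfault.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class LazySession:
    """Hypothetical stand-in for a model session that is loaded on first use."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._session = None
        self.load_count = 0

    def _load(self) -> object:
        time.sleep(0.05)  # pretend that loading the model is slow
        self.load_count += 1
        return object()

    def get(self) -> object:
        # Double-checked locking: only the first caller loads, while the
        # others wait on the lock and then reuse the loaded session.
        if self._session is None:
            with self._lock:
                if self._session is None:
                    self._session = self._load()
        return self._session


def run_burst(n: int = 8) -> tuple[int, int]:
    """Fire n simultaneous requests at a cold session.

    Returns (number of loads performed, number of distinct sessions seen).
    """
    session = LazySession()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda _: session.get(), range(n)))
    return session.load_count, len({id(r) for r in results})


if __name__ == "__main__":
    print(run_burst(8))
```

Uploading one image first and the rest a minute later has the same effect as the lock here: the load finishes before any concurrent request arrives.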
I am thankful for the help that I am getting here.
When does Fynn usually check out Discord? I do not want to sound impatient, but I would also like to do something rather than just waiting if possible. I have a friend with the same hardware. They also have problems, but their environment is too different to tell if it is the same issue we experience.
Would it be better if I opened a ticket on Github?
If I start with the CPU-version of machine learning can I then switch to hardware ml acceleration later? Is it the same models used ?
Yes. Three things you could try to remedy the issue for now:
1. Set MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l.
- This will make it load these models at startup instead of when it first receives a request. Since the issue seems to be time-sensitive, distancing the model loading from the request might help.
2. Set MACHINE_LEARNING_ANN_TUNING_LEVEL to 1 or 3 (1: rapid, 2: normal, 3: exhaustive, default is 2).
- 1 might help with timing
- 3 might help make it choose a different kernel that doesn't have this issue (if this issue is caused by a particular kernel)
3. Set MACHINE_LEARNING_ANN_FP16_TURBO=true
- Just something to try, no clue if it will help
Also, ideally only try one approach at a time unless none of them work
Okay, thanks for giving me something to try.
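To keep the three suggestions above apart while testing, each one can be written down as its own set of overrides and applied one at a time. This is only a bookkeeping sketch: the variable names and values are the ones given in the thread, while the experiment labels and the render_env helper are made up for illustration.

```python
# One entry per experiment, so each test run changes exactly one thing.
# Variable names and values come from the suggestions in the thread;
# the experiment labels are hypothetical.
EXPERIMENTS = {
    "preload": {
        "MACHINE_LEARNING_PRELOAD__CLIP": "ViT-B-32__openai",
        "MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION": "buffalo_l",
    },
    "tuning-rapid": {"MACHINE_LEARNING_ANN_TUNING_LEVEL": "1"},
    "tuning-exhaustive": {"MACHINE_LEARNING_ANN_TUNING_LEVEL": "3"},
    "fp16-turbo": {"MACHINE_LEARNING_ANN_FP16_TURBO": "true"},
}


def render_env(name: str) -> str:
    """Return the KEY=value lines to append to .env for one experiment."""
    overrides = EXPERIMENTS[name]
    return "".join(f"{key}={value}\n" for key, value in overrides.items())


if __name__ == "__main__":
    for name in EXPERIMENTS:
        print(f"# experiment: {name}")
        print(render_env(name))
```

The lines for one experiment go at the end of .env, the ML service is restarted, and they are removed again before the next experiment.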
I changed the way I tested slightly: I started from fresh again, but I did not reset between tests. I uploaded some pictures and left the smart search and face ML on. When I changed tests I asked it to reindex instead of uploading more images. This is enough to trigger the ML process and the errors.
1) Preloading makes no difference for buffalo, but the vit model completely fails to load and then it fails over to CPU.
2) The tuning levels make no difference that I can observe
3) I get a new error when i turn on fp16 turbo:
It seems that it can process slowly, because it did manage to detect some faces.
Something a bit interesting though: I did a sanity check. Without any special option (except for the debug option) I tried to run face detection and recognition (buffalo) on the pictures already uploaded. I told it to run again on all images and checked that the existing data was gone (no people in the people view). It ran without any issues at all and now there are people in the people view.
Then I ran the smart search (vit) model on all pictures, and now it segfaults (SIGSEGV). Eventually it did fall back to CPU and then it did work somewhat.
So it seems that the first model it loads can work fine. In previous attempts I did the smart search first, and then the face search. It seemed to work fine with smart search and then fail for faces: it segfaulted with the face model when the vit model had run before.
Idk whether this info can be used for anything. But something is working, that much is certain.
Log when it fails to load vit model, looks almost the same for preload and when it fails with "dynamic" loading:
What I mean is that it could reprocess the around 10 pictures that I had uploaded to the test server without any issues, and I could clearly observe that the process worked (People not present before run, people present after run).
A mini-update from me: I've started to use Immich without hardware acceleration. I intend to try hardware acceleration again sometime later, but I also realize that I'll have to check out what happens myself and dive into the source code to do so. It does not sound like there is any easy fix at the moment.
Luckily CPU processing isn't too slow, so not too bad. In fact it is fast compared to how long it took to make thumbnails.
If anyone has any news or anything that I could try please write 🙂
Thanks for the help even if the issue did not get resolved this time around.
There’s been some discussion in this ticket around using a different libmali.so driver with success https://discord.com/channels/979116623879368755/1245331532533465118
Ok, interesting. I did not know there was a related thread.
What is the difference between the dummy and gbm versions? Does Immich need the "gbm"?
GBM is a Mesa thing that we don't need
It doesn't hurt to use it, but you shouldn't need it
I am getting a bunch of kernel errors. Not sure if I got them before, because I didn't check. The errors in the immich-machine-learning container are different now that I use v1.9-1-2d267b0. Right now there are no visible errors in the machine learning container, but I get a bunch of kernel errors with this power transition
Okay I got the really long error again
It is a really long python trace.
I think I found what may be the difference between g6p0 and g13p0: one is OpenCL 2.1, the other 3.0.
Source: https://www.roselladb.com/install-opencl-orangepi5-debian-ubuntu.htm via https://github.com/Joshua-Riek/ubuntu-rockchip/issues/879
So, to demystify the name:
libmali <-- Driver name
valhall-g610 <-- Hardware identifier. Orange Pi 5 Plus has ARM Mali-G610 MP4 and the codename is Valhall
g6p0 / g13p0 <-- OpenCL version
dummy / gbm / etc <-- Driver variant/addons
If I am getting it right, which version of OpenCL does Immich expect?
Does the Orange Pi need to be restarted to use a different driver?
I rebooted. I can see more details about the segfault I am getting in the kernel log
So this one is the g13p0 variant (without a reboot)
This one is the g6p0 installed before a reboot
I found this, it may be relevant? https://discuss.tvm.apache.org/t/most-tasks-failed-with-autoscheduler-on-mali-g610-gpu/16139/7
Apache TVM Discuss
Most tasks failed with AutoScheduler on Mali G610 GPU
Update: After some tracing effort, the second issue can be resolved: TVM defaults to run at least 1000ms for every task measurment for non-CPU target, but for some fast tasks the task is repeated too many times, for OpenCL target it means too many kernel launch command is enqueued by clEnqueueNDRangeKernel, causes the out of memory error. Addi...
I am also seeing this:
It works now, no kernel errors, BUT I also see high CPU consumption when I ask the machine learning container to perform ml tasks
It does not say it failed over to CPU
I am using libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb (I replaced the version in the filename to keep track of the variants) WITHOUT passing the firmware to the container
passing firmware made no difference
so
libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb <-- I get no errors, but I think it might secretly be using the CPU anyway (because I see 50-150% CPU consumption when I use the container stats command)
libmali-valhall-g610-g13p0-gbm_v1.9-1-2d267b0_arm64.deb <-- Kernel error, firmware reloading, long Python error message
libmali-valhall-g610-g6p0-dummy_v1.9-1-2d267b0_arm64.deb and libmali-valhall-g610-g6p0-gbm_v1.9-1-2d267b0_arm64.deb <-- segfault
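The package names above follow a regular pattern, so the variants and their outcomes can be tracked mechanically. The sketch below splits a filename into the parts discussed in this thread and records the results reported so far. The field names are made up for illustration. The gXpY part is labelled release_tag here; reading it as an OpenCL version (2.1 vs 3.0) is the interpretation given earlier in the thread, not something this code verifies.

```python
import re

# Pattern for names like
# libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb
# The field names are hypothetical labels, not official terminology.
LIBMALI_DEB = re.compile(
    r"^libmali-(?P<family>[a-z]+)-(?P<gpu>g\d+)-(?P<release_tag>g\d+p\d+)"
    r"-(?P<variant>[a-z0-9-]+)_(?P<version>[^_]+)_(?P<arch>[^_.]+)\.deb$"
)


def parse_libmali_deb(filename: str) -> dict[str, str]:
    """Split a libmali .deb filename into its named components."""
    match = LIBMALI_DEB.match(filename)
    if match is None:
        raise ValueError(f"unexpected libmali package name: {filename}")
    return match.groupdict()


# Outcomes as reported in this thread, keyed by (release_tag, variant).
OBSERVED = {
    ("g13p0", "dummy"): "no errors, but high CPU usage",
    ("g13p0", "gbm"): "kernel errors, firmware reloading, long Python error",
    ("g6p0", "dummy"): "segfault",
    ("g6p0", "gbm"): "segfault",
}

if __name__ == "__main__":
    name = "libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb"
    parts = parse_libmali_deb(name)
    print(parts)
    print(OBSERVED[(parts["release_tag"], parts["variant"])])
```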
I'll wrap up for today. (I'll try to do some more rebooting of my device in between the different drivers tomorrow, maybe that helps)
I would like to know what version of OpenCL I should target. Thx 🙂