Orange Pi 5 Plus Rockchip RK3588 Hardware ML Acceleration segfault help wanted

I have an Orange Pi 5 Plus that I'm trying to set up for running Immich: I want to enjoy hardware-accelerated machine learning. I've been having quite a bit of trouble getting the driver to work. I am confident that my Immich setup is correct and that I have given the container proper access to the hardware. Why? Because when I run clinfo on the machine it can see the OpenCL hardware. Furthermore, https://immich.app/docs/features/ml-hardware-acceleration/ says
In the case of ARM NN, the absence of a Could not load ANN shared libraries log entry means it loaded successfully.
And I do NOT get the load error. I've tried the firmware from https://github.com/JeffyCN/mirrors/raw/libmali/firmware/g610/mali_csffw.binq and the equivalent from https://github.com/tsukumijima/libmali-rockchip/releases, but it makes no difference whether I include that file or not. Between each test I've completely reset the system, nuking all images and deleting storage. I get a segfault when the machine learning job runs. Relevant logs are attached, and the part with the error is highlighted below:
/opt/venv/lib/python3.11/site-packages/skimage/transform/_geometric.py:160: RuntimeWarning: divide by zero encountered in divide
scale = 1.0 / src_demean.var(axis=0).sum() * (S @ d)
[07/31/24 22:06:51] DEBUG Checking for inactivity...
/opt/venv/lib/python3.11/site-packages/skimage/transform/_geometric.py:165: RuntimeWarning: invalid value encountered in multiply
T[:dim, :dim] *= scale
[07/31/24 22:07:01] ERROR Worker (pid:5) was sent SIGSEGV!
Immich
Immich10mo ago
:wave: Hey @eKristensen, thanks for reaching out to us. Please follow the recommended actions below; this will help us be more effective in our support effort and leave more time for building Immich :immich:.
References
- Container Logs: docker compose logs docs
- Container Status: docker compose ps docs
- Reverse Proxy: https://immich.app/docs/administration/reverse-proxy
Checklist
1. :ballot_box_with_check: I have verified I'm on the latest release (note that mobile app releases may take some time).
2. :ballot_box_with_check: I have read applicable release notes.
3. :ballot_box_with_check: I have reviewed the FAQs for known issues.
4. :ballot_box_with_check: I have reviewed Github for known issues.
5. :ballot_box_with_check: I have tried accessing Immich via local ip (without a custom reverse proxy).
6. :ballot_box_with_check: I have uploaded the relevant logs, docker compose, and .env files, making sure to use code formatting.
7. :ballot_box_with_check: I have tried an incognito window, disabled extensions, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable
(an item can be marked as "complete" by reacting with the appropriate number)
If this ticket can be closed you can use the /close command, and re-open it later if needed.
Successfully submitted, a tag has been added to inform contributors. :white_check_mark:
eKristensen
eKristensenOP10mo ago
Image versions:
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/immich-app/immich-machine-learning release-armnn daa9fc743e8a 25 hours ago 1.21 GB
ghcr.io/immich-app/immich-server release b3d108041948 25 hours ago 1.47 GB
docker.io/library/redis 6.2-alpine a8fd49b68365 2 months ago 31.9 MB
docker.io/tensorchord/pgvecto-rs pg14-v0.2.0 e5e5f64f3e66 6 months ago 943 MB
Container run command:
run --name=immich-machine-learning --replace --rm --network=systemd-immich-net --device=/dev/mali0:/dev/mali0 -v /usr/lib/aarch64-linux-gnu/libmali.so:/usr/lib/libmali.so:ro -v model-cache:/cache --label io.containers.autoupdate=registry --env-file /home/immich/.env --group-add keep-groups ghcr.io/immich-app/immich-machine-learning:release-armnn

My .env:
# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=./library
# The location where your database files are stored
DB_DATA_LOCATION=./postgres

# To set a timezone, uncomment the next line and change Etc/UTC to a TZ identifier from this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List
TZ=Europe/Copenhagen

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=<PWD HIDDEN>

# The values below this line do not need to be changed
###################################################################################
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

IMMICH_LOG_LEVEL=debug
To get clinfo, I enter the container and run:
# apt update && apt install clinfo
# clinfo
My clinfo output is attached.
Related, but not quite, GitHub issues:
- https://github.com/immich-app/immich/issues/11019
- https://github.com/immich-app/immich/discussions/10183
I only pass in the relevant files. I am up to date and I reset the cache volume each time. I get no errors related to downloading the models; the log even says it loads the models. The OS is Armbian Ubuntu Noble Minimal / CLI. I think it is a driver issue, but I am not sure what to do about it.
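For completeness, a quick host-side sanity check along these lines might be useful before starting the container (just a sketch; the device and library paths are the ones from the run command above, and the /lib/firmware location is an assumption about where the kernel driver looks for the CSF firmware):
ls -l /dev/mali0                            # GPU device node exists and is accessible
ls -l /usr/lib/aarch64-linux-gnu/libmali.so  # vendor blob that the container bind-mounts
ls -l /lib/firmware/mali_csffw.bin           # CSF firmware on the host (assumed location)
dmesg | grep -i mali                         # did the kernel driver probe the GPU cleanly?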
sogan
sogan10mo ago
Does the segfault still happen if you set concurrency to 1? Also, the error is coming from skimage, not the armnn session itself. It might be that there’s something wrong with the output from armnn and trying to use it is what causes the segfault
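In .env terms that would be something like the line below (a sketch only; the request thread pool size shows up in the ML logs, so you can confirm it took effect there):
# limit the ML service to a single request thread while testing
MACHINE_LEARNING_REQUEST_THREADS=1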
eKristensen
eKristensenOP10mo ago
I set MACHINE_LEARNING_REQUEST_THREADS=1. New log is attached. Still segfault, but no info. How do I get info about where the segfault comes from?
sogan
sogan10mo ago
The first time it segfaults in those logs, it loaded the CLIP model first without issue, then segfaulted when it loaded the detection model. The second time, it loaded the detection model fine and segfaulted when it loaded the CLIP model. Can you try only running one task?
eKristensen
eKristensenOP10mo ago
How do I run just one task? Upload one picture and then what? Also, how would I know if it succeeded doing something? Thanks for your help. I probably won't see your messages for a while. I'll get back to it as soon as I have time again 🙂
sogan
sogan10mo ago
You can disable one of the tasks temporarily so it doesn't run when the asset is uploaded. Generally, if there are no errors, the jobs finished running after the upload, and you see a model was loaded in the ML logs, everything is good. There are debug logs for face detection and recognition in the server as well.
eKristensen
eKristensenOP10mo ago
I think running just a single job worked. I do not see segfault in my log, see attached log. I went to the settings and disabled machine learning face detection. I left smart search on and uploaded a single picture. I could not disable duplicate detection. Of course everything was fresh. I reset everything between all tests to remove variables.
sogan
sogan10mo ago
If you upload multiple assets, does it still work?
eKristensen
eKristensenOP10mo ago
With the same settings? I got a [08/01/24 22:52:06] ERROR Worker (pid:5) was sent SIGINT! but I'm not sure what that is about. This is the relevant chunk:
[08/01/24 22:51:51] DEBUG Checking for inactivity...
[08/01/24 22:52:00] INFO Unloaded ANN model 0
[08/01/24 22:52:03] DEBUG Checking for inactivity...
[08/01/24 22:52:03] INFO Shutting down due to inactivity.
[08/01/24 22:52:03] INFO Shutting down
[08/01/24 22:52:03] INFO Waiting for application shutdown.
[08/01/24 22:52:04] INFO Application shutdown complete.
[08/01/24 22:52:04] INFO Finished server process [5]
[08/01/24 22:52:06] ERROR Worker (pid:5) was sent SIGINT!
[08/01/24 22:52:06] INFO Booting worker with pid: 37
[08/01/24 22:52:13] INFO Started server process [37]
[08/01/24 22:52:13] INFO Waiting for application startup.
[08/01/24 22:52:13] INFO Created in-memory cache with unloading after 300s
of inactivity.
[08/01/24 22:52:13] INFO Initialized request thread pool with 1 threads.
[08/01/24 22:52:13] DEBUG Checking for inactivity...
[08/01/24 22:52:13] INFO Application startup complete.
[08/01/24 22:52:23] DEBUG Checking for inactivity...
I'll try to upload more pictures and see what happens, without resetting.
sogan
sogan10mo ago
This is normal. The process kills itself after 5 minutes of idling to release RAM
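If the unload/reload churn gets in the way of testing, the inactivity window should be configurable; as far as I know it's something like this in the .env (value in seconds, with 300 matching the "unloading after 300s" line in your logs):
# keep loaded models in memory longer between test uploads (sketch)
MACHINE_LEARNING_MODEL_TTL=1800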
eKristensen
eKristensenOP10mo ago
Ok. I uploaded 5 more and still no segfault or anything else that looks out of place. Just this:
[08/01/24 23:11:35] DEBUG Setting model format to armnn
[08/01/24 23:11:35] INFO Loading visual model 'ViT-B-32__openai' to memory
[08/01/24 23:11:35] DEBUG Loading visual preprocessing config for CLIP model
'ViT-B-32__openai'
[08/01/24 23:11:35] DEBUG Loaded visual preprocessing config for CLIP model
'ViT-B-32__openai'
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
[08/01/24 23:11:35] INFO Loading ANN model
/cache/clip/ViT-B-32__openai/visual/model.armnn ...
[08/01/24 23:11:37] INFO Loaded ANN model with ID 0
That was within a lot of "Checking for inactivity" lines. Should I upload many pictures and see? Or?
sogan
sogan10mo ago
So now if you enable facial recognition and run face detection on all assets through the job panel, does that work?
eKristensen
eKristensenOP10mo ago
It looks like it works. At least I cannot spot anything wrong in the log, and the GUI says it is done. Should I try the duplicate detection next? Or maybe upload more pictures with both of them on as they are now?
sogan
sogan10mo ago
This would be the next thing to test
eKristensen
eKristensenOP10mo ago
Uploaded one more image and it worked fine. I uploaded 5 and got SIGSEGV. I have removed the "Checking for inactivity" lines. Here is the log output.
eKristensen
eKristensenOP10mo ago
The first is the one picture that worked fine, and then the error comes later. But the error does not repeat like it did before when I started this ticket. I uploaded many pictures though, so maybe if I upload many pictures it will go wrong again.
sogan
sogan10mo ago
This time, can you disable one of the tasks again, but set the concurrency higher? I think there might be a race condition in the ANN session, but not sure if it only applies to separate models or if it also happens for a single model. Oh, and upload multiple assets at once for this
eKristensen
eKristensenOP10mo ago
Should I reset or simply restart with another concurrency setting? What would be preferred? Or can it be changed in the web GUI?
sogan
sogan10mo ago
You need to comment out the request threads env if it's still there. If you haven't changed the job concurrency in the admin settings, you don't need to touch it. After that, the ML service just needs to be restarted.
eKristensen
eKristensenOP10mo ago
MACHINE_LEARNING_REQUEST_THREADS <--- this one is the "request threads" ? I'll assume it is and do it.
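For my own notes: since --env-file is only read when the container is created (a podman/docker detail, not Immich-specific), a plain restart won't pick the change up, so I re-create the container instead, roughly:
# comment the line out again in /home/immich/.env:
# MACHINE_LEARNING_REQUEST_THREADS=1
# then re-run the original `run --name=immich-machine-learning --replace ...` command from above,
# which replaces the old container and re-reads the env file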
eKristensen
eKristensenOP10mo ago
Ok. I uploaded 8 pictures. Only smart search active. Again I filtered the "Checking for inactivity" lines out. Log output attached.
sogan
sogan10mo ago
Oh, that's interesting. So it seems like it can't handle concurrency at all. How about if you upload just one image, then upload multiple at once maybe a minute later?
eKristensen
eKristensenOP10mo ago
Right. Looks like that works. Got no errors. Uploaded one, then waited a minute and uploaded 4 more, with just smart search active and without the MACHINE_LEARNING_REQUEST_THREADS env variable.
sogan
sogan10mo ago
That narrows it down quite a bit. @zody Do you have an idea for what might be causing this? The segfault happens if multiple requests come in at once when the session loads, but if there’s only one initial request, it can handle concurrency after that.
eKristensen
eKristensenOP9mo ago
I am thankful for the help that I am getting here. When does Fynn usually check out Discord? I do not want to sound impatient, but I would also like to do something rather than just waiting if possible. I have a friend with the same hardware. They also have problems, but their environment is too different to tell if it is the same issue we experience. Would it be better if I opened a ticket on Github? If I start with the CPU-version of machine learning can I then switch to hardware ml acceleration later? Is it the same models used ?
sogan
sogan9mo ago
If I start with the CPU-version of machine learning can I then switch to hardware ml acceleration later? Is it the same models used ?
Yes.
Three things you could try to remedy the issue for now:
1. Set MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l.
- This will make it load these models at startup instead of when it first receives a request. Since the issue seems to be time-sensitive, distancing the model loading from the request might help.
2. Set MACHINE_LEARNING_ANN_TUNING_LEVEL to 1 or 3 (1: rapid, 2: normal, 3: exhaustive, default is 2).
- 1 might help with timing
- 3 might help make it choose a different kernel that doesn't have this issue (if this issue is caused by a particular kernel)
3. Set MACHINE_LEARNING_ANN_FP16_TURBO=true
- Just something to try, no clue if it will help
Also, ideally only try one approach at a time unless none of them work.
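For reference, as .env lines these would look something like the following (values copied from the list above; try one option at a time):
# 1. preload both models at startup
MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai
MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l
# 2. ANN tuning level: 1 = rapid, 2 = normal (default), 3 = exhaustive
MACHINE_LEARNING_ANN_TUNING_LEVEL=3
# 3. FP16 turbo
MACHINE_LEARNING_ANN_FP16_TURBO=true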
eKristensen
eKristensenOP9mo ago
Okay, thanks for giving me something to try. I changed the way I tested slightly: I started from fresh again, but I did not reset between tests. I uploaded some pictures and left the smart and face ML on. When I changed tests I asked it to reindex instead of uploading more images; this is enough to trigger the ML process and the errors.
1) Preloading makes no difference for buffalo, but the ViT model completely fails to load and then it fails over to CPU.
2) The tuning levels make no difference that I can observe.
3) I get a new error when I turn on fp16 turbo:
Warning: ERROR: Layer of type Cast is not supported on any preferred backend [GpuAcc ]
Warning: WARNING: Layer of type Cast is not supported on requested backend GpuAcc for input data type Float16 and output data type Float16 (reason: in validate_arguments src/gpu/cl/kernels/ClCastKernel.cpp:62: src and dst data types must be different), falling back to the next backend.
It seems that it can process slowly, because it did manage to detect some faces.
Something a bit interesting though: I did a sanity check. Without any special option (except for the debug option) I tried to run face detection and recognition (buffalo) on the pictures already uploaded. I told it to run again on all images and checked that the existing data was gone (no people in the people view). It ran without any issues at all, and now there are people in the people view. Then I ran the smart search (ViT) model on all pictures, and now it segfaults (SIGSEGV). Eventually it did fall back to CPU and then it did work somewhat. So it seems that the first model it loads can work fine. In previous attempts I did the smart search first and then the face search: smart search seemed to work fine, and then it segfaulted with the face model once the ViT model had already run. I don't know whether this info can be used for anything, but something is working, that much is certain.
eKristensen
eKristensenOP9mo ago
Log from when it fails to load the ViT model; it looks almost the same for preload and for "dynamic" loading:
eKristensen
eKristensenOP9mo ago
What I mean is that it could reprocess the roughly 10 pictures that I had uploaded to the test server without any issues, and I could clearly observe that the process worked (people not present before the run, people present after the run).
A mini-update from me: I've started to use Immich without hardware acceleration. I intend to try hardware acceleration again sometime later, but I also realize that I'll have to check out what happens myself and dive into the source code to do so. It does not sound like there is any easy fix at the moment. Luckily CPU processing isn't too slow, so it's not too bad. In fact it is fast compared to how long it took to make thumbnails. If anyone has any news or anything that I could try, please write 🙂 Thanks for the help even if the issue did not get resolved this time around.
sogan
sogan9mo ago
There’s been some discussion in this ticket around using a different libmali.so driver with success https://discord.com/channels/979116623879368755/1245331532533465118
eKristensen
eKristensenOP9mo ago
Ok, interesting. I did not know there was a related thread. What is the difference between the dummy and gbm versions? Does Immich need the "gbm" one?
sogan
sogan9mo ago
GBM is a Mesa thing that we don't need. It doesn't hurt to use it, but you shouldn't need it.
eKristensen
eKristensenOP9mo ago
I am getting a bunch of kernel errors; not sure if I got them before, because I didn't check. The errors in the immich-machine-learning container are different now that I use v1.9-1-2d267b0. Right now there are no visible errors in the machine learning container, but I get a bunch of kernel errors with this power transition:
ek-arm kernel: mali fb000000.gpu: Power transition timed out unexpectedly
ek-arm kernel: mali fb000000.gpu: MCU desired = 1
ek-arm kernel: mali fb000000.gpu: MCU sw state = 2
ek-arm kernel: mali fb000000.gpu: Current state :
ek-arm kernel: mali fb000000.gpu: Shader=0000000000000000
ek-arm kernel: mali fb000000.gpu: Tiler =0000000000000000
ek-arm kernel: mali fb000000.gpu: L2 =0000000000000001
ek-arm kernel: mali fb000000.gpu: MCU status = 2
ek-arm kernel: mali fb000000.gpu: Cores transitioning :
ek-arm kernel: mali fb000000.gpu: Shader=0000000000000000
ek-arm kernel: mali fb000000.gpu: Tiler =0000000000000000
ek-arm kernel: mali fb000000.gpu: L2 =0000000000000000
ek-arm kernel: mali fb000000.gpu: Sending reset to GPU - all running jobs will be lost
ek-arm kernel: mali fb000000.gpu: Preparing to soft-reset GPU
ek-arm kernel: mali fb000000.gpu: Wait for MCU power on failed on scheduling tick/tock
ek-arm kernel: mali fb000000.gpu: Resetting GPU (allowing up to 500 ms)
ek-arm kernel: mali fb000000.gpu: Register state:
ek-arm kernel: mali fb000000.gpu: GPU_IRQ_RAWSTAT=0x00000000 GPU_STATUS=0x00000000 MCU_STATUS=0x00000002
ek-arm kernel: mali fb000000.gpu: JOB_IRQ_RAWSTAT=0x00000000 MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00000000
ek-arm kernel: mali fb000000.gpu: GPU_IRQ_MASK=0x00000000 JOB_IRQ_MASK=0x00000000 MMU_IRQ_MASK=0x00000000
ek-arm kernel: mali fb000000.gpu: PWR_OVERRIDE0=0x00000000 PWR_OVERRIDE1=0x00000000
ek-arm kernel: mali fb000000.gpu: SHADER_CONFIG=0x00000000 L2_MMU_CONFIG=0x00000000 TILER_CONFIG=0x00000000
ek-arm kernel: mali fb000000.gpu: reloading firmware
ek-arm kernel: mali fb000000.gpu: Reset complete
Okay, I got the really long error again. It is a really long Python trace.
I think I even found the difference, or maybe just a difference, between g6p0 and g13p0: one is OpenCL 2.1, the other 3.0. Source: https://www.roselladb.com/install-opencl-orangepi5-debian-ubuntu.htm
via https://github.com/Joshua-Riek/ubuntu-rockchip/issues/879
So, to demystify the name:
libmali <-- Driver name
valhall-g610 <-- Hardware identifier. The Orange Pi 5 Plus has an ARM Mali-G610 MP4, and the codename is Valhall
g6p0 / g13p0 <-- OpenCL version
dummy / gbm / etc. <-- Driver variant/addons
If I am getting it right, which version of OpenCL does Immich expect? Does the Orange Pi need to be restarted to use a different driver? I rebooted. I can see more details about the segfault I am getting in the kernel log:
Aug 08 00:44:53 ek-arm kernel: mali fb000000.gpu: Ctx 5_0 Group 0 CSG 0 CSI: 0
CS_FATAL.EXCEPTION_TYPE: 0x40 (CS_CONFIG_FAULT)
CS_FATAL.EXCEPTION_DATA: 0x0
CS_FATAL_INFO.EXCEPTION_DATA: 0x0
So this one is the g13p0 variant (without a reboot). This one is the g6p0 installed before a reboot.
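For anyone following along, this is roughly the sequence for swapping driver variants on the host (a sketch; it assumes the .deb puts its blob behind the path the container bind-mounts, and the reboot between variants is my own precaution, not something I've confirmed is required):
# install the variant to test (filename from the tsukumijima/libmali-rockchip releases)
sudo dpkg -i libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb
# check which blob now sits behind the path the container bind-mounts
ls -l /usr/lib/aarch64-linux-gnu/libmali*
# then re-create the immich-machine-learning container so it picks up the new blob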
eKristensen
eKristensenOP9mo ago
(Linked thread) Apache TVM Discuss — "Most tasks failed with AutoScheduler on Mali G610 GPU":
"Update: After some tracing effort, the second issue can be resolved: TVM defaults to running at least 1000 ms for every task measurement for a non-CPU target, but for some fast tasks the task is repeated too many times; for the OpenCL target this means too many kernel launch commands are enqueued by clEnqueueNDRangeKernel, causing the out-of-memory error. Addi...
eKristensen
eKristensenOP9mo ago
I am also seeing this:
warning: det_size is already set in detection model, ignore
Error: An error occurred attempting to execute a workload: CL error: clGetEventProfileInfo. Error code: -58 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClMultiplicationWorkload.cpp:82]
Error: An error occurred attempting to execute a workload: CL error: clGetEventProfileInfo. Error code: -58 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClMultiplicationWorkload.cpp:82]
Error: An error occurred attempting to execute a workload: CL error: clGetEventProfileInfo. Error code: -58 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClConvolution2dWorkload.cpp:163]
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:160: RuntimeWarning: overflow encountered in multiply
bbox_preds = bbox_preds * stride
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:160: RuntimeWarning: invalid value encountered in multiply
bbox_preds = bbox_preds * stride
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:162: RuntimeWarning: overflow encountered in multiply
kps_preds = net_outs[idx+fmc*2] * stride
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:162: RuntimeWarning: invalid value encountered in multiply
kps_preds = net_outs[idx+fmc*2] * stride
It works now, no kernel errors, BUT I also see high CPU consumption when I ask the machine learning container to perform ML tasks. It does not say it failed over to CPU. I am using libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb (I replaced the version in the filename to keep track of the variants), WITHOUT passing the firmware to the container; passing the firmware made no difference.
So:
libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb <-- I get no errors, but I think it might secretly be using the CPU anyway (because I see 50-150% CPU consumption when I use the container stats command)
libmali-valhall-g610-g13p0-gbm_v1.9-1-2d267b0_arm64.deb <-- Kernel error, firmware reloading, long Python error message
libmali-valhall-g610-g6p0-dummy_v1.9-1-2d267b0_arm64.deb and libmali-valhall-g610-g6p0-gbm_v1.9-1-2d267b0_arm64.deb <-- segfault
I'll wrap up for today. (I'll try some more rebooting of my device in between the different drivers tomorrow; maybe that helps.) I would like to know what version of OpenCL I should target. Thx 🙂
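A possible way to check the "secretly using the CPU" suspicion (a sketch, based on the fb000000.gpu device name in the kernel log above; the load attribute is an assumption about the Rockchip BSP kernel, cur_freq is a standard devfreq attribute):
# watch the Mali devfreq node while an ML job is running; the clock should ramp up under GPU load
watch -n1 cat /sys/class/devfreq/fb000000.gpu/cur_freq
# Rockchip BSP kernels usually also expose a utilisation figure here (assumption)
cat /sys/class/devfreq/fb000000.gpu/load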
