Immich•14mo ago

Orange Pi 5 Plus Rockchip RK3588 Hardware ML Acceleration segfault help wanted

I have an Orange Pi 5 Plus that I'm trying to set up for running Immich, and I want hardware-accelerated machine learning. I've been having quite a bit of trouble getting the driver to work. I am confident that my Immich setup is correct and that I have given the container proper access to the hardware, because when I run clinfo inside the container it can see the OpenCL hardware. Furthermore, https://immich.app/docs/features/ml-hardware-acceleration/ says:

In the case of ARM NN, the absence of a Could not load ANN shared libraries log entry means it loaded successfully.

And I do NOT get the load error. I've tried the firmware from https://github.com/JeffyCN/mirrors/raw/libmali/firmware/g610/mali_csffw.binq and the equivalent from https://github.com/tsukumijima/libmali-rockchip/releases, but it makes no difference whether I include that file or not. Between each test I've completely reset the system, nuking all images and deleting storage. I get a segfault when the machine learning runs. Relevant logs are attached, and the part with the error is shown below:

/opt/venv/lib/python3.11/site-packages/skimage/transform/_geometric.py:160: RuntimeWarning: divide by zero encountered in divide
  scale = 1.0 / src_demean.var(axis=0).sum() * (S @ d)
[07/31/24 22:06:51] DEBUG    Checking for inactivity...                         
/opt/venv/lib/python3.11/site-packages/skimage/transform/_geometric.py:165: RuntimeWarning: invalid value encountered in multiply
  T[:dim, :dim] *= scale
[07/31/24 22:07:01] ERROR    Worker (pid:5) was sent SIGSEGV!

message.txt

36 Replies

Immich•14mo ago

:wave: Hey @eKristensen, Thanks for reaching out to us. Please follow the recommended actions below; this will help us be more effective in our support effort and leave more time for building Immich :immich:.

References
- Container Logs: docker compose logs docs
- Container Status: docker compose ps docs
- Reverse Proxy: https://immich.app/docs/administration/reverse-proxy

Checklist
1. :ballot_box_with_check: I have verified I'm on the latest release (note that mobile app releases may take some time).
2. :ballot_box_with_check: I have read applicable release notes.
3. :ballot_box_with_check: I have reviewed the FAQs for known issues.
4. :ballot_box_with_check: I have reviewed Github for known issues.
5. :ballot_box_with_check: I have tried accessing Immich via local ip (without a custom reverse proxy).
6. :ballot_box_with_check: I have uploaded the relevant logs, docker compose, and .env files, making sure to use code formatting.
7. :ballot_box_with_check: I have tried an incognito window, disabled extensions, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable
(an item can be marked as "complete" by reacting with the appropriate number)

If this ticket can be closed you can use the /close command, and re-open it later if needed.

Successfully submitted, a tag has been added to inform contributors. :white_check_mark:

eKristensenOP•14mo ago

Image versions:

REPOSITORY                                  TAG            IMAGE ID      CREATED       SIZE
ghcr.io/immich-app/immich-machine-learning  release-armnn  daa9fc743e8a  25 hours ago  1.21 GB
ghcr.io/immich-app/immich-server            release        b3d108041948  25 hours ago  1.47 GB
docker.io/library/redis                     6.2-alpine     a8fd49b68365  2 months ago  31.9 MB
docker.io/tensorchord/pgvecto-rs            pg14-v0.2.0    e5e5f64f3e66  6 months ago  943 MB

Container run command:

run --name=immich-machine-learning --replace --rm  --network=systemd-immich-net  --device=/dev/mali0:/dev/mali0 -v /usr/lib/aarch64-linux-gnu/libmali.so:/usr/lib/libmali.so:ro -v model-cache:/cache --label io.containers.autoupdate=registry --env-file /home/immich/.env --group-add keep-groups ghcr.io/immich-app/immich-machine-learning:release-armnn

My .env

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=./library
# The location where your database files are stored
DB_DATA_LOCATION=./postgres

# To set a timezone, uncomment the next line and change Etc/UTC to a TZ identifier from this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List
TZ=Europe/Copenhagen

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=<PWD HIDDEN>

# The values below this line do not need to be changed
###################################################################################
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

IMMICH_LOG_LEVEL=debug

To get the clinfo output, I enter the container and run:

  # apt update && apt install clinfo
  # clinfo

My clinfo output is attached.

Related, but not quite the same, GitHub issues:
- https://github.com/immich-app/immich/issues/11019
- https://github.com/immich-app/immich/discussions/10183

I only pass in the relevant files. I am up to date and I reset the cache volume each time. I get no errors related to downloading the models; it even says it loads the models. OS is Armbian Ubuntu Noble Minimal / CLI. I think it is a driver issue, but I am not sure what to do about it.

mertalev•14mo ago

Does the segfault still happen if you set concurrency to 1? Also, the error is coming from skimage, not the armnn session itself. It might be that there’s something wrong with the output from armnn, and trying to use it is what causes the segfault.
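
To illustrate the point about skimage: the warning in the log comes from the line `scale = 1.0 / src_demean.var(axis=0).sum() * (S @ d)`. If the landmark points fed into it are degenerate (for example all identical, which is only an assumption about what a broken model output might look like), their variance is zero and the division no longer yields a finite number. A minimal NumPy sketch of just that arithmetic, not the actual skimage function:

```python
import numpy as np

def similarity_scale(src: np.ndarray, singular_sum: float) -> float:
    """Mimic only the scale term from the logged skimage line.

    `singular_sum` stands in for (S @ d); it is a hypothetical placeholder.
    """
    src_demean = src - src.mean(axis=0)
    # Silence the RuntimeWarning that the log shows, so the result can be inspected.
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(1.0 / src_demean.var(axis=0).sum() * singular_sum)

# Five landmarks that are all the same point: zero variance.
degenerate = np.zeros((5, 2))
# Landmarks that are actually spread out (made-up coordinates).
healthy = np.array([[38.0, 51.0], [73.0, 51.0], [56.0, 71.0], [41.0, 92.0], [70.0, 92.0]])

print(np.isfinite(similarity_scale(healthy, 1.0)))     # finite scale
print(np.isfinite(similarity_scale(degenerate, 1.0)))  # division by zero, not finite
```

A non-finite scale would then propagate into the transform matrix, which matches the second warning in the log ("invalid value encountered in multiply").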

eKristensenOP•14mo ago

I set MACHINE_LEARNING_REQUEST_THREADS=1. New log is attached. Still a segfault, but no info. How can I get info about where the segfault comes from?

message.txt
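
On the question of locating a segfault: the ML service runs on Python (the log paths show /opt/venv/lib/python3.11), and Python's standard library has a faulthandler module that dumps the Python traceback when the process receives SIGSEGV. It can be switched on with the PYTHONFAULTHANDLER=1 environment variable or with `-X faulthandler`. Whether the Immich image already enables it is not established in this thread, so the following is a generic sketch; the deliberate crash through ctypes is only a stand-in for a fault inside a native library.

```python
import subprocess
import sys

# Child program that crashes on purpose by reading address 0 through ctypes.
# This only stands in for a crash inside a native library.
CRASHING_CODE = "import ctypes; ctypes.string_at(0)"

def run_with_faulthandler() -> subprocess.CompletedProcess:
    """Run the crashing child with faulthandler enabled and capture stderr."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CRASHING_CODE],
        capture_output=True,
        text=True,
    )

result = run_with_faulthandler()
print("exit code:", result.returncode)
print(result.stderr)  # contains the Python-level traceback of the crashing call
```

In a container setup this would presumably translate to adding PYTHONFAULTHANDLER=1 to the env file. The traceback shows which Python call was active at the time of the fault, not the native frame inside the driver.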

mertalev•14mo ago

The first time it segfaults in those logs, it loaded the CLIP model first without issue, then segfaulted when it loaded the detection model. The second time, it loaded the detection model fine and segfaulted when it loaded the CLIP model. Can you try only running one task?

eKristensenOP•14mo ago

How do I run just one task? Upload one picture and then? Also, how would I know if it succeeded in doing something? Thanks for your help. I probably won't see your messages for a while. I'll get back to it as soon as I have time again 🙂

mertalev•14mo ago

You can disable one of the tasks temporarily so it doesn’t run when the asset is uploaded. Generally, if there are no errors, the jobs finished running after the upload, and you see a model was loaded in the ML logs, everything is good. There are debug logs for face detection and recognition in the server as well.

eKristensenOP•14mo ago

I think running just a single job worked. I do not see segfault in my log, see attached log. I went to the settings and disabled machine learning face detection. I left smart search on and uploaded a single picture. I could not disable duplicate detection. Of course everything was fresh. I reset everything between all tests to remove variables.

message.txt

mertalev•14mo ago

If you upload multiple assets, does it still work?

eKristensenOP•14mo ago

With the same settings? I got a [08/01/24 22:52:06] ERROR Worker (pid:5) was sent SIGINT! but I'm not sure what that is about. This is the chunk:

[08/01/24 22:51:51] DEBUG    Checking for inactivity...                         
[08/01/24 22:52:00] INFO     Unloaded ANN model 0                               
[08/01/24 22:52:03] DEBUG    Checking for inactivity...                         
[08/01/24 22:52:03] INFO     Shutting down due to inactivity.                   
[08/01/24 22:52:03] INFO     Shutting down                                      
[08/01/24 22:52:03] INFO     Waiting for application shutdown.                  
[08/01/24 22:52:04] INFO     Application shutdown complete.                     
[08/01/24 22:52:04] INFO     Finished server process [5]                        
[08/01/24 22:52:06] ERROR    Worker (pid:5) was sent SIGINT!                    
[08/01/24 22:52:06] INFO     Booting worker with pid: 37                        
[08/01/24 22:52:13] INFO     Started server process [37]                        
[08/01/24 22:52:13] INFO     Waiting for application startup.                   
[08/01/24 22:52:13] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[08/01/24 22:52:13] INFO     Initialized request thread pool with 1 threads.    
[08/01/24 22:52:13] DEBUG    Checking for inactivity...                         
[08/01/24 22:52:13] INFO     Application startup complete.                      
[08/01/24 22:52:23] DEBUG    Checking for inactivity...

I'll try to upload more pictures, without resetting, and see what happens.

mertalev•14mo ago

This is normal. The process kills itself after 5 minutes of idling to release RAM.
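
The log lines above (a cache that unloads after 300s, a periodic "Checking for inactivity..." message, then "Shutting down due to inactivity.") are consistent with an idle watchdog. A rough sketch of that pattern follows; the class and method names are hypothetical and this is not Immich's actual code.

```python
import time
from typing import Callable

class IdleWatchdog:
    """Tracks the last request time and reports when the idle limit is exceeded."""

    def __init__(self, idle_limit_s: float, clock: Callable[[], float] = time.monotonic):
        self.idle_limit_s = idle_limit_s
        self.clock = clock
        self.last_active = clock()

    def touch(self) -> None:
        """Call on every incoming request."""
        self.last_active = self.clock()

    def should_shut_down(self) -> bool:
        """Call periodically (the log suggests roughly every 10 seconds)."""
        return self.clock() - self.last_active > self.idle_limit_s

# Simulated clock so the example runs instantly.
now = [0.0]
watchdog = IdleWatchdog(idle_limit_s=300, clock=lambda: now[0])

now[0] = 200.0
print(watchdog.should_shut_down())  # still within the idle limit
watchdog.touch()
now[0] = 501.0
print(watchdog.should_shut_down())  # 301 seconds since the last request
```

Under this reading, the SIGINT in the log is the worker being stopped on purpose, after which a fresh worker is booted with an empty model cache.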

eKristensenOP•14mo ago

Ok. I uploaded 5 more and still no segfault or anything else that looks out of place. Just:

[08/01/24 23:11:35] DEBUG    Setting model format to armnn                      
[08/01/24 23:11:35] INFO     Loading visual model 'ViT-B-32__openai' to memory  
[08/01/24 23:11:35] DEBUG    Loading visual preprocessing config for CLIP model 
                             'ViT-B-32__openai'                                 
[08/01/24 23:11:35] DEBUG    Loaded visual preprocessing config for CLIP model  
                             'ViT-B-32__openai'                                 
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
[08/01/24 23:11:35] INFO     Loading ANN model                                  
                             /cache/clip/ViT-B-32__openai/visual/model.armnn ...
[08/01/24 23:11:37] INFO     Loaded ANN model with ID 0

That is in between a lot of "Checking for inactivity" lines. Should I upload many pictures and see, or?

mertalev•14mo ago

So now if you enable facial recognition and run face detection on all assets through the job panel, does that work?

eKristensenOP•14mo ago

It looks like it works. At least I cannot spot anything wrong in the log, and the GUI says it is done. Should I try the duplicate detection next? Or maybe upload more pictures with both of them on as they are now?

mertalev•14mo ago

This would be the next thing to test

eKristensenOP•14mo ago

Uploaded one more image and it worked fine. I uploaded 5 and got SIGSEGV. I have removed the "Checking for inactivity" lines. Here is the log output.

message.txt

eKristensenOP•14mo ago

The first is the one picture that worked fine, and then the error comes later. But the error does not repeat like it did before, when I started this ticket. I uploaded many pictures then, though, so maybe if I upload many pictures it will go wrong again.

mertalev•14mo ago

This time, can you disable one of the tasks again, but set the concurrency higher? I think there might be a race condition in the ANN session, but not sure if it only applies to separate models or if it also happens for a single model. Oh, and upload multiple assets at once for this

eKristensenOP•14mo ago

Should I reset or simply restart with another concurrency setting? What would be preferred? or can it be changed in the web gui?

mertalev•14mo ago

You need to comment out the request threads env if it’s still there. If you haven’t changed the job concurrency in the admin settings, you don’t need to touch it. After that, the ML service just needs to be restarted.

eKristensenOP•14mo ago

MACHINE_LEARNING_REQUEST_THREADS <--- this one is the "request threads" ? I'll assume it is and do it.

eKristensenOP•14mo ago

Ok. I uploaded 8 pictures. Only smart search active. Again I filtered the inactivity checks out. Log output attached.

log.txt

mertalev•14mo ago

Oh, that’s interesting. So it seems like it can’t handle concurrency at all. How about if you upload just one image, then upload multiple at once maybe a minute later?

eKristensenOP•14mo ago

Right. Looks like that works. Got no errors. Uploaded one, then waited a minute and uploaded 4 more, with just smart search active and without the MACHINE_LEARNING_REQUEST_THREADS env variable.

mertalev•14mo ago

That narrows it down quite a bit. @zody Do you have an idea for what might be causing this? The segfault happens if multiple requests come in at once when the session loads, but if there’s only one initial request, it can handle concurrency after that.
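
One common way a first-request race like this arises is lazy loading without synchronization: several requests arrive before the model exists and each tries to load it. Whether that is what happens inside the ANN session is not established here. As a generic illustration only (the loader below is a hypothetical stand-in), a lock around the lazy load ensures a single load no matter how many requests arrive together:

```python
import threading
import time

class LazyModel:
    """Loads an expensive resource once, even under concurrent first use."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._model = None
        self.load_count = 0

    def _load(self) -> dict:
        # Hypothetical stand-in for loading a model onto the GPU.
        self.load_count += 1
        time.sleep(0.05)
        return {"name": "demo-model"}

    def get(self) -> dict:
        if self._model is None:          # fast path once loaded
            with self._lock:
                if self._model is None:  # re-check while holding the lock
                    self._model = self._load()
        return self._model

model = LazyModel()
threads = [threading.Thread(target=model.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("loads performed:", model.load_count)
```

This matches the observed behaviour only in shape: a single warm-up request plays the same role as the lock, because the load has finished before any concurrent request arrives.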

eKristensenOP•14mo ago

I am thankful for the help that I am getting here. When does Fynn usually check out Discord? I do not want to sound impatient, but I would also like to do something rather than just waiting if possible. I have a friend with the same hardware. They also have problems, but their environment is too different to tell if it is the same issue we experience. Would it be better if I opened a ticket on Github? If I start with the CPU-version of machine learning can I then switch to hardware ml acceleration later? Is it the same models used ?

mertalev•14mo ago

If I start with the CPU-version of machine learning can I then switch to hardware ml acceleration later? Is it the same models used ?

Yes

Three things you could try to remedy the issue for now:
1. Set MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l.
   - This will make it load these models at startup instead of when it first receives a request. Since the issue seems to be time-sensitive, distancing the model loading from the request might help.
2. Set MACHINE_LEARNING_ANN_TUNING_LEVEL to 1 or 3 (1: rapid, 2: normal, 3: exhaustive, default is 2).
   - 1 might help with timing
   - 3 might help make it choose a different kernel that doesn't have this issue (if this issue is caused by a particular kernel)
3. Set MACHINE_LEARNING_ANN_FP16_TURBO=true
   - Just something to try, no clue if it will help

Also, ideally only try one approach at a time unless none of them work

eKristensenOP•14mo ago

Okay, thanks for giving me something to try. I changed the way I tested slightly: I started from fresh again, but I did not reset between tests. I uploaded some pictures and left the smart search and face ML on. When I changed tests, I asked it to reindex instead of uploading more images. This is enough to trigger the ML process and the errors.
1) Preloading makes no difference for buffalo, but the ViT model completely fails to load and then it fails over to CPU.
2) The tuning levels make no difference that I can observe.
3) I get a new error when I turn on FP16 turbo:

Warning: ERROR: Layer of type Cast is not supported on any preferred backend [GpuAcc ]
Warning: WARNING: Layer of type Cast is not supported on requested backend GpuAcc for input data type Float16 and output data type Float16 (reason: in validate_arguments src/gpu/cl/kernels/ClCastKernel.cpp:62: src and dst data types must be different), falling back to the next backend.

It seems that it can process slowly, because it did manage to detect some faces.

Something a bit interesting, though: I did a sanity check. Without any special option (except for the debug option) I ran face detection and recognition (buffalo) on the pictures already uploaded. I told it to run again on all images and checked that the existing data was gone (no people in the people view). It ran without any issues at all, and now there are people in the people view. Then I ran the smart search (ViT) model on all pictures, and now it segfaults (SIGSEGV). Eventually it did fall back to CPU and then it worked somewhat.

So it seems that the first model it loads can work fine. In previous attempts I did the smart search first and then the face search. It seemed to work fine with smart search and then segfault with the face model, when the ViT model had run before. I don't know whether this info can be used for anything, but something is working, that much is certain.

eKristensenOP•14mo ago

Log when it fails to load vit model, looks almost the same for preload and when it fails with "dynamic" loading:

message.txt

eKristensenOP•14mo ago

What I mean is that it could reprocess the roughly 10 pictures that I had uploaded to the test server without any issues, and I could clearly observe that the process worked (people not present before the run, people present after the run).

A mini-update from me: I've started to use Immich without hardware acceleration. I intend to try hardware acceleration again sometime later, but I also realize that I'll have to check out what happens myself and dive into the source code to do so. It does not sound like there is any easy fix at the moment. Luckily CPU processing isn't too slow, so it's not too bad. In fact it is fast compared to how long it took to make thumbnails. If anyone has any news or anything that I could try, please write 🙂 Thanks for the help, even if the issue did not get resolved this time around.

mertalev•14mo ago

There’s been some discussion in this ticket around using a different libmali.so driver with success https://discord.com/channels/979116623879368755/1245331532533465118

eKristensenOP•14mo ago

Ok, interesting. I did not know there was a related thread. What is the difference between the dummy and gbm versions? Does Immich need the "gbm"?

mertalev•14mo ago

GBM is a Mesa thing that we don't need. It doesn't hurt to use it, but you shouldn't need it.

eKristensenOP•14mo ago

I am getting a bunch of kernel errors. I'm not sure if I got them before, because I didn't check. The errors in the immich-machine-learning container are different now that I use v1.9-1-2d267b0. Right now there are no visible errors in the machine learning container, but I get a bunch of kernel errors about this power transition:

ek-arm kernel: mali fb000000.gpu: Power transition timed out unexpectedly
ek-arm kernel: mali fb000000.gpu:         MCU desired = 1
ek-arm kernel: mali fb000000.gpu:         MCU sw state = 2
ek-arm kernel: mali fb000000.gpu: Current state :
ek-arm kernel: mali fb000000.gpu:         Shader=0000000000000000
ek-arm kernel: mali fb000000.gpu:         Tiler =0000000000000000
ek-arm kernel: mali fb000000.gpu:         L2    =0000000000000001
ek-arm kernel: mali fb000000.gpu:         MCU status = 2
ek-arm kernel: mali fb000000.gpu: Cores transitioning :
ek-arm kernel: mali fb000000.gpu:         Shader=0000000000000000
ek-arm kernel: mali fb000000.gpu:         Tiler =0000000000000000
ek-arm kernel: mali fb000000.gpu:         L2    =0000000000000000
ek-arm kernel: mali fb000000.gpu: Sending reset to GPU - all running jobs will be lost
ek-arm kernel: mali fb000000.gpu: Preparing to soft-reset GPU
ek-arm kernel: mali fb000000.gpu: Wait for MCU power on failed on scheduling tick/tock
ek-arm kernel: mali fb000000.gpu: Resetting GPU (allowing up to 500 ms)
ek-arm kernel: mali fb000000.gpu: Register state:
ek-arm kernel: mali fb000000.gpu:   GPU_IRQ_RAWSTAT=0x00000000   GPU_STATUS=0x00000000  MCU_STATUS=0x00000002
ek-arm kernel: mali fb000000.gpu:   JOB_IRQ_RAWSTAT=0x00000000   MMU_IRQ_RAWSTAT=0x00000000   GPU_FAULTSTATUS=0x00000000
ek-arm kernel: mali fb000000.gpu:   GPU_IRQ_MASK=0x00000000   JOB_IRQ_MASK=0x00000000   MMU_IRQ_MASK=0x00000000
ek-arm kernel: mali fb000000.gpu:   PWR_OVERRIDE0=0x00000000   PWR_OVERRIDE1=0x00000000
ek-arm kernel: mali fb000000.gpu:   SHADER_CONFIG=0x00000000   L2_MMU_CONFIG=0x00000000   TILER_CONFIG=0x00000000
ek-arm kernel: mali fb000000.gpu: reloading firmware
ek-arm kernel: mali fb000000.gpu: Reset complete

Okay, I got the really long error again. It is a really long Python trace.

I think I found what may be the difference between g6p0 and g13p0: one is OpenCL 2.1, the other 3.0. Source: https://www.roselladb.com/install-opencl-orangepi5-debian-ubuntu.htm via https://github.com/Joshua-Riek/ubuntu-rockchip/issues/879

So, to demystify the name:
libmali <-- Driver name
valhall-g610 <-- Hardware identifier. Orange Pi 5 Plus has ARM Mali-G610 MP4 and the codename is Valhall
g6p0 / g13p0 <-- OpenCL version
dummy / gbm / etc <-- Driver variant/addons

If I am getting it right, which version of OpenCL does Immich expect? Does the Orange Pi need to be restarted to use a different driver?

I rebooted. I can see more details about the segfault I am getting in the kernel log:

Aug 08 00:44:53 ek-arm kernel: mali fb000000.gpu: Ctx 5_0 Group 0 CSG 0 CSI: 0
                               CS_FATAL.EXCEPTION_TYPE: 0x40 (CS_CONFIG_FAULT)
                               CS_FATAL.EXCEPTION_DATA: 0x0
                               CS_FATAL_INFO.EXCEPTION_DATA: 0x0

So this one is the g13p0 variant (without a reboot). This one is the g6p0 installed before a reboot.

eKristensenOP•14mo ago

I found this, it may be relevant? https://discuss.tvm.apache.org/t/most-tasks-failed-with-autoscheduler-on-mali-g610-gpu/16139/7

Apache TVM Discuss

Most tasks failed with AutoScheduler on Mali G610 GPU

Update: After some tracing effort, the second issue can be resolved: TVM defaults to run at least 1000ms for every task measurment for non-CPU target, but for some fast tasks the task is repeated too many times, for OpenCL target it means too many kernel launch command is enqueued by clEnqueueNDRangeKernel, causes the out of memory error. Addi...

eKristensenOP•14mo ago

I am also seeing this:

warning: det_size is already set in detection model, ignore
Error: An error occurred attempting to execute a workload: CL error: clGetEventProfileInfo. Error code: -58 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClMultiplicationWorkload.cpp:82]
Error: An error occurred attempting to execute a workload: CL error: clGetEventProfileInfo. Error code: -58 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClMultiplicationWorkload.cpp:82]
Error: An error occurred attempting to execute a workload: CL error: clGetEventProfileInfo. Error code: -58 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClConvolution2dWorkload.cpp:163]
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:160: RuntimeWarning: overflow encountered in multiply
  bbox_preds = bbox_preds * stride
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:160: RuntimeWarning: invalid value encountered in multiply
  bbox_preds = bbox_preds * stride
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:162: RuntimeWarning: overflow encountered in multiply
  kps_preds = net_outs[idx+fmc*2] * stride
/opt/venv/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:162: RuntimeWarning: invalid value encountered in multiply
  kps_preds = net_outs[idx+fmc*2] * stride
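
For reference, OpenCL error codes are plain integers defined in the Khronos cl.h header, and -58 is CL_INVALID_EVENT. A small lookup covering a few of the codes (the selection is arbitrary) makes log lines like the ones above easier to read:

```python
import re

# A few error codes from the Khronos OpenCL header (cl.h).
CL_ERRORS = {
    0: "CL_SUCCESS",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -7: "CL_PROFILING_INFO_NOT_AVAILABLE",
    -30: "CL_INVALID_VALUE",
    -58: "CL_INVALID_EVENT",
}

def describe_cl_error(log_line: str) -> str:
    """Extract 'Error code: N' from a log line and name it if known."""
    match = re.search(r"Error code: (-?\d+)", log_line)
    if match is None:
        return "no OpenCL error code found"
    code = int(match.group(1))
    return CL_ERRORS.get(code, f"unknown code {code}")

line = (
    "Error: An error occurred attempting to execute a workload: "
    "CL error: clGetEventProfileInfo. Error code: -58 at function Execute"
)
print(describe_cl_error(line))
```

The name alone does not say why the event was invalid; it only identifies which condition clGetEventProfileInfo reported.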

It works now, no kernel errors, BUT I also see high CPU consumption when I ask the machine learning container to perform ML tasks. It does not say it failed over to CPU. I am using libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb (I replaced the version in the filename to keep track of the variants) WITHOUT passing the firmware to the container; passing the firmware made no difference.

So:
libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb <-- I get no errors, but I think it might secretly be using the CPU anyway (because I see 50-150% CPU consumption when I use the container stats command)
libmali-valhall-g610-g13p0-gbm_v1.9-1-2d267b0_arm64.deb <-- Kernel error, firmware reloading, long Python error message
libmali-valhall-g610-g6p0-dummy_v1.9-1-2d267b0_arm64.deb and libmali-valhall-g610-g6p0-gbm_v1.9-1-2d267b0_arm64.deb <-- segfault

I'll wrap up for today. (I'll try rebooting my device between the different drivers tomorrow; maybe that helps.) I would like to know what version of OpenCL I should target. Thanks 🙂
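
The package names above follow the naming scheme broken down earlier in this thread. A small parser makes the variants easier to compare side by side. The field names are illustrative labels only; note that the libmali log line in this thread reports the g6p0 part as arm_release_ver, so the sketch calls that field `release`.

```python
import re
from typing import NamedTuple

class MaliPackage(NamedTuple):
    gpu: str       # e.g. "valhall-g610"
    release: str   # e.g. "g13p0"
    variant: str   # e.g. "dummy" or "gbm"
    version: str   # e.g. "v1.9-1-2d267b0"
    arch: str      # e.g. "arm64"

PATTERN = re.compile(
    r"^libmali-(?P<gpu>[a-z]+-g\d+)-(?P<release>g\d+p\d+)-(?P<variant>[a-z0-9-]+)"
    r"_(?P<version>[^_]+)_(?P<arch>[^.]+)\.deb$"
)

def parse_package(filename: str) -> MaliPackage:
    """Split a libmali .deb filename into its named parts."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized package name: {filename}")
    return MaliPackage(**match.groupdict())

for name in [
    "libmali-valhall-g610-g13p0-dummy_v1.9-1-2d267b0_arm64.deb",
    "libmali-valhall-g610-g6p0-gbm_v1.9-1-2d267b0_arm64.deb",
]:
    print(parse_package(name))
```

Recording the parsed `release` and `variant` next to each test result keeps it clear which driver build produced which failure.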
