RunPodβ€’5mo ago
TomS

Error 804: forward compatibility was attempted on non supported HW

Writing to the online chat bounces my messages, even though I'm clearly connected.
No description
120 Replies
TomS
TomSβ€’5mo ago
The messages I wrote are actually the main issue I wanted to solve, but since I ran into this, I'm also submitting it.
Madiator2011
Madiator2011β€’5mo ago
It looks like a PyTorch issue
TomS
TomSβ€’5mo ago
The chat messages bouncing or the issue written in the chat?
Madiator2011
Madiator2011β€’5mo ago
I mean your error message
TomS
TomSβ€’5mo ago
It seems that way, but the usual cause is a driver/library version mismatch and the solution is rebooting (which I obviously can't do): https://github.com/pytorch/pytorch/issues/40671, https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch/45319156#45319156
justin
justinβ€’5mo ago
u can use a pytorch template by runpod or if u know the cuda version u can filter gpu pods by cuda versions
TomS
TomSβ€’5mo ago
I did that; the problem is that the machines have different drivers. All of them are 12.2, but only some of them actually work. Looking at the PyTorch issue, it really seems to be because some of the machines have outdated drivers (535.x vs. 525.x). I can provide more information when I run into the issue again, but it's extremely weird that only the machines with older drivers exhibit this error, which suggests it's not an issue with the image I'm using.
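For what it's worth, here's roughly how I understand the failure; a minimal sketch that maps the toolkit to the minimum driver it needs (the threshold numbers below are my reading of NVIDIA's CUDA release notes, so treat them as assumptions to double-check):
import subprocess

# Assumed minimum Linux driver per CUDA toolkit (from NVIDIA's release notes,
# as far as I remember them -- verify before relying on these numbers).
MIN_DRIVER = {"12.0": (525, 60), "12.1": (530, 30), "12.2": (535, 54)}

def installed_driver():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip().splitlines()[0]          # e.g. "525.105.17"
    major, minor, *_ = out.split(".")
    return int(major), int(minor)

def can_run(toolkit):
    # GeForce cards like the 4090 don't get the forward-compatibility package,
    # so a driver older than the toolkit's minimum ends in Error 804.
    return installed_driver() >= MIN_DRIVER[toolkit]

print("driver:", installed_driver(), "can run 12.2 builds:", can_run("12.2"))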
justin
justinβ€’5mo ago
Yeah, definitely something to flag staff about, that is certainly weird / strange 😒
TomS
TomSβ€’5mo ago
In that case I'll update the issue when I run into it again. Which information should I provide? I presume some way to identify the machine?
justin
justinβ€’5mo ago
Yeah, I think a pod identifier. You can stop the pod so you aren't burning money, and just @ one of the active staff; they are generally on US time.
TomS
TomSβ€’5mo ago
Thanks, will do 🙂 Just ran into the same issue. The pod ID is efm8o6l8qebm1y; the nvidia-smi output is the following:
Mon Jan 15 15:58:33 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:81:00.0 Off | Off |
| 0% 30C P8 22W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Pinging @Madiator2011 as you suggested. Can't pause the Pod (I think because I have a volume mounted). The full message is the following:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling
NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on
non supported HW
MRE:
import torch

torch.cuda.is_available() # raises the error
PyTorch is installed with pip install torch torchaudio torchvision. I am using Python 3.10.13 installed with pyenv.
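In case it's useful, here's a slightly expanded version of the MRE that also prints the torch build and the driver it sees; just a sketch, nothing beyond torch and nvidia-smi is assumed:
import subprocess

import torch

print("torch:", torch.__version__, "built against CUDA:", torch.version.cuda)
print(subprocess.check_output(["nvidia-smi"], text=True))

try:
    # On the affected machines this is where Error 804 shows up.
    print("cuda available:", torch.cuda.is_available())
    print("device count:", torch.cuda.device_count())
except RuntimeError as exc:
    print("CUDA init failed:", exc)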
Madiator2011
Madiator2011β€’5mo ago
probably outdated pytorch version
TomS
TomSβ€’5mo ago
@Madiator2011 this only seems to happen on certain machines though (more specifically certain 4090s). The image has stayed the same.
Madiator2011
Madiator2011β€’5mo ago
Newer models had the same issue with the H100. @TomS, what output do you get from nvcc --version?
TomS
TomSβ€’5mo ago
I don't seem to have that command available
Madiator2011
Madiator2011β€’5mo ago
what docker image are you using?
TomS
TomSβ€’5mo ago
I am using Nvidia's nvidia/cuda:12.2.0-devel-ubuntu22.04 image
Madiator2011
Madiator2011β€’5mo ago
Weird, nvcc mostly comes with CUDA.
ashleyk
ashleykβ€’5mo ago
Probably just not in the PATH; you probably have to run something like /usr/local/cuda/bin/nvcc --version
TomS
TomSβ€’5mo ago
You're right, my bad!
root@5efe6f9cf8f9:/usr/local/cuda-12.2/bin# ./nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Madiator2011
Madiator2011β€’5mo ago
mind trying
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
TomS
TomSβ€’5mo ago
Getting the same error.
TomS
TomSβ€’5mo ago
The package versions are attached (just for completeness).
TomS
TomSβ€’5mo ago
Also tried this for cu118 but no luck. Terminating the pod.
ashleyk
ashleykβ€’5mo ago
Was it community cloud or secure cloud?
TomS
TomSβ€’5mo ago
Secure cloud (all of the ones I tried, both the non-affected 4090s and the affected ones). Is it possible that the issue is outdated drivers on certain machines, like the PyTorch GitHub issue suggests? (Some are 525.x and some 535.x, like I mentioned.)
ashleyk
ashleykβ€’5mo ago
@flash-singh / @Justin / @JM this is an issue for people using my templates as well; can you do something to fix these broken drivers please? This issue is specific to the 4090. They are more expensive than the 3090, A5000, etc., but their drivers are broken, making them completely unusable.
JM
JMβ€’5mo ago
@TomS Could you provide me the pod ID of one of those machines that you are facing this problem?
ashleyk
ashleykβ€’5mo ago
The person having issues with my template was using a 4090 in RO. @Finn do you have the pod ID?
JM
JMβ€’5mo ago
Also, to clarify, do you have a hard requirement with CUDA 12.2+?
ashleyk
ashleykβ€’5mo ago
My requirement is CUDA 12.1+
Finn
Finnβ€’5mo ago
9gi3jqiqlts2ou
ashleyk
ashleykβ€’5mo ago
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:82:00.0 Off | Off |
| 0% 29C P8 11W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------
Finn
Finnβ€’5mo ago
I tried with 4 different 4090s: jvkvnd5uu2crj2, oonjzmqb2rw7qj, qswdrg5ltpr0v1, 9gi3jqiqlts2ou
ashleyk
ashleykβ€’5mo ago
It has the correct version of CUDA, but it also has a 525.x driver version like Tom's, not 535.x.
JM
JMβ€’5mo ago
That's a CUDA 12.0 machine. Make sure to filter using the UI.
No description
ashleyk
ashleykβ€’5mo ago
Tom also gave his pod id above: efm8o6l8qebm1y
TomS
TomSβ€’5mo ago
Sorry, was gone for a bit. Yes, the specified ID was filtered to be 12.2.
Finn
Finnβ€’5mo ago
How do I do that?
TomS
TomSβ€’5mo ago
nvidia-smi correctly showed CUDA 12.2
ashleyk
ashleykβ€’5mo ago
Yeah, this issue is happening on machines that have the correct CUDA version, as shown in the screenshot above. It's in the filters at the top of the page.
JM
JMβ€’5mo ago
- 9gi3jqiqlts2ou: 12.0
- jvkvnd5uu2crj2: 12.2
- oonjzmqb2rw7qj: 12.2
- qswdrg5ltpr0v1: 12.0
- 9gi3jqiqlts2ou: 12.0
ashleyk
ashleykβ€’5mo ago
So 2 of these should work, but they don't 🤷‍♂️
JM
JMβ€’5mo ago
Correct. But that's a start, right? CUDA is very good for backward compatibility, but horrible for forward. The versions I provided are the CUDA versions installed on the bare-metal machines. That being said, I see some are running on Ubuntu 20.04 and some on Ubuntu 22.04. Do you know if your image also has some kernel requirements? I know some require 5.15+ for example.
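If it helps, a quick way to capture the kernel next to the CUDA info would be something like the snippet below; the 5.15 threshold is only the example I mentioned, not a confirmed requirement for your image:
import platform

# Example threshold only -- check your image's own docs for the real requirement.
MIN_KERNEL = (5, 15)

def kernel_meets(minimum=MIN_KERNEL):
    major, minor = platform.release().split(".")[:2]   # e.g. "6.2.0-36-generic"
    return (int(major), int(minor)) >= minimum

print("kernel:", platform.release(), "meets", MIN_KERNEL, ":", kernel_meets())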
ashleyk
ashleykβ€’5mo ago
This must be jvkvnd5uu2crj2, and it's 12.1, but it throws the Error 804: forward compatibility was attempted on non supported HW error.
JM
JMβ€’5mo ago
If that's the case, that would be extremely valuable information
TomS
TomSβ€’5mo ago
My issue occurred on efm8o6l8qebm1y, which is flagged as 12.2 but whose drivers are older than those of other 12.2 machines where this issue didn't arise. What info should I provide that will help debug this?
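Happy to run something like this on the next broken pod and paste the output; note that using the RUNPOD_POD_ID environment variable to identify the machine is an assumption on my side:
import os
import platform
import subprocess

def debug_report():
    # RUNPOD_POD_ID is assumed to be set inside the pod; otherwise paste the ID manually.
    return "\n".join([
        "pod id: " + os.environ.get("RUNPOD_POD_ID", "unknown"),
        "kernel: " + platform.release(),
        subprocess.check_output(["nvidia-smi"], text=True),
    ])

print(debug_report())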
ashleyk
ashleykβ€’5mo ago
Yeah I think you were onto something with the 525.x and 535.x.
JM
JMβ€’5mo ago
- jvkvnd5uu2crj2: Ubuntu 22.04.3 LTS, 6.2.0-36-generic, CUDA 12.2
ashleyk
ashleykβ€’5mo ago
Oh, but you said it's 12.1 in your list of pod IDs?
JM
JMβ€’5mo ago
Edited. Alright, let me investigate, that's weird
ashleyk
ashleykβ€’5mo ago
Yeah, it's also weird that when @Finn ran nvidia-smi it showed CUDA 12.1, and none of the pod IDs in your list has 12.1 🤷‍♂️
TomS
TomSβ€’5mo ago
@JM let me know if you need anything that will help you debug; I can e.g. share the Dockerfile.
ashleyk
ashleykβ€’5mo ago
@TomS which region was your pod in?
TomS
TomSβ€’5mo ago
@ashleyk EU-RO-1
JM
JMβ€’5mo ago
That's most likely a driver + library mismatch
ashleyk
ashleykβ€’5mo ago
Finn's pods were also in EU-RO-1.
JM
JMβ€’5mo ago
I will sort this out. Thanks a lot for uncovering this
Finn
Finnβ€’5mo ago
We're unable to access any ports on the 5 GPUs we've now spun up. What can we do? Our real-time service is currently down.
ashleyk
ashleykβ€’5mo ago
Did you try 3090?
JM
JMβ€’5mo ago
@Finn In your case, make sure to filter for your required CUDA version too! You have several not meeting your requirements.
ashleyk
ashleykβ€’5mo ago
Are you using network storage?
Finn
Finnβ€’5mo ago
Log looks to be in a loop
No description
ashleyk
ashleykβ€’5mo ago
Is this in CZ region?
Finn
Finnβ€’5mo ago
Yes. Idk what you mean. We have two and neither of them is working.
ashleyk
ashleykβ€’5mo ago
Only the 4090 is working in CZ; the others are broken. I mentioned this to flash-singh but didn't get a response. @JM can you look into why the A5000 and 3090 are broken in CZ as well? Do you need secure cloud specifically? If not, I suggest using an A5000 in community cloud in the SK region. I always use those and never have issues.
JM
JMβ€’5mo ago
Sure, I can sort this out as well. What do you mean by "broken"?
ashleyk
ashleykβ€’5mo ago
See the screenshot above from @Finn; it gets into a loop and the container doesn't start. A few people had this issue today, including me.
JM
JMβ€’5mo ago
Pod ID please?
ashleyk
ashleykβ€’5mo ago
I could start a 4090 in CZ but not an A5000 or 3090. The host machine is probably out of disk space or something.
Finn
Finnβ€’5mo ago
gmue2eh0wj8ybu
JM
JMβ€’5mo ago
Have you tried US-OR and EU-IS for 4090s as well?
ashleyk
ashleykβ€’5mo ago
No, I was trying to reproduce the issue other people reported and ran into the same issue they did.
JM
JMβ€’5mo ago
Ok thanks. Most likely a driver issue would be my guess, but those are being tested as we speak.
Finn
Finnβ€’5mo ago
Looks like all the GPUs are broken
JM
JMβ€’5mo ago
All? If that's the case there might be something bigger
Finn
Finnβ€’5mo ago
I can't get a single one to work, even after filtering for 12.2. This is a mess.
ashleyk
ashleykβ€’5mo ago
Pod ID: y7yvgvzcaoeld1 (A5000)
Pod ID: 9bcyhnm2hqpbme (3090)
Did you try an A5000 in the SK region in Community Cloud?
JM
JMβ€’5mo ago
And is it working with other images, or do all images fail?
Finn
Finnβ€’5mo ago
I can try
JM
JMβ€’5mo ago
@Justin Could you give me a hand please? Let me know if you are available.
Finn
Finnβ€’5mo ago
I haven't tried with other images. Isn't Community Cloud less reliable?
JM
JMβ€’5mo ago
@TomS That image will only work on CUDA 12.2+ AND on a specific kernel of Ubuntu 22.04; it's not super compatible. @ashleyk, in your case, which template were you using?
Finn
Finnβ€’5mo ago
Trying now...
ashleyk
ashleykβ€’5mo ago
Supposedly, but secure cloud is less reliable these days: outages in CZ, SE, etc. RunPod PyTorch 2.1. I am currently testing a 4090 in the US-OR-1 region as well.
JM
JMβ€’5mo ago
That's not normal
ashleyk
ashleykβ€’5mo ago
What's not? @Finn, the 4090 in US-OR-1 is fine. All ports are up, even on the 1.9.3 image.
JM
JMβ€’5mo ago
I believe @Finn's issue is different from yours. He was not using the same image earlier, unless he is using a different one now.
ashleyk
ashleykβ€’5mo ago
I am helping @Finn , I found a solution for him
Finn
Finnβ€’5mo ago
what's the solution?
ashleyk
ashleykβ€’5mo ago
use 4090 in US-OR-1 region
ashleyk
ashleykβ€’5mo ago
@Finn ^^ all ports working
Finn
Finnβ€’5mo ago
Look good @ashleyk ?
No description
Finn
Finnβ€’5mo ago
Trying with OR
ashleyk
ashleykβ€’5mo ago
@TomS maybe you can try 4090 in US-OR-1 as well and see if it solves the issue for you too.
JM
JMβ€’5mo ago
Update: CZ cannot pull any image. Will sort this out.
ashleyk
ashleykβ€’5mo ago
Thanks. It's a different issue from the main thread here, but we ran into it when trying to use a different GPU type while trying to solve the main issue here 😆
Finn
Finnβ€’5mo ago
This solved it! RO is trash. That wasted us a few hours; can you guys please add some quality control? This is not the first time I've had issues with RO. It's really detrimental to our end service.
JM
JMβ€’5mo ago
Yep! Here are a couple of things:
- There might be a driver mismatch in RO (waiting on confirmation).
- Second thing: I previously saw an attempt to use a newer CUDA build on a machine with an older CUDA version. Remember to use the filter if you have requirements! (A rough self-check for this is sketched below.)
- We will update everything on the platform to be 12.0+ in the next 2 months.
- Last thing: if you use Nvidia images, there might be a lot of requirements to make them compatible, including the kernel version. Those are not plug and play everywhere.
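On the second point, a rough self-check you could run at container start is to compare the CUDA version torch was built against with the one the driver reports. A sketch (note that on Tom's broken pod nvidia-smi still reported 12.2, so treat this as a first filter rather than a guarantee):
import re
import subprocess

import torch

def driver_reported_cuda():
    # The "CUDA Version" in nvidia-smi's header is the newest toolkit the driver claims to support.
    head = subprocess.check_output(["nvidia-smi"], text=True)
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", head)
    return (int(m.group(1)), int(m.group(2))) if m else (0, 0)

def torch_built_cuda():
    major, minor = torch.version.cuda.split(".")[:2]
    return int(major), int(minor)

drv, built = driver_reported_cuda(), torch_built_cuda()
print("driver reports up to", drv, "- torch built against", built)
if built > drv:
    raise SystemExit("torch build is newer than the driver supports -- expect Error 804 on GeForce pods")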
ashleyk
ashleykβ€’5mo ago
Would be better to update to 12.1+ rather than 12.0+ because oobabooga now requires 12.1 minimum
JM
JMβ€’5mo ago
Not all pods you provided were broken. RO has been incredible so far, both in terms of deployment and speed of service. Uptime average is above 99.9% 🙂 Please take note of the above, as this is important to make sure your deployments are as smooth as possible. As for CZ, DM me and I can provide some credits; this networking redesign has been quite challenging. We will sort this out asap. We do those in batches to maintain availability, but we will be working toward 12.0+, then 12.1+, then even 12.2+.
ashleyk
ashleykβ€’5mo ago
By the way, CUDA 12.3 still hasn't been added to the filter.
Finn
Finnβ€’5mo ago
There were 4 RO 4090s I tested; they had 12.2 and none of them worked, not to mention the ones running 12.1.
ashleyk
ashleykβ€’5mo ago
Something wrong here then
9gi3jqiqlts2ou: 12.0
jvkvnd5uu2crj2: 12.2
oonjzmqb2rw7qj: 12.2
qswdrg5ltpr0v1: 12.0
It only lists 12.0 and 12.2 and not 12.1. I got a 4090 in RO with a 535.x driver and CUDA 12.2 and it's fine:
root@7d563a2fc889:~# nvidia-smi
Mon Jan 15 23:27:11 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:02:00.0 Off | Off |
| 0% 30C P8 18W / 450W | 6MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
JM
JMβ€’5mo ago
Why is it wrong? I am pulling that info from the DB. Also, update: we have uncovered the culprit. The Docker caching at the new location was the problem. We are fixing it. Yep, same; the 12.2 ones in RO appear to be fine from what I have tested. I believe it might be an isolated issue with one or two servers. Were those the pod IDs you provided earlier? Because only 2 had 12.2. @ashleyk @Finn Should be solved in CZ 🙂 The goal for Q1-Q2 of this year is to have pristine, state-of-the-art standards. Keep us updated with anything you find, and we can knock it out.
NERDDISCO
NERDDISCOβ€’5mo ago
So if I understand this thread correctly, we can only use CUDA <= 12.2 right now, or?
ashleyk
ashleykβ€’5mo ago
No, that's not correct; it depends on the template you are using. oobabooga requires 12.1 or higher. But the main issue in this thread is that there are 4090s in EU-RO-1 with broken drivers.
NERDDISCO
NERDDISCOβ€’5mo ago
oh ok sorry, then I will open a new one
Pierre Nicolas
Pierre Nicolasβ€’4mo ago
And if we uncheck EU-RO, the 4090s are unavailable on serverless. @JM, when you have a solution can you post it in general, and let us know whether we need to adapt the Docker image or add a serverless parameter to check the driver?
JM
JMβ€’4mo ago
@Pierre Nicolas hey! Actually, the CUDA filter is out now 🙂
JM
JMβ€’4mo ago
Did you guys notice that yet?
No description
JM
JMβ€’4mo ago
Depending on what Docker image you use, it might be good practice to select 12.0+, or even 12.1+.
justin
justinβ€’4mo ago
OMG YASSS THANK U!!!
Pierre Nicolas
Pierre Nicolasβ€’4mo ago
Ok, thank you, we'll try it tomorrow.
ashleyk
ashleykβ€’4mo ago
I don't see the ability to filter CUDA versions under advanced settings in Serverless.
ashleyk
ashleykβ€’4mo ago
Mine still shows this
No description
JM
JMβ€’4mo ago
Ah damn, didn't realize that it was a private beta release. Expect this feature very very soon; that means it's being tested 😊 😊 😊
Pierre Nicolas
Pierre Nicolasβ€’4mo ago
ok coming soon 😀
NERDDISCO
NERDDISCOβ€’4mo ago
Thanks for doing this! Looking forward to trying it out 🙏