RunPod•4mo ago
annah_do

Pod is unable to find/use GPU in python

Hi, I'm trying to connect to this pod:

RunPod Pytorch 2.2.10
ID: zgel6p985mjmmn
1 x A30, 8 vCPU, 31 GB RAM
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
On-Demand - Community Cloud, Running
40 GB Disk, 20 GB Pod Volume, Volume Path: /workspace

I can see that it has a GPU with nvidia-smi, and the CUDA and PyTorch versions seem correct, but I cannot use the GPU with torch... Can anyone help? Best

```
root@54be7382bee1:~# python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.__version__
'2.2.0+cu121'
>>> exit()
root@54be7382bee1:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```
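The session above shows torch built for CUDA 12.1 failing to initialize against the host driver. One thing worth checking in cases like this is the basic compatibility rule: the driver's reported CUDA version must be at least the CUDA runtime torch was compiled against. A minimal sketch of that check (`cuda_compatible` is a hypothetical helper, not part of torch or RunPod):

```python
def cuda_compatible(driver_cuda: str, torch_cuda: str) -> bool:
    """Return True if the driver's max supported CUDA version is >= the
    CUDA runtime version the torch wheel was built against."""
    def parse(v: str) -> tuple:
        major, minor = v.split(".")[:2]
        return (int(major), int(minor))
    return parse(driver_cuda) >= parse(torch_cuda)

# In this thread: nvidia-smi reports CUDA 12.3, torch is '2.2.0+cu121' (CUDA 12.1)
print(cuda_compatible("12.3", "12.1"))  # True
```

Since 12.3 >= 12.1, a plain version mismatch is not the root cause here, which points toward the driver itself (the not-production-ready 12.3 build discussed below) rather than the image.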
Solution:
@Dhruv Mullick I don't think it has to do with the image. When you select a pod on the RunPod website, there is a filter button at the top with a drop-down menu where you can set 12.2 as the "Allowed CUDA Versions". As @ashleyk pointed out earlier, the machine is running CUDA 12.3, which is not production ready. If I select 12.2, it works.
17 Replies
annah_do
annah_do•4mo ago
```
root@54be7382bee1:~# nvidia-smi
Fri Feb 23 11:56:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:00:06.0 Off |                   On |
| N/A   45C    P0              31W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
```
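For scripting this kind of check, the driver and CUDA versions in the nvidia-smi banner can be pulled out with a small parser. A sketch, assuming the standard banner layout (`parse_smi_versions` is a hypothetical helper; in practice you would feed it the output of `subprocess.run(["nvidia-smi"], ...)`):

```python
import re

# Sample header line as printed by nvidia-smi in this thread
SMI_HEADER = "| NVIDIA-SMI 545.23.08    Driver Version: 545.23.08    CUDA Version: 12.3 |"

def parse_smi_versions(text: str) -> dict:
    """Extract the driver version and the max-supported CUDA version
    from nvidia-smi's header line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text).group(1)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text).group(1)
    return {"driver": driver, "cuda": cuda}

print(parse_smi_versions(SMI_HEADER))  # {'driver': '545.23.08', 'cuda': '12.3'}
```

Here the driver reports CUDA 12.3 even though the container image was built for 12.1, which is what the rest of the thread digs into.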
ashleyk
ashleyk•4mo ago
Maybe because the machine is running CUDA 12.3 which is not production ready.
annah_do
annah_do•4mo ago
most machines use CUDA 12.3 and with the 48GB GPU it works
ashleyk
ashleyk•4mo ago
@JM said they should all be on 12.2 because 12.3 is not production ready. I haven't seen any machines on 12.3 personally.
annah_do
annah_do•4mo ago
hm, just double checked and you are right. My 48GB GPU is actually on 12.2... will keep an eye open for this in the future...
Dhruv Mullick
Dhruv Mullick•4mo ago
@ashleyk how do we use 12.2? I spawned an H100 SXM5 pod with the image runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04, but nvidia-smi still shows CUDA 12.3. ID: axwx9s1edwts9x. Facing the same issue as @annah_do. This happens even if I change my template to runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04.
Solution
annah_do
annah_do•4mo ago
@Dhruv Mullick I don't think it has to do with the image. When you select a pod on the RunPod website, there is a filter button at the top with a drop-down menu where you can set 12.2 as the "Allowed CUDA Versions". As @ashleyk pointed out earlier, the machine is running CUDA 12.3, which is not production ready. If I select 12.2, it works.
annah_do
annah_do•4mo ago
(screenshot attached)
Dhruv Mullick
Dhruv Mullick•4mo ago
Awesome, thank you @annah_do ! I thought it was the image that was controlling this.
Dhruv Mullick
Dhruv Mullick•4mo ago
Even with CUDA 12.2 I'm seeing the same error now
(screenshot attached)
ashleyk
ashleyk•4mo ago
How did you install torch? Probably conda breaking stuff, conda sucks
Dhruv Mullick
Dhruv Mullick•4mo ago
I just used the torch from the latest torch + CUDA template (I think it was runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04, but I've now deleted the pod)
ashleyk
ashleyk•4mo ago
RunPod templates don't use conda though, as far as I'm aware. Your application probably installed it.
Dhruv Mullick
Dhruv Mullick•4mo ago
This is a clean VM, with no other commands executed but the ones shown above 😅
ashleyk
ashleyk•4mo ago
That's not true; a clean pod does not say (torch_env) in front of the prompt like yours does.
(screenshot attached)
ashleyk
ashleyk•4mo ago
That only happens when that crap conda gets installed. And it shows that CUDA is available on A100.
```
>>> torch.cuda.is_available()
True
```
So I don't know what you are doing, but you are clearly doing something wrong.
JM
JM•4mo ago
Hey guys! Yep, thanks @ashleyk. Indeed, it's possible that some machines slipped through with 12.3, but the bulk is on 12.2. As already mentioned, 12.3 is beta and we recommend production-ready drivers 🙂