RunpodR
Runpod2y ago
8 replies
mozthefox

Could not find CUDA drivers

I am experiencing issues with the Stable Diffusion Kohya_ss ComfyUI Ultimate template. I have setup an RTX 3090 pod, transferred the training images and setup Kohya.

I am really new to RunPod, so I apologise if I'm misunderstanding something or missed something obvious.

When I begin training, the Kohya log file displays the following message:

2024-03-11 20:04:56.909562: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.518054: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-11 20:04:57.518308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-11 20:04:57.643386: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-11 20:04:57.881246: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.883751: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 20:05:02.761582: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT


Is this normal? Also, the training process is reporting 2.14s/it and with Epoch set to 10 and 7000 steps it will take about 42 hours. Is that right?

Thanks,
James
Solution
Check your GPU memory and GPU utilization and you will see that the GPU is being used. This is just some weird tensorflow error.
Was this page helpful?