RunPod•4mo ago
ribbit

cudaGetDeviceCount() Error

When importing the exllamav2 library I get this error, which leaves the serverless worker stuck, repeatedly spitting out the same stack trace:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
What is this error about? Is it the library, or is something wrong with the worker hardware I've chosen? And why doesn't the error stop the worker? It kept running for 5 minutes without me even realizing.
5 Replies
Madiator2011
Madiator2011•4mo ago
What GPU and PyTorch version?
ashleyk
ashleyk•4mo ago
Looks like your Docker image probably uses CUDA 12.1, but you didn't use the CUDA filter and got a worker with CUDA 11.8 or 12.0.
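A quick way to surface that mismatch at worker startup instead of letting it loop is a sanity check before the real imports. This is only a sketch, assuming PyTorch is installed in the image; `torch.cuda.device_count()` is the call that trips the Error 804 when the host driver is older than what the wheel expects:

```python
def cuda_sanity_check() -> str:
    """Return a short report of the CUDA state, or the error that was hit.

    Sketch: call this at the top of the handler and fail fast on a bad
    report, rather than letting the worker spin on the 804 trace.
    """
    try:
        import torch  # assumption: the worker image ships PyTorch
        count = torch.cuda.device_count()  # raises Error 804 on a driver mismatch
        built_for = torch.version.cuda     # CUDA version the torch wheel targets
        return f"ok: {count} device(s), torch built for CUDA {built_for}"
    except Exception as exc:  # e.g. RuntimeError: ... Error 804 ...
        return f"CUDA check failed: {exc}"


if __name__ == "__main__":
    report = cuda_sanity_check()
    print(report)
    if report.startswith("CUDA check failed"):
        raise SystemExit(1)  # stop the worker instead of looping
```

Logging `torch.version.cuda` next to the failure makes it obvious when the image was built for CUDA 12.1 but landed on an 11.8/12.0 host.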
ribbit
ribbit•4mo ago
I only checked these; torch is 2.1.2
ribbit
ribbit•4mo ago
Ah I see, thanks, just realized I didn't have that filter on. I've enabled it already, but why doesn't the worker return an error? It doesn't get stopped automatically.
ashleyk
ashleyk•4mo ago
You probably need to scale workers down to zero and back up again for the change to take effect. No, you need to check things during development and not assume everything is working 🙂