43 Replies
Solution
The GPU is dedicated to you; it is not shared.
On Vast, the GPU is shared; some have TFLOPS under 82.2 or half the GPU bandwidth. So I hope RunPod gives 100% of the GPU's power.
Internet speed is shared, not the GPU.
Great! Thanks for your help
@flash-singh I got this error when trying to install the environment:
environment: line 75: sudo: command not found
environment: line 76: sudo: command not found
"" command failed with exit code 127.
This doesn't happen when I install on other cloud services.
You don't need sudo, you are already root in almost all RunPod templates.
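If the install script has to run both on RunPod (as root, no sudo binary) and on other clouds (non-root, with sudo), one option is to add sudo only when it is actually needed. This is just a rough sketch: `with_sudo_if_needed` is a made-up helper name and the `apt-get` command is a placeholder, not something from the actual install script.

```python
import os
import shutil


def with_sudo_if_needed(cmd, euid=None):
    """Return cmd unchanged when running as root, else prefix it with sudo."""
    if euid is None:
        euid = os.geteuid()  # 0 means root (POSIX only)
    if euid == 0:
        return list(cmd)
    if shutil.which("sudo") is None:
        raise RuntimeError("not root and sudo is not installed")
    return ["sudo", *cmd]


# Placeholder install step; euid=0 simulates running as root,
# so the command comes back without a sudo prefix.
print(with_sudo_if_needed(["apt-get", "install", "-y", "git"], euid=0))
```

The returned list can be passed straight to `subprocess.run`.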
Also, there's no need to tag RunPod devs unless you have a RunPod-specific hardware issue or similar; the community can help you with things like this.
I got this error when running the project. I don't know what happened with CUDA.
Did you use the CUDA filter at the top of the page to select only CUDA versions 12.1, 12.2 and 12.3 before deploying your pod?
You can run nvidia-smi to check which version of CUDA the host machine has.
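For scripting that check, here is a small sketch that runs `nvidia-smi` and pulls the `CUDA Version` field out of its header. Note that this field is the highest CUDA version the host driver supports, not necessarily the toolkit installed in the container. The function names are made up for illustration, and it returns `None` if `nvidia-smi` is missing or fails.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_output)
    return match.group(1) if match else None


def host_cuda_version():
    """Run nvidia-smi and return the reported CUDA version, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA tooling visible in this environment
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_cuda_version(result.stdout)


print(host_cuda_version())
```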
yep
I want to use the latest version of CUDA.
Looks fine, not sure why you have errors, I suggest logging a GitHub issue for the application you're using.
How can I test before running a project?
Not sure what you mean. You test by running it. It looks like an issue with the application you're using, because the pod's CUDA version is correct for the PyTorch template you're running.
Or maybe the problem is that my GPU isn't visible.
Best to contact the developer of the application if it's not working.
Use the Python CLI, import torch, and check, but it seems to be fine according to nvidia-smi.
My project works well on Vast.AI.
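A minimal sketch of that torch check, assuming you run it inside the pod with the same Python environment the project uses. `torch_cuda_report` is just an illustrative name; the calls inside it are standard `torch.cuda` functions.

```python
def torch_cuda_report():
    """Collect what PyTorch can see; also works if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


for key, value in torch_cuda_report().items():
    print(f"{key}: {value}")
```

If `cuda_available` is False while nvidia-smi still lists the GPU, the problem is more likely in CUDA initialization inside the pod than in the application itself.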
Maybe you're using an unsupported torch version on RunPod
torch 2.2.0 is pretty new, a lot of older projects still rely on version 2.1.2 etc
there is no 2.1.2
do you mean 2.1.1?
Stack Overflow
CUDA initialization: Unexpected error from cudaGetDeviceCount()
I was running a deep learning program on my Linux server and I suddenly got this error.
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions
I searched on Google and the suggestion is to reboot.
How do I reboot without losing installed data?
Looks like RunPod has all of these versions available:
1.13.0
2.0.1
2.1.0
2.1.1
2.2.0
I would try 2.1.1 or 2.1.0 and then if you still have the same issue, try 2.0.1
Do you remember which torch version you were using on Vast?
It is safe to reboot your pod if all of your data is installed on persistent storage or a network volume.
However, you will lose your data on reboot if you installed everything into the container disk
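To get a rough idea of which side a directory is on, you can check whether it sits under the volume mount. This sketch assumes the volume is mounted at `/workspace`, which is an assumption you should verify against your own pod's settings. The project paths are hypothetical examples too.

```python
from pathlib import Path


def is_on_volume(path, volume_root="/workspace"):
    """True if path lies inside volume_root (an assumed mount point)."""
    resolved = Path(path).resolve()
    root = Path(volume_root).resolve()
    return resolved == root or root in resolved.parents


# Hypothetical install locations, for illustration only.
for candidate in ["/workspace/NASChain", "/root/NASChain"]:
    status = "on the volume" if is_on_volume(candidate) else "on the container disk"
    print(candidate, "->", status)
```

This only compares paths; it does not confirm that a volume is actually mounted there.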
here is vast's template
Wow CUDA 12.4 already 😲 . Does that install torch or did you install it yourself?
Could be Vast's template.
I've never had to install PyTorch manually.
Do you have a link to the Github project for your application?
GitHub
GitHub - nimaaghli/NASChain at abb8d2309a769cae54be7190fb2d01f4a66c...
Neural Architecture Search Powered by Bittensor. Contribute to nimaaghli/NASChain development by creating an account on GitHub.
Seems to use torch 2.2.0 - https://github.com/nimaaghli/NASChain/blob/abb8d2309a769cae54be7190fb2d01f4a66c7e1d/requirements.txt#L3
everything seems harder now 😄
So in theory, the RunPod template you're using should be fine, I don't know why you are getting errors.
I want to try rebooting, but it doesn't seem like a good idea 🙂
I doubt rebooting will fix it. When they mention rebooting, they probably mean the machine with the GPU, i.e. the host machine and not your pod.
I'm out of ideas now.
You can try this https://discord.com/channels/912829806415085598/1213495584539811860
Then I got the same error log.
Seems to be some issue with your pod then.
Do I have to create a new one?
Yeah, put pod id here so RunPod staff can check it out, then I suggest terminating it and creating a new one.
ID: c7cg4v0dgysoj5
thanks man, really appreciate!
CUDA 12.4 just got rolled out on RunPod
and it works fine. thanks!
In the template popup, I only see CUDA 12.1.1 as the newest version.