Runpod 17mo ago
Monster

not enough GPUs free

Hi there, I hope you are having a good day. I have a serverless endpoint running on Runpod, created on top of a network storage volume that belongs to the US-OR-1 data center. It was running well for some days, but 20 minutes ago I ran into an issue where no worker can be created because there are no GPU resources. The system throws a log like this repeatedly:

2024-07-13T06:32:22Z create container USERNAME/ENDPOINT
2024-07-13T06:32:22Z error creating container: not enough GPUs free

How can I make sure GPU resources are available whenever a request comes in? Should I move the endpoint and the network volume to another region that has more GPU resources? How often will this shortage happen? It poses a risk to the stability and quality of service, which is critical in most scenarios. Thank you.
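For anyone who wants to quantify how often this error recurs before reporting it, here is a minimal sketch that scans log lines in the format quoted above and counts the "not enough GPUs free" failures. It is not a Runpod tool: the function name `count_gpu_shortages` and the idea of pasting copied log text into it are assumptions made for illustration only.

```python
from datetime import datetime

# Exact error message as it appears in the log quoted in the post.
ERROR_TEXT = "error creating container: not enough GPUs free"


def count_gpu_shortages(log_text):
    """Return (count, first_seen, last_seen) for the GPU shortage error.

    Hypothetical helper: expects lines of the form
    '<ISO timestamp>Z <message>', as in the quoted log.
    """
    stamps = []
    for line in log_text.splitlines():
        parts = line.strip().split(" ", 1)
        if len(parts) != 2 or parts[1] != ERROR_TEXT:
            continue
        # Timestamps look like 2024-07-13T06:32:22Z (UTC).
        stamps.append(datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%SZ"))
    if not stamps:
        return 0, None, None
    return len(stamps), min(stamps), max(stamps)


sample = """2024-07-13T06:32:22Z create container USERNAME/ENDPOINT
2024-07-13T06:32:22Z error creating container: not enough GPUs free"""

count, first, last = count_gpu_shortages(sample)
print(count, first, last)
```

The count and the first and last timestamps give a rough picture of how long a shortage window lasted, which may be useful detail to include in a support ticket.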
20 Replies
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
I did not get it. The image version is my customized version on Docker Hub; it might be V1, V2, V3, anything. How is it related to the competition for GPU resources? And which env file should I edit to add the dummy env, and how? Thank you so much.
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
Anything? Like ENVDUMMY=anything?
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
OK, so you are suggesting it is not actually a resource shortage but a bug, which is why I should add the dummy env and/or update the image version?
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
ok, thx
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
I have added a dummy env. It did work, but that does not mean the problem is solved, since the error went away by itself after 2 or 3 minutes, after a couple of attempts at creating a new worker. So, any ideas or suggestions on how to keep it from happening again? Thx.
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
Yes, I did report it. Thank you so much for the support.
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
"They"? What do you mean by "they"? I thought you were from the Runpod support team. Aren't you?
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
So are you hired by them, or are you a volunteer giving support based on your experience?
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
I feel you are very confident and familiar with all sorts of issues, platforms, and technologies; you would at least be a senior member of their support team. So it surprised me that you do not have access to Runpod internals.
Unknown User 17mo ago
Message Not Public
Monster (OP) 17mo ago
yes, you will for sure. thx anyway.
