Runpod17mo ago
Nafi

0 GPU pod makes no sense

I have network storage attached to my pods. I don't care if a GPU gets taken from me, but it's very inconvenient that I have to spin up a completely new pod when it does. I am automating RunPod via the CLI, and at the moment I don't see any way to deploy a fresh instance and GET the SSH endpoint. I think just adding a warning that you have to start fresh when a GPU gets taken, and then finding the next available one, makes much more sense, especially when using network storage.
42 Replies
digigoblin
digigoblin17mo ago
0 GPU is not a thing when you use network volumes, that only happens when you don't use a network volume
Nafi
NafiOP17mo ago
It's happened several times to me with a network volume attached. I can assure you of this because the data is persistent.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
Essentially I want to be able to start an exited pod, but I don't care if the GPU returns to the pool; I am happy to use the next available one. The issue is that every so often the instance can only have 0 GPUs, so I have to redeploy a completely new pod on the network storage. I cannot do this via the CLI, so I have to do it manually, which defeats the purpose of the automation. Here's a sample response when trying to start the exited pod via the CLI:
Error: There are not enough free GPUs on the host machine to start this pod.
Next time it happens I can send a screenshot of the runpod UI
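The fallback loop being asked for here can be sketched as a small wrapper: try to start the exited pod, and if the host reports no free GPUs, create a fresh pod on the same network volume instead. This is a sketch, not an official RunPod feature; the pod ID and create flags are placeholders, and the error-string match is based on the CLI output quoted above:

```python
import subprocess

def ensure_pod_running(pod_id, create_args, run=None):
    """Try to start an exited pod; if the host has no free GPUs,
    fall back to creating a fresh pod (e.g. on the same network volume).

    `run` is injectable for testing; by default it invokes runpodctl.
    """
    if run is None:
        def run(args):
            p = subprocess.run(["runpodctl"] + args,
                               capture_output=True, text=True)
            return p.returncode, p.stdout + p.stderr

    code, out = run(["start", "pod", pod_id])
    if code == 0:
        return "started"
    if "not enough free GPUs" in out:
        # The host lost our GPU: redeploy on the network volume instead.
        code, out = run(["create", "pod"] + create_args)
        if code == 0:
            return "recreated"
    raise RuntimeError(out)
```

The caller would then re-query the new pod's SSH endpoint after a "recreated" result.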
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
runpodctl stop pod <id>
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
can be done via CLI?
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
and create new one via CLI?
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
Yes, I see now. I will try, thanks.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
what am I missing here?
runpodctl create pod --networkVolumeId "fpomddpaq0" --gpuType "RTX A6000" --templateId "8wwnezvz5k" --imageName "runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04" --cost 1.00
Error: There are no longer any instances available with the requested specifications. Please refresh and try again.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
Tried removing cost, and setting the GPU type to "L40". I copied the IDs directly; I will double-check. Yeah, nothing. I'm also unsure why imageName has to be specified if you can specify the template.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
Yes: Error: required flag(s) "imageName" not set
Unknown User
Unknown User17mo ago
Message Not Public
Madiator2011
Madiator201117mo ago
imageName is your Docker image name
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
? runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04 ^^^^^ That doesn't happen, though. The GPU type I use is never unavailable; it's just that the GPU gets taken from me and there's no option to fetch another from the pool. I raised an issue about imageName and they are handling it internally, but that isn't even necessary for me, as long as I can pull a new GPU from the pool rather than being left with 0 GPUs on my pod, even with network storage attached.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
this would say 0xL40
(screenshot attached)
Nafi
NafiOP17mo ago
even though I have a network volume attached to it. Stopping is fine; it's starting it that fails
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
Tried that
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
Do you have an exact create pod command that works for you (CLI)? I tried; the issue occurred with imageName, and they raised it internally. Sample:
runpodctl create pod --secureCloud --gpuType 'L40' --imageName 'runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04' --networkVolumeId 'fpomddpaq0' --ports '8888/http,22/tcp' --templateId '8wwnezvz5k'
Output:
Error: There are no longer any instances available with the requested specifications. Please refresh and try again.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
GraphQL would be fantastic, is there documentation? Found it. GraphQL didn't fix the problem :/ For this input:
{
"input": {
"cloudType": "ALL",
"gpuCount": 1,
"gpuTypeId": "NVIDIA L40",
"volumeInGb": 40,
"containerDiskInGb": 40,
"minVcpuCount": 2,
"minMemoryInGb": 15,
"name": "RunPod Test Pod",
"imageName": "runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04",
"dockerArgs": "",
"ports": "8888/http,22/tcp",
"volumeMountPath": "/workspace",
"startJupyter": False,
"startSsh": True,
"supportPublicIp": True,
"templateId": "8wwnezvz5k",
"networkVolumeId": "fpomddpaq0",
}
}
Output:
Deployment Response: {'errors': [{'message': 'Something went wrong. Please try again later or contact support.', 'locations': [{'line': 12, 'column': 5}], 'path': ['podFindAndDeployOnDemand', 'gpus'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}], 'data': {'podFindAndDeployOnDemand': None}}
For this input:
{
"input": {
"cloudType": "SECURE",
"gpuCount": 1,
"gpuTypeId": "NVIDIA L40",
"cloudType": "SECURE",
"networkVolumeId": "fpomddpaq0",
"ports": "8888/http,22/tcp",
"startJupyter": False,
"startSsh": True,
"supportPublicIp": True,
"templateId": "8wwnezvz5k",
}
}
Output:
Deployment Response: {'errors': [{'message': 'There are no longer any instances available with enough disk space.', 'path': ['podFindAndDeployOnDemand'], 'extensions': {'code': 'RUNPOD'}}], 'data': {'podFindAndDeployOnDemand': None}}
Confirmed an internal problem then
digigoblin
digigoblin17mo ago
It's not an internal problem, it's a problem with your request: you can't specify a networkVolumeId without a data center ID.
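That constraint can be made explicit in a small payload builder for the podFindAndDeployOnDemand input. This is a sketch based on the inputs quoted above plus digigoblin's point; the exact dataCenterId field name is an assumption from that comment, and all IDs in the test are placeholders:

```python
def build_deploy_input(gpu_type_id, image_name, template_id,
                       network_volume_id=None, data_center_id=None):
    """Build the `input` object for podFindAndDeployOnDemand.

    Per the fix above: a networkVolumeId is only valid together with
    the data center that hosts the volume.
    """
    if network_volume_id and not data_center_id:
        raise ValueError("networkVolumeId requires a dataCenterId")
    payload = {
        "cloudType": "SECURE",
        "gpuCount": 1,
        "gpuTypeId": gpu_type_id,
        "imageName": image_name,
        "templateId": template_id,
        "ports": "8888/http,22/tcp",
        "startSsh": True,
        "supportPublicIp": True,
    }
    if network_volume_id:
        # Pin the deployment to the volume's data center.
        payload["networkVolumeId"] = network_volume_id
        payload["dataCenterId"] = data_center_id
    return payload
```

With this check in place, the failing requests earlier in the thread would be rejected client-side instead of returning an opaque server error.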
℠
17mo ago
The UX generally leaves something to be desired when it comes to provisioning and terminating resources, for whatever reason. The fact that they don't leave the pod in a stopped state when they shut it down is really frustrating, as it leaves you no recourse when they decide to kill an instance you are using because your balance is nearing $0. No warning, no idle state, no buffer at all.
digigoblin
digigoblin17mo ago
Not sure what you're referring to, but it sounds like a different issue from this thread.
℠
17mo ago
It's a broader version of this issue.
Unknown User
Unknown User17mo ago
Message Not Public
℠
17mo ago
They? What about me? I've had my pods vanish in the blink of an eye, while I was working on them.
Unknown User
Unknown User17mo ago
Message Not Public
℠
17mo ago
You're missing the point, but I'm too far away to hand it to you. There's almost always a way to work around UX issues, but those are just workarounds, not solutions.
Unknown User
Unknown User17mo ago
Message Not Public
Nafi
NafiOP17mo ago
This is similar but also completely different; I would open a separate thread and describe your issue more in depth. The way to implement what you want would be some sort of network-storage system caching a frozen system state, or something similar. There's probably a reason why they haven't done it yet: too complicated, or not possible with the infrastructure. The original issue I raised in this thread can only be avoided by creating and deleting pods on demand via GraphQL.
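The create-on-demand-via-GraphQL workflow mentioned above can be sketched as a transport-free request builder. The endpoint URL, the api_key query parameter, and the mutation shape are assumptions based on the requests quoted earlier in this thread, not an authoritative API reference:

```python
import json

RUNPOD_GRAPHQL = "https://api.runpod.io/graphql"  # assumed endpoint

MUTATION = """
mutation Deploy($input: PodFindAndDeployOnDemandInput!) {
  podFindAndDeployOnDemand(input: $input) { id machineId }
}
"""

def build_request(api_key, deploy_input):
    """Assemble the HTTP request for the create-on-demand workflow.

    Returns (url, headers, body) ready for any HTTP client; kept
    transport-free so it can be tested without hitting the API.
    """
    body = json.dumps({"query": MUTATION,
                       "variables": {"input": deploy_input}})
    headers = {"Content-Type": "application/json"}
    return f"{RUNPOD_GRAPHQL}?api_key={api_key}", headers, body
```

A delete-pod mutation would follow the same pattern, so the automation can tear pods down and recreate them around the 0-GPU state instead of trying to restart them.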
