RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

stop pod

Hello, I am kind of confused. I haven't used RunPod in a while. I want to stop my GPU instances, but if I select the trash button on my pods, it seems to want to delete the volume. I am using a volume and running Secure Cloud GPUs. Isn't there a way to terminate the pod but keep all the data in the volume?

How to transfer between pods?

I'm running Stable Diffusion and would like to transfer my outputs to a different pod to continue working. When using runpodctl to transfer data from one pod to another, what is the command? I have tried runpodctl send "file path name" but this isn't working for me. What file path should I be using? Can someone share an example of the file path structure, please? It was suggested I post the question here. I'm not getting an error; it's just that nothing is happening.
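For reference, a rough sketch of the usual transfer flow, assuming runpodctl is available on both pods (the path below is just a placeholder):

# On the sending pod: this prints a one-time code
runpodctl send /workspace/outputs/image.png
# On the receiving pod: paste the code printed by the sender
runpodctl receive <one-time-code>

Quoting the path matters if it contains spaces, and it should be either an absolute path or relative to the directory you run the command from.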

Network connection

I launched two pods using Secure Cloud, and each pod needs to communicate with the other. But when I checked, they couldn't communicate with each other. How can I connect one pod to the other? (The region is the same.)...

Multi-node training with multiple pods sharing same region.

I am trying multi-node training with multiple pods. When I launch multiple pods in the same region, they share the same public IP and only the port is different. How should I specify the proper port and IP for multi-node training? Does Secure Cloud offer multi-node training?...
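For what it's worth, a minimal two-node sketch using torchrun; the public IP, the ports, and train.py are placeholders you would fill in from each pod's Connect / TCP port mappings panel:

# On the master pod (node rank 0); it binds the rendezvous port locally,
# so master_port here is normally the internal port the process listens on.
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=<PUBLIC_IP> --master_port=<PORT> train.py
# On the second pod (node rank 1), dial the master through the shared public IP
# and whatever external port that internal port is mapped to.
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=<PUBLIC_IP> --master_port=<MAPPED_PORT> train.py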

Dev Accounts Adding Public Key

I'm an admin on a team account. Can dev accounts add public keys to the org?

Does Runpod Support Kubernetes?

My current understanding is that RunPod only supports Docker images, in the sense that you (1) create a template, (2) reference a Docker image, and then RunPod pulls that image and runs it as needed. However, what if I want to run a kubelet and have it join my Kubernetes cluster as a node, and then have my k8s cluster place my own Docker images onto the node?...
Solution:
Hi there - can you advise how you got there? We definitely need to do something about that page, it's incredibly outdated 😅 As for Kubernetes - no, there's no support for that, but if you are willing to rent an entire machine of GPUs with a minimum time commitment of at least a few months, we can offer a bare-metal setup instead...

Is GPU Cloud suitable for deploying LLMs, or only for training?

I'm pretty new to RunPod. I have already built 4 endpoints on Serverless and it's pretty straightforward for me; however, I don't understand whether GPU Cloud is also suitable for pure LLM inference via API for chatbot purposes, or whether it's only for training models and saving weights. The main question is: can I also deploy my LLM for inference on GPU Cloud for production? Where do I find the API that I should make calls to? I ask because I find Serverless very unstable for production, or maybe it's my fault...
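One hedged sketch of serving an LLM from a GPU Cloud pod and calling it over RunPod's HTTP proxy; vLLM, port 8000, and the model name are assumptions, and the port must be exposed as an HTTP port on the pod:

# On the pod: start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000
# From your client: call it through the pod's proxy URL (https://<pod-id>-<port>.proxy.runpod.net)
curl https://<pod-id>-8000.proxy.runpod.net/v1/completions -H "Content-Type: application/json" -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Hello", "max_tokens": 32}'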

Issues with connecting/initializing custom docker image

I've created a custom Docker image for quick OCR training: https://hub.docker.com/repository/docker/jeffchen23/paddleocr-image/general The problem is, everything downloads properly, but then I am unable to connect. When trying to connect, I get Permission denied (publickey); but the permissions are not an issue for any of my other pods. I think it is because the pod fails to initialize correctly, as it constantly spams Start container messages. Can anyone help me pin down this issue? It works on my local machine when I pull it from the web. My local Docker command is as follows: docker run -it --runtime nvidia --shm-size 2g --gpus all -v paddleocr-volume:/PaddleOCR paddleocr-image bash It doesn't look like I have any direct control over the Docker command from RunPod (from what I can tell), so I'm a little lost....
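A common cause of repeating Start container messages is that the image's default command (an interactive bash) exits immediately when RunPod starts it without a TTY, so the container restarts before SSH ever comes up and key-based login fails. A minimal sketch, assuming the template's start command field can be used to override the command:

# Keep the container alive so the pod can finish initializing; replace with your real entrypoint.
bash -c "sleep infinity"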

Error occurred when executing STMFNet VFI: No module named 'cupy'

Running ComfyUI on RunPod and hitting this error. Can someone help provide the steps to install or update CuPy? Much appreciated!
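A quick sketch of installing CuPy from the pod's terminal; the wheel name depends on the CUDA version of the image, so check nvidia-smi first:

pip install cupy-cuda12x    # for CUDA 12.x images
# pip install cupy-cuda11x  # use this instead on CUDA 11.x images
# then restart ComfyUI so the VFI node can import cupy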

My pod starts very slowly

It takes 10 minutes for my port 5000 to go to ready. Any help, please?

Template sharing in a team doesn't work

We have a RunPod team with several people in it, and other users can't access our custom template from the GraphQL API using their own API key, but they can see it in the UI (so the UI and API are not consistent). We get the following error:
Error: {'errors': [{'message': 'Template not found', 'path': ['podFindAndDeployOnDemand'], 'extensions': {'code': 'RUNPOD'}}], 'data': {'podFindAndDeployOnDemand': None}}
...
Solution:
Yes there is: they need to use an API key from the team account, not from their own account. Also, API keys are not scoped, so they will have access to do anything they want with that API key.
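For illustration, the same mutation called with a key generated under the team account; the input fields shown are only a hedged sketch, so check RunPod's GraphQL docs for the exact schema:

curl -s "https://api.runpod.io/graphql?api_key=$TEAM_API_KEY" -H "Content-Type: application/json" -d '{"query": "mutation { podFindAndDeployOnDemand(input: { name: \"my-pod\", templateId: \"<team-template-id>\", gpuTypeId: \"NVIDIA GeForce RTX 3090\", cloudType: SECURE, gpuCount: 1 }) { id desiredStatus } }"}'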

ComfyUI not launching

I've tried running ComfyUI using the RunPod community template (ai-dock/comfyui:latest), and now both buttons in the "Connect" modal point to the "Service" endpoint even though port 8188 should open the web interface. Clicking that link (Connect to HTTP Service on port 8188) opens the service logs, which are stuck with "Waiting for workspace mamba sync..." repeating. I would expect ComfyUI to open on this port.

I can't shutdown my pod ?

There is just no button on the interface to shut down my pod? I can only terminate it... ID: oeyqtrae2ex5tv...
Solution:
You can instead just terminate it completely 🙂 and always spin up new ones. Stopping pushes a pod to an idle state, but it is mainly for persistent storage.
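If the stop button isn't visible in the UI, the CLI can also stop a pod without terminating it; a small sketch (the pod ID is a placeholder):

runpodctl stop pod <pod-id>    # stop (idle) the pod, keeping its volume
runpodctl get pod              # confirm the pod's status afterwards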

LocalAI Deployment

Hello RunPod Team, I'm considering your platform for deploying an AI model and have some questions. My project involves using LocalAI (https://localai.io/ https://github.com/mudler/LocalAI), and it's crucial for the deployed model to support JSON-formatted responses; this is the main reason I chose LocalAI. Could you guide me on how to set up this functionality on your platform? Is there a feature on RunPod that allows the server or the LLM model to automatically shut down or enter a low-resource state if it doesn't receive requests for a certain period, say 15 minutes? This is to optimize costs when the model is not in use....
Solution:
What you are looking for is RunPod Serverless. You can read the documentation, but the TL;DR is that you can use a RunPod official template as a base, then build on it with your own handler.py. You must be able to build a Docker image. Build whatever model you want into the Docker image so it isn't constantly downloaded at runtime...

Jupyter notebook (in Chrome tab) consistently crashing after 20 hours

My JupyterLab Chrome tab has crashed in the middle of 22 hours of training a model. How do I know if it's still training, if it has stopped, or if it is just running without doing anything? This has happened to me 3 times in a row, and this time I would like to know what is happening. The GPU usage is going up and down, which suggests it is training and simply not showing in the notebook, but I would like to make sure.
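One way to make a long run survive a dead browser tab is to launch it from a terminal multiplexer instead of a notebook cell; a rough sketch, assuming tmux is installed and train.py / train.log are placeholders:

tmux new -s train                       # start a named session
python train.py 2>&1 | tee train.log    # run training, mirroring output to a log file
# detach with Ctrl-b then d; later, from any terminal on the pod:
tmux attach -t train                    # reattach to the live session
tail -f train.log                       # or just follow the log
nvidia-smi                              # check the GPU is still busy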

Extremely slow sync speed

Syncing a pod to Dropbox, and the speed is extremely slow: maxing out at 80 kb/s and dropping as low as a few b/s at times.

How can I remove a network volume?

Hi, I'd like to know how I can remove a network volume I created. I tried looking through your docs but couldn't find any info on it; could you please help?
Solution:
You can delete it under the Network Volume section in GPU Cloud.

Can I remove a GPU & resize my storage after I've created a pod?

I'd like to create a pod with two GPUs. However, I won't be needing two forever, so I would like to know if I can remove one after I'm done with it. I would also like to know if I can resize my pod's persistent storage after I've created it (either by shrinking it or adding more).
Solution:
I'm not sure you can resize, but probably your best bet is to have network storage to always store to; then you can always terminate and spin back up as needed 🙂

Need to update Auto1111 to 1.7.0

I want to enable SDXL inpainting, and git pull doesn't seem to work. I understand that there are some other files that need to be altered as well, and sometimes things don't work as expected on RunPod (like updating an extension). Could I have some help in getting this to work?
Solution:
My template is already updated to 1.7.0 😎
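For pods where you do want to update the web UI by hand, a rough sketch, assuming the install lives at /workspace/stable-diffusion-webui:

cd /workspace/stable-diffusion-webui
git fetch --all --tags
git checkout v1.7.0    # check out the 1.7.0 release tag, then restart the web UI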

How can I clean up storage in my network volume?

Hello, I'm using the Stable Diffusion template with a network volume. I noticed that even though I clean up files in Jupyter, space is not freed up in my volume. I suspect the files go to a trash folder but are not removed completely. I searched a lot but could not find the trash folder. Does anybody know where I can find it, or any other way of cleaning up my storage space properly?
Solution:
Alright, using ncdu I found that the path is /workspace/.Trash-0, and then I removed it with rm -rf /workspace/.Trash-0. All good now; storage space is freed up....
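As a worked example of the same cleanup, assuming the volume is mounted at /workspace:

ncdu /workspace                 # browse what is taking up space (may need: apt-get install ncdu)
rm -rf /workspace/.Trash-0      # remove the hidden trash folder Jupyter moves deletions into
df -h /workspace                # confirm the space has been freed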