RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Spot

The pricing of Spot is really tempting. As a student with little money, it seems very cost-effective, but instances get taken up too quickly. What are you all doing with Spot? Sometimes it can only be used for 5 minutes.

Should I be able to launch a pod using nvidia/cuda Docker images?

I am trying to start a pod using nvidia/cuda:12.6.0-cudnn-runtime-ubuntu24.04 (to get both CUDA and cuDNN). I'm not a Docker expert, but should that work? The pod appears to start, but the licensing message keeps looping in the logs, and I can't SSH into the pod. Any ideas? thx....
Solution:
No, you need to set it up yourself and add a sleep infinity, or some program that keeps running in the main thread, to make sure the container isn't stuck in a start/stop loop.
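A minimal sketch of that fix, assuming the pod's "Container Start Command" field accepts a shell command (the image tag is the one from the question):

```shell
# Hypothetical "Container Start Command" for a bare nvidia/cuda image.
# The runtime image has no long-running entrypoint, so the container exits
# immediately and RunPod restarts it in a loop; keeping a foreground
# process alive prevents that.
bash -c "sleep infinity"
```

Once the container stays up, you can open a terminal or SSH in and install whatever else you need.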

Connecting to Pod- Web Terminal Not Starting

Good day. I am using my first pod and I have virtually no Linux skills. I created a pod and it's running fine. When I first created it, I connected via "Start Web Terminal", clicked Connect, and everything worked fine for about 30 minutes. Then it said I was disconnected. I tried to start the web terminal again, but it doesn't start and I cannot connect. I also tried connecting via SSH from my Windows box, but it's asking me for the root password and I have no idea how to determine what that is....

Am I downloading directly from HuggingFace when I download models?

When I download a model from Hugging Face, am I using up their bandwidth, or does RunPod have some cache server that sits between my pod and Hugging Face? I feel bad downloading from Hugging Face; bandwidth isn't free for them and all that.

Not 1:1 port mappings for multinode training

Hi, I am trying to run multinode distributed training across multiple machines, but it isn't working. I think this is because of the port I specify in the torchrun (torch.distributed) command: if I choose the internal port, the other machine sends data to the wrong external port, and if I choose the external port, my master node doesn't listen on the correct internal port. Is this a problem other people have had, and is there a solution? Thanks in advance!...
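For reference, a hedged sketch of the kind of torchrun invocation in question. The flags are standard torchrun options, but the addresses, ports, and script name are placeholders, and whether to pass the internal port or the pod's mapped external port is exactly what the question is asking:

```shell
# Hypothetical two-node launch; NODE_RANK is 0 on the master, 1 on the worker.
# MASTER_ADDR/MASTER_PORT must be reachable from every node, which is where
# RunPod's non-1:1 external-to-internal port mapping causes trouble: the port
# the workers dial is not the port the master process binds to.
torchrun \
  --nnodes=2 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train.py
```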

How to override ollama/ollama image to run a model at startup

Hi, I'm trying to run pods using the ollama template (ollama/ollama) and to override the default template so that the model I want is served when the pod is created. I tried putting ./bin/ollama serve && ollama run llama3.1:8b into the "container start command", but it doesn't work. Is there any way to do this? Thanks!...
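One likely issue, offered as an assumption: ollama serve blocks and never exits, so the && ollama run part never executes. A common workaround is to start the server in the background, give it a moment, then run the model, e.g. as a container start command:

```shell
# Hypothetical override: serve in the background, wait briefly for the API
# to come up, load the model, then keep the container alive on the server
# process so it doesn't restart-loop.
sh -c "ollama serve & sleep 5 && ollama run llama3.1:8b && wait"
```

The 5-second sleep is a crude readiness check; polling the server's API before running the model would be more robust.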

How to send a request to a pod?

Hello, for various reasons I have a Docker image with a Flask API that I want to run on a pod. The problem is that I can't get any request through: locally everything works, but as soon as I put it on a pod it's problematic. First of all, I can't get a public IP address, so I thought I'd go through https://{pod_id}-{internal_port}.proxy.runpod.net/ping, but my requests still don't work. I then tried using nginx to redirect requests to my container's internal port, but I'm having a bit of trouble with nginx too; it doesn't seem to work. Is there something I've misunderstood about how to use RunPod?...
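For what it's worth, a hedged sketch of hitting a pod through the proxy URL mentioned above. POD_ID and the port are placeholders, and this assumes the Flask app binds to 0.0.0.0 (not 127.0.0.1) on a port that is listed as an exposed HTTP port in the pod's configuration:

```shell
# Inside the pod, the app must listen on all interfaces, e.g.:
#   flask run --host 0.0.0.0 --port 5000
# Then, from anywhere outside (POD_ID is a placeholder for the real pod ID):
curl https://$POD_ID-5000.proxy.runpod.net/ping
```

A common pitfall is Flask's default of binding only to localhost, which makes the proxy return errors even though the app works locally.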

Stopped Pod Price

When we use runpod.stop_pod(pod['id']) to stop a pod, and the pod's status becomes "stopped", how is the pod billed in this state? Is the GPU resource fee still charged?

Looking for best A1111 Stable Diffusion template

Anyone know any custom templates for Stable Diffusion A1111 that have ADETAILER and CONTROLNET extensions pre-installed?

No A40s available

Been checking all throughout the day, but no A40s are available. Anyone know why?

Community Cloud spot pod

A spot instance suddenly switched to an on-demand instance automatically. Is this normal? Also, downloading Docker images often fails or becomes slow; the variance in download speed each time a pod is created is too large (it depends on luck). Is this normal, and is there a way to minimize it? I host an average of 20 RTX 4090 instances for about 12 hours a day, automatically removing or adding pods to match demand. I'm curious about situations where Docker image downloads suddenly fail, and about the behavior of spot instances...

Does the pod hardware differ a lot in US?

Hi, we deployed several times in the US region (Secure Cloud) with the RunPod CLI, but the inference performance/speed differs a lot, and even model loading time varies a lot. What's the reason? And how do I know which data center I'm using? It only shows "US". Thanks...

GPU requires reset

Restarted and re-created the pod a couple of times, getting the same error on container start. I assume it keeps grabbing the same bad node. I was able to start the container by switching to a different instance type.

2024-08-26T21:15:45Z error creating container: nvidia-smi: parsing output of line 5: failed to parse ([GPU requires reset]) into int: strconv.Atoi: parsing "": invalid syntax

Pod ID: 2hvpqmtrowunjp...

Problems updating admin passwords on kasm image

I'm trying to change the default admin and kasm passwords on a Kasm instance using the image runpod/kasm-docker:cuda11. Once the pod is running, I log in via SSH and successfully use passwd to change the admin password. Then I successfully change the kasm password using vncpasswd -u kasm_user. But when I log in through Kasm, the login succeeds and then the screen is completely gray and the cursor doesn't appear. Something's broken and I have no clue what it is. ...

No A40s available?

I have my pod on an A40 and a lot of material I've downloaded onto it, but the A40 GPUs have been taken up all night. Is there any way to quickly transfer all my downloaded material to another pod, or will the lack of availability be resolved quickly?...

Kernel dying issue

Starting today, the kernel has suddenly stopped working properly, and it keeps dying or failing to run. I need to quickly check the results, but all my work has come to a halt. I need a quick response regarding this kernel dying issue.

Running out of disk space

I am trying to load a large dataset to train my model. How do I increase the available disk space of my pod?

Interested in multinode training on Runpod

Hi guys, my team is interested in using RunPod for multinode training. We are looking for 24-96 a100s for larger scale model training. Do you guys currently support this?

Continuous Deployment for Pods

Hello, I recently transitioned from using Serverless Endpoints to Pods, but I'm encountering issues with my existing build and deployment workflow. Previously, with Serverless Endpoints, I had a setup in GitHub where I used GitHub Actions workflows to build container images, push them to my registry, and update the template image reference via the GraphQL API. When I updated the template, the endpoint would automatically restart and pull the new image. However, with Pods, this behavior doesn't seem to work the same way. Even after updating the template, the Pod continues running the "old" image and doesn't refresh automatically. Could you suggest a method to trigger a dynamic update or replacement of the Pod? Additionally, are there any other deployment strategies you recommend for my situation? I appreciate your assistance! ...

Production pod suddenly unreachable, how long can I expect this to last for? (Please provide ETA)

Hi, I have an On-Demand Secure Cloud pod that runs the backend for my app. My app is now not working, and the pod has the message in the screenshot. How long can I expect this to last for? Minutes? Hours?