RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning, and GPUs!

Possibility of Pausing a Pod Created with Network Storage

Hello, I am a new user of RunPod. I am currently using a pod created with a network volume. I noticed that regular pods have a pause function, but I couldn't find this feature on the pod created with network storage. Is pausing available for such pods and, if so, how can I use it?

Docker run in interactive mode

Hi, I want to be able to SSH into my pod and run bash commands. If I provide no entry command in my Dockerfile, I am unable to connect to my pod via SSH. I also don't see any option to edit the docker run command to include the interactive flag. Any help is appreciated...
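A common workaround (a general Docker pattern, not RunPod-specific documentation) is to give the image a long-running foreground command, so the container stays up and you can SSH in; the base image below is just an example:

```dockerfile
# Hypothetical base image; any CUDA-enabled image works the same way.
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

# Keep PID 1 alive indefinitely so the container doesn't exit immediately,
# leaving you free to SSH in and run bash commands interactively.
CMD ["bash", "-c", "sleep infinity"]
```

This avoids needing the `-it` flags entirely: the container idles in the foreground, and interactive work happens over SSH.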

Made an optimized SimplerTuner runpod: Failed to save template: Public templates cannot have Registr

Hi! I've spent a couple of days creating and testing a Docker flow for RunPod, and I've run the pod privately multiple times with no problems. There is no registry information in the Dockerfile, but I keep encountering this error, with absolutely no indication of its origin or how to fix it. Any help would be greatly appreciated, as we have a community eager to train Flux1 on RunPod....

URGENT! Network Connection issues

Hi, it looks like there is a general issue across all pods; all of them are suffering network connection issues. Can someone look into this?

Looking for suggestion to achieve Faster SDXL Outputs

Hi, I am currently trying to generate a large number of images (200+) every session via the Automatic1111/Forge UI with an SDXL model, and I was wondering how I can generate them faster. I tried using an RTX 3090 and it's about 1.5-2 it/s, which is pretty slow in the long run. Is there a faster alternative or workflow? Please suggest a workflow and GPU that can generate a large number of images swiftly...

Official Template not running correct version of CUDA

Hello! I'm trying to run a pod using the official templates runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04 and runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04. Unless I completely misunderstood the notation, the image should run with CUDA 11.8.0, right? I've tried with Secure Cloud RTX 4090 and Secure Cloud RTX Ada 6000...
Solution:
@InnerSun nvidia-smi shows the maximum CUDA version supported by the host driver, not the CUDA toolkit version installed inside the container

I can't run the pod with container start command

I tried the start command bash -c "cd /workspace/ && sh run.sh", but it does not work; it seems to run repeatedly. However, after I connect to the pod and run "cd /workspace && sh run.sh" manually, it works well ...
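The repeated runs suggest the start command exits when run.sh finishes, so the container's main process exits and the pod restarts it. A common fix is appending a keep-alive after the script, e.g. `bash -c "cd /workspace/ && sh run.sh && sleep infinity"`. The sandbox sketch below (with a stand-in run.sh and hypothetical /tmp paths) demonstrates the behavior:

```shell
# Create a throwaway workspace with a stand-in run.sh.
mkdir -p /tmp/workspace
printf 'echo "run.sh done"\n' > /tmp/workspace/run.sh

# Without a trailing keep-alive, this command returns as soon as run.sh does,
# which in a pod means PID 1 exits and the container is restarted in a loop.
bash -c 'cd /tmp/workspace && sh run.sh && echo "keep-alive would start here"'
```

In the actual pod, `sleep infinity` in place of the final echo keeps the container running after run.sh completes.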

Volume / Storage issues

I am attempting to install ComfyUI on a few different machines. Before ComfyUI and the Flux dev models are done installing, I get an out-of-volume error and cannot run the pod. Could this just be a bad string of luck with a few broken pods, or am I not setting something up correctly?...

I'm trying to start a cpu pod using the graphql endpoint and specifying an image

Hey, I've successfully run a CPU pod creation using the GraphQL endpoint; however, it does not seem to follow the same structure as the GPU creation. What I'm trying, which is working, is: ``` mutation { deployCpuPod( input: { ...

Please help urgent

Hey, I have been struggling for hours trying to set up this pod to train a LoRA for Flux..... Can someone explain why or how the container is full? I know that I need to install some models, but shouldn't they go onto the volume, which I made 200GB to avoid storage issues?!...

Running custom docker images (used in Serverless) to use in Pods

Hey everyone - here is my current situation: - I have created a custom Docker image for my serverless endpoint in RunPod - My local machine is a MacBook, so I am unable to execute the NVIDIA-dependent ComfyUI installation I have in the image, and I'm trying to see if I can run this on a RunPod pod instead - The use case is that I'm trying different workflows in ComfyUI that I want to test out in a pod before I deploy to the serverless endpoint...

cloud sync fail

Syncing to Dropbox failed; it always shows "Something went wrong!" Some details:...

Can not start docker container

I use a custom Docker image. Here is the system log: 2024-08-15T04:53:11Z start container. Here is the container log: 2024-08-15T04:52:55.667454224Z /usr/local/bin/docker-entrypoint.sh: line 414: exec: docker: not found. SSH to this pod responds "Container not running."...

libcudnn.so.9: cannot open shared object file: No such file or directory

Getting this error when using the CUDAExecutionProvider with onnxruntime-gpu. I'm building the container for CUDA 12 and installing onnxruntime-gpu 1.18 directly from Microsoft's package index to fully support CUDA 12. nvidia-smi works inside the container; not sure why I'm getting the issue.
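onnxruntime-gpu 1.18 dlopens libcudnn.so.9 at inference time, and CUDA 12 base images often ship only cuDNN 8 (or none), which is why nvidia-smi works but the provider fails to load. One sketch of a fix, assuming a pip-based image (package and path details may vary by image), is to install NVIDIA's cuDNN 9 wheel and register its lib directory with the loader:

```dockerfile
# Sketch for a CUDA 12 image: the nvidia-cudnn-cu12 wheel bundles libcudnn.so.9.
RUN pip install nvidia-cudnn-cu12

# The site-packages location varies by image, so resolve it at build time and
# register the wheel's lib directory with the dynamic loader.
RUN echo "$(python -c 'import os, nvidia.cudnn; print(os.path.dirname(nvidia.cudnn.__file__))')/lib" \
      > /etc/ld.so.conf.d/cudnn.conf && ldconfig
```

Exporting the same directory via LD_LIBRARY_PATH at runtime is an equivalent alternative to the ldconfig step.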

Can't access pod

It's been down for over 16 hours; it would be great if this could be dealt with ASAP. Stuck on "Waiting for logs" if I try to turn it on.

Multiple containers on a single GPU instance?

Are there any plans to allow multiple Docker containers on a single GPU instance? I have workloads which do not utilize the full resources of a single GPU, and I'd like to organize the workloads using multiple containers sharing a single GPU. I don't believe there is a way to do this currently; the closest is to run multiple processes inside a single Docker container, but that is a Docker anti-pattern and not great for workload organization.

Connecting Current Pod to Network Volume

Hello, is there a way to connect a current pod to a network volume, or would I have to transfer all the data into a network volume and set up a new pod? If that is the case, what's the fastest way to do that? (I have a large dataset I would have to move.)
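If attaching a volume to an existing pod isn't supported, one general-purpose way to move a large dataset between pods is rsync over the destination pod's SSH port. This is a sketch, not a RunPod-specific procedure; the host, port, and paths below are placeholders:

```shell
# Placeholders throughout: substitute the new pod's SSH host/port and your paths.
# -a preserves permissions and timestamps, -z compresses over the wire,
# --partial/--progress let a large transfer resume after interruptions.
rsync -az --partial --progress -e "ssh -p <port>" \
      /workspace/dataset/ root@<pod-host>:/workspace/dataset/
```

Running the transfer inside a tmux/screen session keeps it alive if the terminal disconnects mid-copy.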

Weird error when deploy lorax inference server

Hi guys, I'm trying to deploy the LoRAX inference server on a RunPod A100 PCIe pod. I got a very weird error, attached in the image. Why is the error weird? Because it only happens on some pods but not all. Do you know of any reason for this?

Passwordless SSH doesn’t work half the time.

I'm using pods in the Secure Cloud. Half the time, I can't SSH in and it asks for a password. My key is in authorized_keys and all the settings for the SSH server are right, but it won't accept my key; debug logging gives no reason why. The template is a standard PyTorch 2.2 template from RunPod. The only workaround is to set a root password, allow password login for SSH, and enter my password every time, which is very annoying. It happens most of the time, and then every now and then it doesn't and I can SSH in fine without a password. Nothing is different on my end: same template, same scripts doing the login. ...
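Intermittent key rejection is often a permissions problem on the pod side: with StrictModes (on by default), OpenSSH silently ignores authorized_keys if it or ~/.ssh is group/world-accessible. This is general OpenSSH behavior, not RunPod-specific, but it is worth checking inside the pod:

```shell
# sshd's StrictModes rejects keys when these files are too permissive.
# Ensure the directory and key file exist, then tighten their modes.
mkdir -p ~/.ssh && touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Verify the modes; sshd expects 700 for the directory, 600 for the file.
stat -c '%a' ~/.ssh
stat -c '%a' ~/.ssh/authorized_keys
```

If the template re-provisions ~/.ssh on each start, the modes can differ from boot to boot, which would explain why key auth only fails some of the time.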