RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Production pod suddenly unreachable, how long can I expect this to last for? (Please provide ETA)

Hi, I have an On-Demand Secure Cloud pod that runs the backend for my app. My app is now not working, and the pod has the message in the screenshot. How long can I expect this to last for? Minutes? Hours?

Test Support Thread

Test Support Description

Maximum number of A40s that can run at one time

I'm looking to run as many A40s as possible to finish a large-scale inference/LLM generation job. How many could I run at one time? 40, 80, 100?

Cannot SSH over exposed TCP (multiple pods, tested from different local machine)

Hi @here, I cannot SSH over exposed TCP, but basic SSH works. I suspected my Docker image at first, but I have the same issue with multiple Docker images. I tested it from multiple local machines. This is the verbose error message: debug1: Reading configuration data ~/.ssh/config...

Does RunPod support registries other than Docker Hub?

Wondering if we can use AWS or GitHub as an alternative.

Persistent container disk

Is there a way to make the container disk mounted at / persistent for a pod instead of the additional drive at /workspace or whatever?

How to avoid Cloudflare timeouts on pods?

I saw a previous post mentioning using the public IP, but it doesn't seem to work for me. I'm using RunPod to host a vLLM server (the serverless endpoint doesn't work for me). I'm running batch workloads and those time out (Cloudflare)...

Environment variables in direct SSH

Is there a way to access environment variables defined in the web app in an SSH connection over exposed TCP port?

How does runpod handle pod terminating

It is very likely that RunPod simply sends a SIGKILL to the main container process. This is really annoying when you are trying to handle termination. Could you please provide information on how your orchestration system handles pod termination and how I can get the OS signal?

KoboldCpp - Official Template broken

I've tried to launch the KoboldCpp template a few times, but I keep hitting errors. The model I want to use downloads in two parts (split with commas in the launch arguments). The downloads finish and append, but the logs show 'rm: cannot remove './mmproj.gguf': No such file or directory' right before it finishes. The container then restarts and the downloads begin again from square one. These same models worked last week. I have saved the entire logs if needed.

Secret not showing up in the pod `env` output

Hi, I added some secrets and set them as environment variables for my pod, but I can't see them when I run env in my pod. I'm using {{ RUNPOD_SECRET_secret_name }} as the environment variable value...

transfer data of a stopped pod to a new one

Hey, I finished my training on a big pod and I want to share all the data with another pod using the storage (network volume). How can I do that?

pod error

2024-08-19T23:00:50Z create pod network
2024-08-19T23:00:51Z create container runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
2024-08-19T23:00:52Z 2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 Pulling from runpod/pytorch
2024-08-19T23:00:52Z Digest: sha256:75bf115d87ee3813f8026fed3e11bae3bf68bfd789a9566878735245b723ef8b
2024-08-19T23:00:52Z Status: Image is up to date for runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04...

Pod Down for hrs

Any idea how long this will take to resolve? I cannot access my pod.

Can pods be shut down from inside the pod itself?

Wondering if the pod can accept a shutdown command to stop billing

Does RunPod provide environment isolation?

Hi, if we want to have two isolated environments, dev and prod, what can I do in Runpod? Thanks,...

error pulling image (US community server)

When creating a new community pod based in the US I get this message: error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers). What is the problem here?...

Resuming an on demand pod via sdk

Hello, how can I resume a spot pod through the python sdk? I am using the resume_pod function but I am not able to

Possibility of Pausing a Pod Created with Network Storage

Hello, I am a new user of RunPod. Currently, I am using a pod created through network storage. I noticed that regular pods have a pause function, but I couldn't find this feature in the pod created with network storage. I would like to know if this feature is available for such pods and, if so, how I can use it.