Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Error: Unauthorized

Unable to create an available Pod via
$ runpodctl create pod --gpuType "1x NVIDIA A40" --imageName MedicineMan
Error: Unauthorized
$ runpodctl create pod --gpuType "1x NVIDIA A40" --imageName MedicineMan
Error: Unauthorized
...
Solution:
I discovered that the runpodctl interface is basically deprecated or not full-featured

Configure pod to auto stop/auto delete once the container's main process exits?

Hi everyone! I've been trying out Fly.io for GPU stuff a bit, and I absolutely love the workflow of being able to build and push a container of whatever I'm working on, and have it automatically de-provision the container once my training process exits and finishes uploading artifacts. This is really nice as it lets me easily run as many tasks as I want on separate GPUs without having to worry about manually stopping them. However, I much prefer Runpod as a platform (and much prefer runpod's pricing as well) and I want to replicate the same workflow here. Is there a good way to do that? I did some testing and it appears that if my main process exits the pod just restarts....
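One possible approach, sketched here under assumptions rather than as a confirmed Runpod feature: wrap the training command in a small Python launcher that stops the pod once the command exits. The sketch assumes `runpodctl` is installed and authorized inside the pod and that the `RUNPOD_POD_ID` environment variable is set there; `run_then_stop` is a hypothetical helper name, not part of any Runpod tooling.

```python
import os
import subprocess
import sys


def run_then_stop(train_cmd, stop_cmd=None, env=None):
    """Run train_cmd to completion, then run a stop command for this pod.

    Returns the exit code of the training process.
    """
    env = os.environ if env is None else env
    result = subprocess.run(train_cmd)

    if stop_cmd is None:
        # Assumption: the pod exposes its own ID as RUNPOD_POD_ID.
        pod_id = env.get("RUNPOD_POD_ID")
        if not pod_id:
            print("RUNPOD_POD_ID is not set; skipping stop", file=sys.stderr)
            return result.returncode
        # Assumption: runpodctl is available and authorized in the container.
        stop_cmd = ["runpodctl", "stop", "pod", pod_id]

    # Runs whether training succeeded or failed, so a crash does not leave
    # the pod running.
    subprocess.run(stop_cmd)
    return result.returncode
```

Calling `run_then_stop(["python", "train.py"])` as the container's start command would be the intended use, where `train.py` stands in for your own script. Swapping `stop` for `remove` in the command would delete the pod instead of stopping it, so make sure artifacts are uploaded first.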

unable to ssh to serverless pod

My key has not changed and I am unable to ssh to serverless pods using the command given. Port 22 is kept open for TCP connections

Network Volumes on CPU pods

When will Network Volumes on CPU pods be enabled?

Import SD3.5 from HF to Runpod

Hi, any option to import this model using my token? Or any template with sd3.5 already installed? Thanks.

POD connectivity/bandwidth extremely slow

I am running Fooocus on a 4090 community pod. Images get generated pretty fast (I can see that in the terminal), but the Fooocus UI lags and keeps loading and waiting on "Sampling step x/30", etc. It seems like a bandwidth issue to me, because the Fooocus UI loads very slowly overall, even when I first open it. I am located in Asia, and I tried pods in the Europe and USA regions, with almost the same issue....

I'm getting a 502 Bad Gateway Cloudflare error on my pod's HTTP endpoint. Will anyone fix this issue?

Since yesterday, any pod that I created has gone offline. I've deleted and created another GPU pod. I cannot make HTTP requests to my llama instance.

Runpod is running very slow

What is happening with Runpod and ComfyUI? It's been 5 minutes and it's still loading, and JupyterLab is running slow. Installing dependencies takes very long (5 to 30 minutes). When installing nodes in ComfyUI, it gives a failed error, but after restarting the pod they install on launch (very slowly)....

Pod not "opening"

I tried starting a new pod, and when I go to "open" the pod in the UI (hitting the down button within the pod), all I see is this black page. It refers to an error: "Application error: a client-side exception has occurred (see the browser console for more information)."

Cannot connect to A100 PCIe secure pod with VSCode

I can successfully SSH into the A100 PCIe secure pod using the terminal. However, when I try to connect using the "Remote-SSH" extension in VSCode, I encounter the following error: "Could not establish connection to 'runpod': XHR failed." This issue occurs only with this specific pod; other pods work without any problems. I also suspect there is something wrong with its P2P connection, e.g. GPU 0 cannot access GPU 1....

The "connect to http service" button disappeared since yesterday. How do I connect to my pod now?

I really can't find a solution for this and I am going crazy about it. Everything worked until yesterday, and out of nowhere the button disappeared. I don't recall changing anything to make this happen. Can someone help me with that?

SSH over exposed TCP not working

I am being prompted for a password when using "SSH over exposed TCP: (Supports SCP & SFTP)". I created a key pair using root@ip. The server side has the key both in .ssh/authorized_keys and in the SSH Public Key UI. The Basic SSH Terminal works, however. ...

Faulty node?

Since this morning, I have encountered this error multiple times: 'CUDA error: uncorrectable ECC error encountered'. Every time, after terminating the pod and starting a new one, the problem went away. All incidents were on US-GA-2, H100-PCIe...

Specify runpodctl location

Hey! Can you specify location when creating a pod with runpodctl or curl? Can't find flag in docs. Ex: "runpodctl create pod --name 'test' --gpuType 'NVIDIA GeForce RTX 4090' --containerDiskSize 100 --volumeSize 50 --secureCloud --ports '22/tcp,9200/tcp,9946/tcp,70000/tcp,70001/tcp,70002/tcp,70003/tcp,70004/tcp,70005/tcp' --imageName 'runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04'" + location...

GPU memory already in use when pod starts

I have seen this happen multiple times across different GPU types and regions. When launching a pod, some of the GPU memory is already in use, and any attempt to make full use of the GPU's memory results in errors/crashes. For example, I have been trying to deploy 2xA100 GPUs in the Romania data center for the past hour. Each time I launch a pod, one of the GPUs already shows 40% of the memory in use, and attempting to utilize the GPU results in a crash. This is a screenshot of my GPU usage immediately after launching the pod, before any model had been loaded (or even downloaded). Restarting the pod and deleting/recreating the pod does not resolve the issue. If I am paying to rent a GPU, I expect to be able to make full use of it and not have half of the memory locked up for no apparent reason. Oh, and I tried running koboldcpp in the CA region, which doesn't have this problem, but for some reason it is unable to create a Cloudflare URL (this only happens in the CA region; I have seen it for 2+ months now)....
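To catch this at startup instead of mid-run, a pod could query `nvidia-smi` before loading any model. The following is a minimal sketch, not an official Runpod check: the 5% threshold and the helper names (`parse_gpu_memory`, `busy_gpus`, `check_gpus`) are illustrative assumptions, and the parser assumes every field comes back as an integer number of MiB.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" line per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse the query output into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used), int(total)))
    return rows


def busy_gpus(rows, threshold=0.05):
    """Return indices of GPUs whose used/total memory exceeds threshold."""
    return [idx for idx, used, total in rows if total and used / total > threshold]


def check_gpus(threshold=0.05):
    """Run nvidia-smi and report GPUs that already hold memory."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return busy_gpus(parse_gpu_memory(out), threshold)
```

If `check_gpus()` returns a non-empty list right after the pod starts, the start script could exit early so the pod can be recreated before any model download begins.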

Pods not even starting due to low memory

I'm in US-OR-1 and trying to start pods with 0 GPU to do some config work. They don't start normally, the web UI embedded console goes back and forth between "Waiting for logs" and acting like the pod is healthy. When I try to connect, Jupyter server is offline, ComfyUI is offline (which I don't care about since I started with 0 GPU) and the "Start Web Terminal" button doesn't do anything; I never get the "Connect to Web Terminal" button to enable. Container log: 2024-12-03T21:46:51Z create container valyriantech/comfyui-with-flux:latest 2024-12-03T21:46:52Z latest Pulling from valyriantech/comfyui-with-flux...

Do Runpod Pods run in privileged mode?

I'm trying to run a pod that requires privileged mode set to true. Are pods privileged by default?
Solution:
nope

RunPod disconnecting/resetting during model training

Hi everyone, I've encountered an issue several times over the past week and have yet to successfully complete a model because of it. I've triple-checked to ensure I'm using an On-Demand instance. However, after a few hours of running my model, the web server or Jupyter notebook loses its connection. When I reconnect, the session appears to have reset:...

ERR_NVGPUCTRPERM when profiling CUDA kernels

I'm trying to profile CUDA kernels with NCU and I encountered this error, apparently due to a lack of permission: "ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM". The linked website says that when profiling kernels in containers (which is the case here with pods, right?), one has to launch the container with --cap-add=SYS_ADMIN, but I'm not sure this is possible with Runpod pods. Have you found a workaround? Surely there is a way to profile kernels on container GPUs?...

SSH missing password

Hi, I am having trouble setting up my SSH connection; I always get asked for a password.
Solution:
Found the issue: when creating the key, the email has to be root@<ip>.
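As a rough illustration of that solution, and assuming "email" here means the comment field that `ssh-keygen` sets with `-C`, the key could be generated like this. The helper name `keygen_command`, the ed25519 key type, the key path, and the example address are all placeholders, not values from the thread.

```python
import subprocess


def keygen_command(pod_ip, key_path):
    """Build an ssh-keygen command whose comment is root@<pod_ip>."""
    return [
        "ssh-keygen",
        "-t", "ed25519",          # key type (an assumption; the thread names none)
        "-C", f"root@{pod_ip}",   # comment field, per the poster's solution
        "-f", key_path,           # where to write the private key
    ]


def generate_key(pod_ip, key_path):
    """Run ssh-keygen interactively; it will prompt for a passphrase."""
    subprocess.run(keygen_command(pod_ip, key_path), check=True)
```

The resulting `.pub` file is what would be pasted into the SSH Public Key setting before the pod is started.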