Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!



dstack apply: Resource name should match regex

Getting this error running `dstack apply`: `Resource name should match regex '^[a-z][a-z0-9-]{1,40}$'`. I'm using the exact YAML details from your RunPod dstack tutorial. I've asked your AI channel and it can't help....

volumeEncrypted Broken in API

When creating a new pod using the GraphQL API, requests that specify volumeEncrypted, whether true or false, fail with the response "Internal server error" instead of returning a pod ID. Setting volumeKey alongside volumeEncrypted doesn't fix this either. I am able to create pods without this parameter, so I believe my query is well formed. I assume this is a known issue, but is there any way around it other than keeping an encrypted pod in an exited state?...

RTX 5090 Pod Availability

Hi, when will RTX 5090 pods be available for Runpod customers? Looking forward to it!

Team access to authorized SSH keys

I've just added new team members to my account. I'd like them to be able to add their own public SSH keys to the pods they launch from the team account. They don't see the user settings on their team account, but they can see them on their personal account; those keys don't get added to the team pods, though. I can add their keys myself, but I'm wondering if there is a way for them to do it themselves. "Connect to pods" is listed under the dev role here: https://docs.runpod.io/get-started/manage-accounts?#dev-role...

ComfyUI

Pod not operational. Files lost. I have tried to follow the AI bot, but it just sends me in circles.

Pod overwrites my project code in "Volume Mount Path"

I run my container on a pod where all my code lives in "/workspace/project/". When I set "Volume Mount Path" in my pod template to "/workspace/project/", it overwrites my project code completely. It seems the volume mount happens after the container is up and running, which removes everything in my project folder. Is there a workaround for this? (I'm not using a network volume, btw.)
Solution:
Just keep your files outside of /workspace in the image, then sync or copy them into the volume at startup.
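That copy-at-startup approach can be sketched as a small start script. This is a hypothetical sketch: it assumes the code is baked into the image at /opt/project and the volume is mounted at /workspace/project; adjust both paths to your template.

```python
import pathlib
import shutil

def seed_volume(baked: pathlib.Path, mount: pathlib.Path) -> bool:
    """Copy code baked into the image into the mounted volume, but only
    if the volume is empty, so edits made on the volume survive pod
    restarts. Returns True if a copy happened."""
    mount.mkdir(parents=True, exist_ok=True)
    if any(mount.iterdir()):
        return False  # volume already seeded; keep the user's files
    shutil.copytree(baked, mount, dirs_exist_ok=True)
    return True

# At container start you would call, e.g.:
#   seed_volume(pathlib.Path("/opt/project"), pathlib.Path("/workspace/project"))
```

The key point is that the image's copy of the code lives outside the "Volume Mount Path", so the mount can no longer shadow it.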

UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu.

I created a pod with an A40 GPU, but I am getting the above error when running torch.cuda.is_available().
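A quick way to narrow this down from inside the pod is a small diagnostic, sketched below (it assumes PyTorch is installed in the image; if the host driver isn't exposed to the container, `nvidia-smi` will typically fail too):

```python
def cuda_status() -> str:
    """Report why CUDA may be unavailable. A coarse diagnostic, not a fix."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        return f"ok: {torch.cuda.device_count()} device(s), {torch.cuda.get_device_name(0)}"
    # is_available() returning False on a GPU pod usually means the host
    # driver isn't visible inside the container; check `nvidia-smi` and
    # whether the pod landed on a healthy GPU host.
    return "no CUDA device visible to torch"

print(cuda_status())
```

If the status is "no CUDA device visible to torch" on a GPU pod, the problem is usually the host, not your code; redeploying the pod is often the practical workaround.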

Network volumes temporarily disabled for CPU Pods - timeline query

Hi folks, We noticed that network volumes are temporarily disabled for CPU pods, which is blocking our scholars’ weekend experiments. Can you share an estimated timeline for restoration? Thanks! Iftekhar (MATS)...

Pod memory limits

Pods are based on containers, and containers don't do virtualization-style isolation, only process-based isolation. So which resources are visible from inside the pod? Well, everything:
`cat /proc/meminfo`
Runpod then relies on cgroups to enforce resource limits. So far, so good....
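You can see the difference directly: /proc/meminfo reports the host's totals, while the cgroup files show the limit actually enforced on the pod. A small sketch, assuming cgroup v2 mounted at the usual /sys/fs/cgroup path:

```python
def parse_meminfo_total_kb(meminfo_text: str) -> int:
    """Extract MemTotal (in kB) from /proc/meminfo contents.
    Inside a pod this is the HOST's RAM, not the pod's limit."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

def cgroup_memory_limit_bytes(path: str = "/sys/fs/cgroup/memory.max"):
    """Read the cgroup v2 memory limit actually enforced on the pod.
    Returns None when the limit is 'max' (unlimited)."""
    raw = open(path).read().strip()
    return None if raw == "max" else int(raw)
```

Comparing `parse_meminfo_total_kb(open("/proc/meminfo").read())` with `cgroup_memory_limit_bytes()` makes the mismatch between what is visible and what is enforced explicit.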

AI Toolkit Lora Training torch.OutOfMemoryError

I've tried different pods for Flux LoRA training on AI Toolkit and couldn't get any luck at all. I even used 2 x RTX 4090 (24 vCPU, 62 GB RAM) and it was still reporting torch.OutOfMemoryError. How could that be??? The RTX 6000 Ada (48 GB VRAM, 188 GB RAM, 24 vCPU) could start the training process, but it took more than 10 minutes (!!) to generate a sample image, and the result was practically not visible (denoise <0.2). How's that?...

Where is My Network Volume Mounted?

I'm using some large tensor files that don't fit in the standard 20GB of a pod. I created a network volume and a pod for it, but when I ssh to the pod, I don't see the volume mounted anywhere. How do I access it?
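One way to check from inside the pod is to parse /proc/mounts. Runpod network volumes typically appear under /workspace, but that default depends on the template, so treat the keyword below as an assumption and widen the search if nothing matches:

```python
def find_mounts(mounts_text: str, keyword: str = "workspace") -> list:
    """Return mount points from /proc/mounts contents whose path contains
    `keyword`. Each /proc/mounts line is:
    <device> <mountpoint> <fstype> <options> <dump> <pass>"""
    hits = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and keyword in parts[1]:
            hits.append(parts[1])
    return hits

# From inside the pod:
#   find_mounts(open("/proc/mounts").read())
```

Running `df -h` in the pod's terminal gives the same answer without any code.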

Lost Workspace running (official) Stable Diffusion Template

Hi there. I'm a newbie on Runpod (and cloud GPUs in general) and not 100% aware of all the logic behind the scenes. For image generation I deployed a (community) pod. I modified it so there was enough storage for all the models I needed to work with. I installed those using Jupyter and the terminal. I generated lots of images, but at one point the working day was over and I had to stop the pod. I read that all files within the workspace remain connected to my account. When I started again today I couldn't use the pod (which I expected), but when I deployed a new pod from the same template, the /workspace folder was a new one (clean slate). Why? Does every template create/override the workspace?...

omp.h unable to access all processors.

Hi! I'm running a pod with 8 vCPUs, but it seems like omp.h can only access 1 of them, while plain threads can access all of them. For example,...
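OpenMP honors the process's CPU affinity mask and the OMP_NUM_THREADS variable, while raw thread creation ignores both, which can explain exactly this discrepancy. A quick check of what each viewpoint reports (Python used here for brevity; os.sched_getaffinity is Linux-only, hence the fallback):

```python
import os

def visible_cpus() -> dict:
    """Compare CPU counts from different viewpoints. If 'affinity' or
    OMP_NUM_THREADS is 1 while 'cpu_count' is 8, OpenMP will run a single
    thread even though plain threads can still be spawned freely."""
    affinity = (len(os.sched_getaffinity(0))
                if hasattr(os, "sched_getaffinity") else os.cpu_count())
    return {
        "cpu_count": os.cpu_count(),                   # all CPUs the OS reports
        "affinity": affinity,                          # CPUs this process may run on
        "omp_env": os.environ.get("OMP_NUM_THREADS"),  # explicit OpenMP cap, if set
    }

print(visible_cpus())
```

If the affinity mask is the culprit, exporting OMP_NUM_THREADS=8 before launching the program is a common workaround.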

How to queue requests to vLLM pods?

Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users. I was previously using vLLM serverless, but switched to dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive. Currently I have three pods spun up and a Next.js API which uses the Vercel AI SDK to call one of the three pods (I just choose one of the three randomly). This works okay as a makeshift load balancer, but sometimes all the pods are busy and I fail with:...
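Random choice has no failover: when the chosen pod is saturated, the request simply fails. A minimal improvement is round-robin with a retry pass over the remaining pods, sketched below (the URLs and the `send` callable are placeholders; a production setup would more likely put a reverse proxy with least-connections balancing in front of the pods):

```python
import itertools

class PodBalancer:
    """Toy round-robin over pod base URLs with one failover pass.
    A sketch, not a production load balancer."""

    def __init__(self, urls):
        self._urls = list(urls)
        self._cycle = itertools.cycle(self._urls)

    def next_url(self) -> str:
        return next(self._cycle)

    def call_with_failover(self, send):
        """Try each pod at most once; `send(url)` should raise when a pod
        is busy. Returns the first successful response."""
        last_err = None
        for _ in range(len(self._urls)):
            url = self.next_url()
            try:
                return send(url)
            except Exception as err:
                last_err = err
        raise last_err
```

Note also that vLLM queues concurrent requests server-side, so letting requests wait with a longer client timeout (and tuning --max-num-seqs) may help more than spreading load.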

Pod stuck trying to start custom docker image

Hi all, I'm having trouble using a custom template. I'm trying to use a Docker image provided by this simulation project https://github.com/Genesis-Embodied-AI/Genesis/blob/main/docker/Dockerfile and I built and uploaded the image to https://hub.docker.com/repository/docker/nathankau/genesis-docker/general However, when I create a template and use it to start a pod, it seems like the pod gets stuck trying to start the container. I have no container logs, and the system logs repeat the following....
Solution:
Yay I made it work. I unset both ENTRYPOINT and CMD in my Dockerfile so that the default nvidia_entrypoint.sh is used. Then in the runpod template UI, I set the container start command to sleep infinity.

My pod is taking forever to download the image

1 x RTX 2000 Ada, 6 vCPU, 31 GB RAM. Image size is around 18 GB...
Solution:
@jojje The Runpod team suggested keeping images in a Docker registry instead of on GitHub.

Pods stuck on “Waiting for logs”

Hi, not only one of my existing pods (cpu5c-2-4) but also 2 new pods I'm spinning up are stuck on "waiting for logs…". It's been like this for many hours; I've tried restarting them and also creating new pods, but to no avail. All of the pods are in different locations. Any help would be appreciated, as this is extremely time sensitive.

Jupyter Lab was running on my pod but got disconnected

I am having issues running Jupyter Lab on my pod. It was running before but just got disconnected, and now I am seeing this message: "not ready, make sure your service is running".

Container Registry Auth not working for private docker images

Hi guys, I created a key on Docker Hub and added it to the Runpod settings under "Container Registry Auth". I:
- chose a random credential name,
- used my Docker Hub username as the username of the credential,
- and used the generated key as the password. ...
Solution:
Thank you a lot for the help, it works now! Probably because I accidentally included a blank space when pasting the credential into Runpod.

Model Maximum Context Length Error

Hi there, I run an AI chat site (https://www.hammerai.com). I was previously using vLLM serverless, but switched to dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest). Here is my configuration:
--host 0.0.0.0 --port 8000 --model LoneStriker/Fimbulvetr-11B-v2-AWQ --enforce-eager --gpu-memory-utilization 0.95 --api-key foo --max-model-len 4096 --max-seq-len-to-capture 4096 --trust-remote-code --chat-template "{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }} {% for message in messages %} {% if message['role'] == 'user' %} ### Instruction: {{ message['content']|trim -}} {% if not loop.last %} {% endif %} {% elif message['role'] == 'assistant' %} ### Response: {{ message['content']|trim -}} {% if not loop.last %} {% endif %} {% elif message['role'] == 'user_context' %} ### Input: {{ message['content']|trim -}} {% if not loop.last %} {% endif %} {% endif %} {% endfor %} {% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} ### Response: {% endif %}"
I then call it with:...