Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Multi GPU problem

Hi, how can I evenly distribute workers across multiple GPUs? I am trying to bring up a Stable Diffusion model, but I am getting an out-of-memory error because gunicorn is trying to run all the workers on one GPU. How can I solve this, given that I need to run all the workers on the same port? Alternatively, how can I configure request proxying inside the pod?
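One common way to approach this is to pin each gunicorn worker to its own GPU by setting `CUDA_VISIBLE_DEVICES` in gunicorn's `post_fork` hook, so all workers still share a single port. The sketch below is a `gunicorn.conf.py` under several assumptions: the `GPU_COUNT` environment variable and the `gpu_for_worker` helper are hypothetical names, port 8000 is a placeholder, and the model is assumed to be loaded inside the worker (no `preload_app`), so that CUDA is first initialized after the variable is set.

```python
# gunicorn.conf.py -- a sketch under the assumptions stated above
import os

# ASSUMPTION: GPU_COUNT is a hypothetical variable you set yourself;
# it should match the number of GPUs attached to the pod.
GPU_COUNT = int(os.environ.get("GPU_COUNT", "2"))

bind = "0.0.0.0:8000"  # all workers share one listening port (placeholder)
workers = GPU_COUNT    # one worker per GPU
preload_app = False    # import the app (and load the model) inside each worker


def gpu_for_worker(worker_age, gpu_count):
    """Map gunicorn's incrementing worker counter onto a GPU index."""
    return worker_age % gpu_count


def post_fork(server, worker):
    # gunicorn calls this in the child process right after fork, before the
    # application is loaded there, so CUDA only sees the assigned GPU.
    gpu_index = gpu_for_worker(worker.age, GPU_COUNT)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
```

Start it with `gunicorn -c gunicorn.conf.py yourmodule:app`, where `yourmodule:app` stands in for your own application. Note that `worker.age` keeps increasing when a worker is restarted, so after a crash two workers can end up on the same GPU; a more careful mapping would track which GPU indices are free.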

Unable to create pod with GraphQL

Hi, I tried to use the following command to create a pod for testing. ```bash curl --request POST \ --url https://api.runpod.io/graphql \ --header "Authorization: Bearer YOUR_API_KEY" ...

Creating a Pod with dockerArgs and a docker image from a registry that requires auth

I'm trying to create a pod from a template or from a docker image from a docker registry with authentication. I'm using the method podFindAndDeployOnDemand. If I specify a templateId, the pod starts but it seems that the dockerArgs I specify in the API call is ignored and the CMD in the Dockerfile is run instead. ...
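For anyone trying to reproduce this, here is a minimal sketch of how a `podFindAndDeployOnDemand` request that passes `dockerArgs` together with an `imageName` (no `templateId`) could be built. The endpoint and `Bearer` header are the ones quoted in the curl thread above; the input field names are assumptions based on Runpod's public GraphQL examples, the pod name and disk sizes are placeholders, and `build_deploy_request` is a hypothetical helper. It does not answer whether `dockerArgs` overrides the image's CMD when a `templateId` is set, and it does not cover registry authentication.

```python
import json
import urllib.request

# Endpoint as quoted in the curl command in the thread above.
GRAPHQL_URL = "https://api.runpod.io/graphql"


def build_deploy_request(api_key, image_name, docker_args, gpu_type_id):
    """Build (but do not send) a podFindAndDeployOnDemand request."""
    # json.dumps yields double-quoted, escaped strings, which are also valid
    # GraphQL string literals for ordinary ASCII input.
    # ASSUMPTION: these input field names follow Runpod's public GraphQL
    # examples; verify them against the current schema before relying on them.
    mutation = f"""
    mutation {{
      podFindAndDeployOnDemand(input: {{
        cloudType: ALL
        gpuCount: 1
        volumeInGb: 20
        containerDiskInGb: 20
        gpuTypeId: {json.dumps(gpu_type_id)}
        name: "docker-args-test"
        imageName: {json.dumps(image_name)}
        dockerArgs: {json.dumps(docker_args)}
      }}) {{
        id
        imageName
      }}
    }}
    """
    body = json.dumps({"query": mutation}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; printing `request.data` first is a cheap way to compare the exact payload against what the API reports back.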

Indicate region in deployment console menu

Hi, can you please add an option to see which region the availability is for in the deployment console?

runpodctl create pod --communityCloud --gpuType 'A4500' --cost 0.19 is not working

I am trying to deploy an RTX A4500 pod on Community Cloud. On the web page I can see available machines at that price (0.19 USD/h), but the command returns: " runpodctl create pod --imageName 'pod-1' --communityCloud --gpuType 'NVIDIA GeForce RTX A4500' --templateId '...' --cost 0.19 Error: The current minimum price for this type of instance is 0.5. " Why?...

50/50 success running a standard vLLM template

When I start vllm/vllm-openai:latest on 2xA100 or 4xA40, it only comes up successfully 1 in 2 or 1 in 3 times. I haven't noticed any logic behind it; it just fails sometimes. Here are the parameters I use for 2xA100, for instance: --host 0.0.0.0 --port 8000 --model meta-llama/Llama-3.3-70B-Instruct --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key key --max-model-len 16256 --tensor-parallel-size 2 I also need have some logs....

Container keeps restarting

Hey, I have a container that keeps restarting, not quite sure why - there's nothing in the logs (or they get deleted way too quickly when it restarts?). I'm using a custom template. The issue still remains with a long-running run command (e.g. /bin/bash -c "sleep infinity"). Any ideas what might be wrong?

Extremely slow upload to HuggingFace

For the past 10 hours or so I've had issues uploading to HuggingFace on my pod (jhw1d9hmjb8d3v). speedtest-cli shows acceptable speeds, but specifically uploading to HuggingFace goes below 1MB/s often....

Enable performance counter on runpod

Hi, I'm trying to profile some CUDA kernels on a pod with an A100 in order to improve their performance. Is there a way to enable the performance counters as per https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters on pods? I've tried to enable them by creating the necessary config files in /etc/modprobe.d, but to no avail. It seems that the permission needs to be enabled on the host: ``` When profiling within a container, access must be enabled on the host, or the container must be started with the appropriate permissions by passing --cap-add=SYS_ADMIN as an admin user....

The "Fine tune an LLM with Axolotl on RunPod" tutorial should mention uploading public key first

The tutorial is very useful, but would be even more so if it mentioned that to "connect to it over secure SSH", you have to provide your SSH public key beforehand so the created image will have it for you to make the first connection. That would help it further to be a self-contained article for this use case

Error response from daemon: container

After uploading my ED25519 SSH key, creating a pod (using the "winglian/axolotl-cloud:main-latest" image), and trying to SSH into it, I immediately get an
Error response from daemon: container [..id..] is not running
error after successfully authenticating to ssh.runpod.io with the public key. Looking at the pod's system logs, it says the image was created, but it doesn't get beyond starting it and the container log shows a "curl: no URL specified" error: ```...
Solution:
Quick update: this is because the template in the tutorial is no longer maintained, as the Git repo moved under axolotl. The working RunPod template for this is axolotlai/axolotl-cloud:main-latest, and the axolotlai template is currently working. I have raised this issue internally with our team, and we will get the template in the tutorial updated to point to this as well....

Templates view is broken

"Templates" on the website never stops loading: https://www.runpod.io/console/user/templates

Suspicious space consumption or volume disk not mounted

I provisioned a pod with 240GB volume disk, but that 240GB space is nowhere to be found.
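A quick way to see where the space went is to compare the filesystem sizes reported at the container root and at the volume mount point. This is only a sketch: `/workspace` is assumed here to be the volume mount path (the usual default, but templates can change it), and `describe_mount` is a hypothetical helper name.

```python
import shutil


def describe_mount(path):
    """Return total/used/free space in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "path": path,
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }


if __name__ == "__main__":
    # ASSUMPTION: the volume disk is mounted at /workspace.
    for path in ("/", "/workspace"):
        try:
            info = describe_mount(path)
        except FileNotFoundError:
            print(f"{path}: not present in this container")
            continue
        print(f"{info['path']}: {info['total_gib']:.1f} GiB total, "
              f"{info['free_gib']:.1f} GiB free")
```

If both paths report the same total, the volume is probably not mounted where you expect and writes are landing on the container disk instead.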

runpodctl get pod -a does not return the pods IP

Per the title. There is no way to get the IP address of a created Pod via command line
Solution:
I eventually found out that the GraphQL interface is more robust and you need to requisition a pod that can have a public IP, but not all can.
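As a rough illustration of the GraphQL route, the sketch below builds a pod query and pulls the public endpoints out of a response. The query shape and the `runtime.ports` field names (`ip`, `isIpPublic`, `privatePort`, `publicPort`) are assumptions based on Runpod's public GraphQL examples, and both function names are hypothetical; verify against the current schema.

```python
import json


def pod_ports_query(pod_id):
    """Build a GraphQL query string asking for a pod's runtime ports."""
    # ASSUMPTION: field names follow Runpod's public GraphQL examples.
    return f"""
    query {{
      pod(input: {{ podId: {json.dumps(pod_id)} }}) {{
        id
        runtime {{
          ports {{ ip isIpPublic privatePort publicPort type }}
        }}
      }}
    }}
    """


def public_endpoints(response):
    """Return (ip, public_port, private_port) for ports with a public IP."""
    pod = (response.get("data") or {}).get("pod") or {}
    # Treat a missing or null runtime (e.g. a pod that is not running)
    # as "no ports" rather than an error.
    runtime = pod.get("runtime") or {}
    ports = runtime.get("ports") or []
    return [
        (p["ip"], p["publicPort"], p["privatePort"])
        for p in ports
        if p.get("isIpPublic")
    ]
```

An empty list here is consistent with the point above that not every pod can have a public IP.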

Pod separation

Is there any way to separate pods by project/tag/team and use API keys for each?

Maintenance time

Start: 12/13/2024 14:31 Local Time End: 12/13/2024 18:30 Local Time. What is the local time? Where are you folks based? Which country?...

How often do pods get network speed tested?

I am setting up a pod and the reported upload speed is significantly lower than manually run speedtests. It also hasn't changed in hours which leads one to wonder how often this gets tested. Can a new test be triggered on demand?

Error: Unauthorized

Unable to create an available Pod via
$ runpodctl create pod --gpuType "1x NVIDIA A40" --imageName MedicineMan
Error: Unauthorized
...
Solution:
I discovered that the runpodctl interface is basically deprecated or not full-featured.