RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

network volume

How can I copy my network volume from one region to another? Since the network is slow, downloading and installing everything again is a nightmare.

Kobold.cpp - Remote tunnel loads before the model, causing confusion (possible off-product issue)

Here's the log piece:
```
load_tensors: offloading 88 repeating layers to GPU
load_tensors: offloading output layer to GPU...
```

VRAM stuck at 77% usage

VRAM usage is stuck at 77% on 1 of my 4 GPUs. I've already tried a restart, a hard stop and start, and a reset, and it's still stuck. I don't want to switch pods because I have hundreds of GB of data on the volume that will take a long time to set up again. Is there anything else I can do? ID: ox02c3pvm058j3...

Restore_snapshot error.

Hello, has anyone seen an error like this? dockerf build error log Restore the snapshot to install custom nodes...

Choose CPU model on Pods

Hi everyone, I have to test AMD EPYC 9354 performance with our product and found that it is supported in serverless mode. Is it possible to have it on a pod? I only saw two options, CPU3 and CPU5, but the pods I started to check contained an older model....

Maintenance

Does this mean the server will be taken down during this time, or at the end of May, because the 6th of Feb has already passed?

Network Storage question

Hi, I am looking to create several GPU pods that all share the same shared network storage. When I go to create a network storage, it looks like I have to deploy a new GPU pod that is always running. How do I create a storage that doesn't rely on a GPU pod being always on? I want to be able to turn off these pods when they are not being used and use the shared storage when they turn back on.

no more full ssh? cannot connect vs code / cursor

hi, I used to be able to connect vs code to my pods over ssh by using the 'full ssh' option (supports scp & sftp). that option doesn't seem to be around any more? I have connected to multiple A40 pods and now an H100 over the last couple days and there's no full ssh option. is this going to come back? is there some new configuration needed?

runpodctl communityCloud + spot

How to create spot on community? start a pod from runpod.io Usage:...

Multiple Pods with same network storage, ports?

I'm trying to run multiple pods simultaneously, all connected to the same network storage. According to the info on the website this is possible, but I cannot connect to the second pod's Jupyter. Do I need to assign a different port to the second pod for this to work, and if so, how and which port? I've tried adding 8889 to https and tcp, but I probably had to specify it somewhere before trying to start Jupyter or something?

HTTP 502 on VLLM pod

I'm getting a 502 when trying to connect to the deployed service. Using the vllm-latest image and these arguments: --host 0.0.0.0 --port 8000 --model mistralai/Mistral-Small-24B-Instruct-2501 --dtype auto --enforce-eager --gpu-memory-utilization 0.95 --tensor-parallel-size 2 Using the ollama service doesn't have any issues. Any ideas?...

Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4

I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example:
runpod-vllm-nccl-diagnostic Observations - Environment: 2 x L40S GPU pods in US-TX-4 ...

Pod data loss after disk resize

I deployed an RTX 3090 pod on RunPod and was working on a project. I then edited the pod to increase the disk size. Afterward, I could no longer connect to the server, and all my files and folders seem to have disappeared. What can I do?

Can't connect to terminal or jupyterlab on runpod pytorch 2.1 or 2.4 template

This is for EU-RO region, I tried a different region it worked fine, but I need to use EU-RO because of network volume

NVIDIA Driver Selection

Is there a way to select which NVIDIA GPU driver my pod is using?

How to extend a pod with a savings plan

I purchased a pod with a 1-week savings plan. How can I extend the contract with a 1-month savings plan before the term ends? I want to keep the files and scripts in the current pod's disk volume.

securing channel...room not ready

In the web terminal I'm trying to import LoRAs with the ComfyUI with Flux.1 dev one-click template. When I paste in the code from my PC's terminal, I get the error "securing channel room not ready". The pod says everything is running and CPU is at 0. I have no tech skills, so I need help.

Pod unusable, extremely slow

░░░░░░░░░░░░░░░░░░░░ [0/7] Installing wheels... warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. If the cache and target directories are on different filesystems, hardlinking may not be supported. If this is intentional, set export UV_LINK_MODE=copy or use --link-mode=copy to suppress this warning. ██████████████░░░░░░ [5/7] torch==2.6.0
...

Jupyter doesn't open when I activate the pod

I tried to open pods with the runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04 image, but the JupyterLab server doesn't connect and shows 502 Bad Gateway even after I wait more than 5 minutes. However, the Web Terminal works fine, which confuses me even more: the server itself still works, only Jupyter doesn't. My pod ID is 0jct97bqf5ijws...
Solution:
However, even though the container and system logs from the affected pod look completely normal from outside (cross-checked through tickets I sent via the web page), it still doesn't work, so I think the problem is related to a version mismatch in something required to run JupyterLab.

Pytorch 2.4.0 ROCm 6.1 pod broken

When I create a pod with a pytorch image, the following error is displayed in the log. The same happens with other pytorch images. I have selected the MI300X GPU. create container runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04...