RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

A100 PCIe is not working with EU-RO-1 storage.

I have created storage (a network volume in EU-RO-1) where the A100 PCIe shows as available, but I get an error while deploying the pod: "There are no longer any instances available with the requested specifications. Please refresh and try again." What am I doing wrong?...
Solution:
The GPU is probably taken or low on stock, so there are none currently available in that region.

Error when syncing with Backblaze

I'm getting "Something went wrong!" most of the time when syncing with Backblaze. It sometimes works, so it doesn't seem to be a credentials issue. There is no other info in the error popup.

Import backup from volume disk to Network volume

Hello! Right now I have a pod with Stable Diffusion installed; all my files are in Jupyter and I am using a normal disk. I would like to transfer all the data from this volume to a network volume (for price reasons). What would be the best way to do it?...
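
One common approach, once both locations are visible from the same pod, is a resumable copy with rsync. The paths below are purely illustrative placeholders — adjust them to where the old container-disk data and the network-volume mount actually live (network volumes are often mounted at /workspace):

```shell
SRC=/old-disk/stable-diffusion   # hypothetical: data on the old container disk
DST=/workspace/stable-diffusion  # hypothetical: mount point of the network volume

# rsync preserves permissions/timestamps and can resume if interrupted;
# fall back to cp -a if rsync is not installed in the image
rsync -aP "$SRC"/ "$DST"/ \
  || cp -a "$SRC"/. "$DST"/ \
  || echo "adjust SRC and DST to your actual paths"
```

The trailing slash on `"$SRC"/` matters: it copies the directory's contents rather than nesting the directory itself inside the destination.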

Pod is stuck on network outage message, no changes for quite a while.

Our pod has been having network issues for a while now (I first saw it yesterday afternoon). I also recently purchased a savings plan for this pod (id nx9twh8ikfjru8), so I am not sure what will happen if I try to recreate it. There is also probably some data outside of the /workspace directory (I know, not a good idea...). Any way to check what is going wrong here?...

authorized_keys not working on runpod

I've deployed a RunPod server and added an SSH key in user settings; that key works for SSH. But when I add a new public key to ~/.ssh/authorized_keys directly in the terminal, that key pair does not work for SSH....
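
A frequent cause of this symptom is permissions: sshd (with its default StrictModes) silently ignores ~/.ssh/authorized_keys when the directory or file is group/world writable. A sketch of the usual fix, assuming the key line itself was appended correctly:

```shell
# create the directory/file if missing, then tighten permissions —
# sshd ignores authorized_keys when ~/.ssh or the file is too open
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# verify ownership and modes
ls -ld ~/.ssh ~/.ssh/authorized_keys
```

If that doesn't help, `ssh -v -i /path/to/new_key` from the client shows why a key is being rejected. Note also that the platform may rewrite authorized_keys from account settings when the pod restarts, so adding the key in user settings is the more durable route.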

Runpod is not utilizing GPU and showing zero GPUs

I am currently running a RunPod instance with A40 GPUs and the PyTorch template. When I check GPUs in the Jupyter notebook using list(range(torch.cuda.device_count())) or print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU'))), it shows zero GPUs. I also want to know if there is a template for TensorFlow 2; I couldn't find one, so I am currently using the PyTorch template. It would be very helpful if someone could help me with this issue. I am stuck in the middle and need to finish my project as soon as possible....
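
Before debugging the frameworks, it can help to check whether the driver sees the cards at all and whether an environment variable is masking them. A quick diagnostic sketch (assuming an NVIDIA base image with nvidia-smi on the PATH; the last line assumes PyTorch is installed):

```shell
# 1) Does the driver see the cards? Should print one line per A40.
nvidia-smi --list-gpus || echo "driver not visible from this container"

# 2) Is something masking them? An empty value hides all GPUs from CUDA apps.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# 3) What does PyTorch itself report? A CPU-only wheel prints cuda=None.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())" \
  || echo "torch not importable in this environment"
```

If nvidia-smi lists the GPUs but PyTorch reports zero, the usual culprit is a CPU-only torch build or a masked CUDA_VISIBLE_DEVICES, not the hardware.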

Stuck on "Waiting for logs"

I've tried everything: switching GPU, switching regions, creating new storage, changing browsers, switching accounts, clearing history, restarting WiFi — nothing has worked. What's weird is my colleagues can connect on the shared account and are able to initialize a pod in this same manner, which I can then access. Obviously asking my colleague to start up a pod every time is not sustainable. I'm pretty sure the log is being initialized but I can't access it as I don't have the access code....

Multiple Pods SSH Resolving to Same Machine

I'm trying to connect to multiple community cloud pods simultaneously through VSCode. They happen to have the same public IP with different ports, but they seem to be sharing resources (e.g. storage and GPU). It seems like it could be an issue with VSCode, since connecting from a generic SSH terminal doesn't have this problem, but I'm wondering if there is a known workaround.

Network errors in Secure Cloud

Hello, I am using Secure Cloud to serve inference for an LLM. Can someone explain what these messages mean? Is this the infra's fault or mine? Is there any roadmap for improving network reliability?...

Pod with Comfy (flux + stable diffusion)

Hello! Right now I have a pod with stable-diffusion:web-ui-10.2.1, and I want a single pod where I can choose between the Flux dev version and stable-diffusion:web-ui-10.2.1. I heard ComfyUI allows both, but I'm not clear on it — can you recommend the best template for my requirements? I don't know if I can add Comfy to my current Stable Diffusion pod; if I create another pod I will have to move all my files over, and that will take long 😦...

Changed Log output on the Runpod website

We are using FastAPI in one of our applications on your pods. For the past couple of days the FastAPI log output has not been displayed in the website's log window; to see it, I now have to start FastAPI via the terminal. Have there been recent changes to the way logs are displayed on the RunPod website?...

How do I find my network volume with runpodctl?

How do I find my network volume with runpodctl?

Network outage, please fix it

My pod is not working, please fix it.

Cannot see logs on my pods

I can only see queue time but cannot see logs on my pods. Is anyone else facing this issue as well?

Storage Pricing

How is storage pricing calculated? Is it billed per month altogether, per day, or per minute like pods?

Any network issues in EU-RO-1?

My git clone is running at 32 KiB/s and I can't copy from S3 (it's very slow). apt-get is also slow (same speed as git). But downloading files seems to work as expected (I got 33 MiB/s)...

I'm seeing 93% GPU Memory Used even in a freshly restarted pod.

Not sure what to do about this. nvidia-smi shows there are no processes running, but when I try to run a job it shows "Process 1726743 has 42.25 GiB memory in use". How do I find and kill that?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 44.52 GiB of which 18.44 MiB is free. Process 1726743 has 42.25 GiB memory in use. Process 3814980 has 2.23 GiB memory in use. Of the allocated memory 1.77 GiB is allocated by PyTorch, and 53.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
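
When nvidia-smi's process table is empty but memory is still held, the owning process often belongs to another namespace (e.g. a neighboring container on a shared card), in which case only support or a redeploy can clear it. When the PID is visible from inside the pod, a sketch of finding and killing it — the PID below is the one from the error message, used purely for illustration:

```shell
# list compute processes the driver knows about (may be empty inside a container)
nvidia-smi --query-compute-apps=pid,used_memory --format=csv \
  || echo "nvidia-smi not available here"

# if the stale PID is visible from this namespace, terminate it
kill -9 1726743 2>/dev/null \
  || echo "PID not visible from this pod; contact support or redeploy"

# alternatively, kill everything holding the NVIDIA devices (if fuser exists):
# fuser -k /dev/nvidia* 2>/dev/null
```

If `kill` reports the process doesn't exist yet memory stays allocated, the holder is outside your container and there is nothing to kill from inside the pod.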

Persistence in pod logs from my training

I started my pod instance, attached the volume where my dataset is located, and cloned my repository from GitHub using the VS Code integration. I left home and my laptop went to sleep. When I came back, my training had stopped and the session was disconnected.
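
A remote VS Code session dies when the laptop sleeps, and any process started inside it dies with it. The usual workaround is to detach the training from the SSH session with nohup or tmux — `train.py` below is a hypothetical stand-in for the actual training entry point:

```shell
# option 1: nohup detaches the process from the terminal, so it survives
# the SSH session disconnecting; output goes to a log file you can tail later
nohup python train.py > train.log 2>&1 &
echo "started PID $!"

# option 2 (if tmux is installed): run inside a named session and
# reattach after reconnecting with `tmux attach -t train`
# tmux new -s train -d 'python train.py'

# follow progress after reconnecting:
# tail -f train.log
```

Writing the log (and checkpoints) under the attached volume, rather than the container disk, also means they survive a pod restart.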

Custom template

Hi there! I'm trying to make a custom CPU docker-based template, but something is wrong. Locally the image starts fine and I don't have any problems, but the same image won't run as a pod. I'm wondering what I'm doing wrong, because it is a really simple app...

Help Request: ODM Container Only Using CPU

Has anyone tried to deploy an ODM processing node in a pod before? https://github.com/OpenDroneMap/NodeODM How do I add --gpus all to the pod?...
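
A sketch of the distinction, with the image tag assumed from the NodeODM README (verify it there before relying on it):

```shell
# Hypothetical: the GPU-enabled NodeODM image tag, per the NodeODM docs
IMAGE="opendronemap/nodeodm:gpu"

# On plain Docker you pass the GPU flag yourself:
CMD="docker run -p 3000:3000 --gpus all $IMAGE"
echo "$CMD"

# On RunPod there is no docker run command to edit: set $IMAGE as the
# template's container image, and the pod runtime injects the GPU for you.
# Verify from inside the running pod with: nvidia-smi
```

In other words, `--gpus all` is a host-side Docker flag, and on a managed pod the platform plays the host role.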