Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

benjamin_n

2/27/2025

Spot instances disappearing??

I started a spot instance twice and after a while it's just gone??

spacedust27

2/27/2025

I've installed Docker on an Ubuntu pod, but cannot start it with systemctl

I've installed Docker, but when running systemctl status docker I get an error: System has not been booted with systemd as init system. How can I start Docker within a Runpod Ubuntu pod?...

benjamin_n

2/27/2025

For network storage, how can I use big files in specific ComfyUI directories?

I assume Docker mounting to subfolders is not available (according to the AI docs helper), and symlinks failed due to permissions. But maybe I'm also completely on the wrong track here. What I want to do is keep a model file and a LoRA, say, in the storage, while ComfyUI requires those to be accessible within its specific folder structure...
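One way the symlink idea from this post could look is sketched below. It is only an illustration: `link_models` is a hypothetical helper, and both directory paths in the usage note are assumptions, not paths confirmed by the post. It also does not address the permission failure the poster ran into; if the process cannot write to the ComfyUI folder, `os.symlink` will still raise `PermissionError`.

```python
import os
from pathlib import Path


def link_models(storage_dir: Path, comfy_dir: Path) -> list[Path]:
    """Symlink every file in storage_dir into comfy_dir (hypothetical helper).

    Existing files or links in comfy_dir are left untouched.
    """
    comfy_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(storage_dir.iterdir()):
        if not src.is_file():
            continue
        dst = comfy_dir / src.name
        # Skip anything already present, including dangling links.
        if dst.exists() or dst.is_symlink():
            continue
        os.symlink(src, dst)
        created.append(dst)
    return created
```

A call might look like `link_models(Path("/workspace/models/loras"), Path("ComfyUI/models/loras"))`, where both paths are placeholders for wherever the network volume and ComfyUI actually live on the pod.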

Kamil

2/27/2025

I am trying to run ComfyUI on my second pod but I am getting error: RuntimeError: CUDA unknown error

I am trying to run ComfyUI on my second pod but I am getting error:

RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

I did not have this error on my first pod. How can I fix it?

BigCatc

2/27/2025

US-OR-1 Net Volume Export

I am sorry that I did not notice any announcements about the US-OR-1 service shutting down. Today when I tried to use the data in this region, I had no way to access the network volume. Is there currently any way I can retrieve the data inside the volume? I read the docs and saw an option to export to external storage, but I did not find the entry. Any suggestion or information will be appreciated 🙂

JoeG

2/26/2025

Training runs 2-5x slower on pods than on home system.

Home system: 4090, 7950X, 64GB RAM, M.2 SSD. In comparisons: 1x 4090: 2.5-3x slower on ALL ops. L40: 5x slower...

coding_achilles

2/26/2025

Completely lost connection to volume at CA-MTL from different accounts right now

About ten minutes ago the serverless and pod services all went dead, with different errors related to file I/O. Error: [Errno 6] No such device or address...

FackJox

2/25/2025

how do I fork a community template

I want to increase the volume of a community template, how can I do this? I asked the AI and it said I can copy templates from the Explore Templates section, but I don't see any button/UI to do this unfortunately...

tampe125

2/25/2025

Authenticate to AWS ECR private repository

Hello, I have my private image on AWS ECR. I have created a new IAM role that can pull that image and I have filled in those credentials inside the template.
However, when I run a pod I'm getting some errors: error pulling image: Error response from daemon: unauthorized: Not Authorized. What's the correct way to pull images from AWS ECR?...

John lanser

2/25/2025

Run commands on restart

I want to run some commands on restart: they should not run on the initial start, but on all the restarts after that. How can I do such a thing? I don't want to use the Container Start Command, considering it will run initially as well and would require me to reconfigure and start from scratch on my currently running machines.
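A common pattern for "skip the first start, run on every later start" is a marker file kept on storage that survives restarts. The sketch below assumes such a persistent path exists and that something invokes the script on each boot; `run_on_restart_only` and the marker location are hypothetical names, not a Runpod feature.

```python
import subprocess
from pathlib import Path


def run_on_restart_only(marker: Path, commands: list[list[str]]) -> bool:
    """Run commands only when the marker file already exists.

    Returns False on the first start (marker created, nothing run)
    and True on later starts (commands executed in order).
    """
    if not marker.exists():
        # First start: record that we have booted once, then do nothing.
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
        return False
    for cmd in commands:
        # check=True stops at the first failing command.
        subprocess.run(cmd, check=True)
    return True
```

On a pod that is already running, creating the marker file by hand beforehand would make the next restart count as a restart rather than as a first start.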

shishito pepprito

2/25/2025

Wandb giving 403 error

When running a training job on an L40 instance with a custom template, I get the following error:

```
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
2025-02-25 03:27:06,293 - ERROR - 403 response executing GraphQL.
2025-02-25 03:27:06,293 - ERROR - ...
```

nycguy54

2/24/2025

Can't stop my pod! Only terminate

I have an H100 SXM running on koboldcpp - I can't figure out how to stop the pod. There's no stop button in the UI. I can only terminate it. How do I simply pause it, so that I can unpause it later and continue?

Lucia

2/24/2025

I am trying to send my LoRA to runpod but I keep getting 'room not ready' on the web terminal

Here is my input: 1. cd ComfyUI/models/loras 2. runpodctl receive ... And here is where the error arises and it says 'securing channel...room not ready'. I don't know if this is relevant, but the first time I tried this it worked, but it only downloaded 70% of the way, so I restarted it. I don't know if I need to do it all over again or what. I've tried doing it with different pods but it will not work. Please help....

bubu23

2/24/2025

Jupyter

Please, I need your help, guys! I always start my pod and then connect to the ComfyUI port and the Jupyter port 8888. Now when I try to connect to Jupyter it does not connect: I put the link in the browser and nothing happens. Please help. The link is https://<my pod -id>-8888.proxy.runpod.net/lab?

runpodrobert

2/21/2025

using iptables with pods whilst maintaining Jupyter access

Has anyone managed to do this? I've been installing iptables and some rules in the container start command. I set a rule to drop all outgoing packets and then selectively add in exceptions. One of those exceptions is to be able to communicate on Jupyter's 8888 port. However, when I start the pod, I no longer have the option of connecting to the pod via Jupyter. Any ideas?...

KamKam

2/21/2025

Limit Memory Usage

Multiprocessing is requiring a lot of memory and the server just crashes when the threshold is reached (needing a pod restart). Is this the intended interaction? Is there a way that I can prevent this from happening so I don't have to keep restarting the server? Perhaps a way to set a server-wide memory usage limit before the threshold is hit?
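One possible approach on Linux is to launch the job under a per-process address-space limit, so that an oversized allocation fails inside the job instead of taking the whole container down. The sketch below uses the standard `resource` and `subprocess` modules; `run_with_memory_cap` is a hypothetical helper, and the limit value is for the caller to pick. Note that `RLIMIT_AS` caps virtual address space rather than resident memory, and that it applies to each process separately: child processes inherit the same cap, but it is not a combined, server-wide limit.

```python
import resource
import subprocess


def run_with_memory_cap(cmd: list[str], max_bytes: int) -> int:
    """Run cmd with its address space capped at max_bytes (Linux only).

    Returns the exit code of the command.
    """
    def _cap() -> None:
        # Runs in the child just before exec; keep the hard limit unchanged
        # and never ask for a soft limit above it.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=_cap).returncode
```

For example, `run_with_memory_cap(["python", "train.py"], 32 * 1024**3)` would start a placeholder `train.py` with a 32 GiB cap; allocations beyond the cap surface as `MemoryError` in Python code rather than as a pod crash.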

const

2/21/2025

H100 pod not connecting to network drive of the same region

I have a dual H100 pod that's supposed to be connected to a network drive (both on CA-MTL-1), but when I try to move data, do a git status of a repo, or even start a Python script residing on the network drive, the terminal hangs. Seems like a network issue? I've tried spawning dual H100 pods multiple times, but I'm getting the same IP (probably the same hardware?), so nothing changes. Trying this out from a machine with an RTX A5000 works fine! Is there something I can do?...

coding_achilles

2/21/2025

something wrong with the pytorch2.4.0 image's Jupyter

Most of my pods created today using the pytorch2.4.0 template couldn't open Jupyter Lab, while 2.2.0 was fine. I wonder if there were some updates to the Docker image.

freedomk520

2/21/2025

4 x A40 never ready in CA

Created a 4 x A40 pod today in CA, but the pod never reaches the ready state: no logs, no connection...

Milad

2/20/2025

Unable to connect to pod after launching H100s

This seems to be happening consistently today, every time we launch an H100 GPU
