RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods-clusters

Pod stopped and I cannot re-access it

Hi 🙂 Forgive my noobness. My pod stopped because I didn't notice the funds running out. I re-upped the funds but now cannot connect to any GPUs. I need to access the volume. I would also like to continue using the setup if possible. How can I? What's the problem? Thanks in advance. Sorry for any dumbness....

Pod easily gets OOM!

I am using an 8xA40 instance. Pod id: k3urxcxexkj989 Even though I do not run any heavy tasks, just unzipping a file and uploading some data to the pod using scp commands, the pod frequently hits OOM issues. My pod has ~375GB of RAM, and I don't think my process caused the problem. Could you check out the issue? Thanks...

Spot instances disappearing??

I started a spot instance twice and after a while it's just gone??

I've installed Docker on an Ubuntu pod, but cannot start it with systemctl

I've installed Docker, but when running systemctl status docker I get an error: System has not been booted with systemd as init system. How can I start Docker within a RunPod Ubuntu pod?...

For network storage, how can I use big files in specific ComfyUI directories?

I assume Docker mounting to subfolders is not available (according to the AI docs helper), and symlinks failed due to permissions. But maybe I'm also completely on the wrong track here. What I want to do is keep a model file and a LoRA, say, in the storage; ComfyUI then requires those to be accessible within its specific folder structure...

I am trying to run ComfyUI on my second pod but I am getting error: RuntimeError: CUDA unknown error

I am trying to run ComfyUI on my second pod but I am getting this error: RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. I did not have this error on my first pod. How can I fix it?

US-OR-1 Net Volume Export

I am sorry that I did not notice any announcements about the US-OR-1 service shutting down. Today when I tried to use the data in this region, I had no way to access the network volume. Is there currently any way I can retrieve the data inside the volume? I read the docs and saw an option to export to external storage, but I could not find the entry for it. Any suggestions or information will be appreciated 🙂

Training runs 2-5x slower on pods than on home system.

Home system: 4090, 7950x, 64GB RAM, M.2 SSD. In comparison: 1x 4090: 2.5-3x slower on ALL ops. L40: 5x slower...

Completely lost connection to volume at CA-MTL from a different account just now

About ten minutes ago the serverless and pod services all went dead, hitting different errors related to file I/O. Error: [Errno 6] No such device or address...

how do I fork a community template

I want to increase the volume of a community template, how can I do this? I asked the AI and it said I can copy templates from the Explore Templates section, but I don't see any button/UI to do this unfortunately...

Authenticate to AWS ECR private repository

Hello, I have my private image on AWS ECR. I have created a new IAM role that can pull that image and I have filled in those credentials inside the template.
However, when I run a pod I'm getting some errors: error pulling image: Error response from daemon: unauthorized: Not Authorized What's the correct way to pull images from AWS ECR?...

Run commands on restart

I want to run some commands on restart that will not run on the initial start but on all the restarts after that. How can I do such a thing? I don't want to use the Container Start Command, since it will run on the initial start as well and would require me to reconfigure and start from scratch on my currently running machines.

Wandb giving 403 error

When running a training job on an L40 instance with a custom template, I get the following error: wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin 2025-02-25 03:27:06,293 - ERROR - 403 response executing GraphQL. 2025-02-25 03:27:06,293 - ERROR - ...

Can't stop my pod! Only terminate

I have an H100 SXM running on koboldcpp - I can't figure out how to stop the pod. There's no stop button in the UI. I can only terminate it. How do I simply pause it, so that I can unpause it later and continue?

I am trying to send my LoRA to RunPod but I keep getting 'room not ready' on the web terminal

Here is my input: 1. cd ComfyUI/models/loras 2. runpodctl receive ... And here is where the error arises and it says 'securing channel...room not ready'. I don't know if this is relevant, but the first time I tried this it worked but it only downloaded 70% of the way, so I restarted it. I don't know if I need to do it all over again or what. I've tried doing it with different pods but it will not work. Please help....

Jupyter

Please, I need your help, guys! I always start my pod and then connect to the ComfyUI port and the Jupyter port 8888. Now when I try to connect to Jupyter it does not connect: I put the link in the browser and nothing happens. Please help. The link is "https://<my pod -id>-8888.proxy.runpod.net/lab?

Using iptables with pods whilst maintaining Jupyter access

Has anyone managed to do this? I've been installing iptables and some rules in the container start command. I set a rule to drop all outgoing packets and then selectively add in exceptions. One of those exceptions is to be able to communicate on Jupyter's 8888 port. However, when I start the pod, I no longer have the option of connecting to the pod via Jupyter. Any ideas?...

Limit Memory Usage

Multiprocessing requires a lot of memory and the server just crashes when the threshold is reached (needing a pod restart). Is this the intended behavior? Is there a way I can prevent this from happening so I don't have to keep restarting the server? Perhaps a way to set a server-wide memory usage limit before the threshold is hit?

H100 pod not connecting to network drive of the same region

I have a dual H100 pod that's supposed to be connected to a network drive (both on CA-MTL-1), but when I try to move data, do a git status of a repo, or even start a Python script residing on the network drive, the terminal hangs. Seems like a network issue? I've tried spawning dual H100 pods multiple times, but I'm getting the same IP (probably the same hardware?), so nothing changes. Trying this out from a machine with an RTX A5000 works fine! Is there something I can do?...

Something wrong with the pytorch2.4.0 image's Jupyter

Most of the pods I created today using the pytorch2.4.0 template couldn't open JupyterLab, while 2.2.0 was fine. I wonder whether there were some updates to the Docker image.