RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!



⛅|pods-clusters

Bad file descriptor

I deployed several CPU pods with a network volume, and at first they work well. But after a few hours, on some of them I get a "Bad file descriptor" error when I try to access "/workspace"...
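
A few quick checks from the pod's terminal can help narrow down whether the network volume mount has gone stale. This is only a sketch, assuming /workspace is the usual RunPod mount point; adjust if yours differs.
# Is the volume still mounted and readable?
df -h /workspace
mount | grep workspace
# A stale mount usually surfaces here as an I/O or "Bad file descriptor" error
ls -la /workspace
# If the mount is stale, restarting the pod from the console re-mounts the volume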

ModuleNotFoundError: No module named 'diskcache'

Receiving this error when trying to run the Stable Diffusion cell in the Jupyter notebook for RunPod's Fast Stable Diffusion.
Solution:
Try this in your pod CLI:
pip install diskcache
...

Blocking ICMP?

I'm trying to set up monitoring for the RunPod instance I've rented and can't seem to ping it. It looks like you're only allowing TCP connections? If so, is there any way I can get around this?
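
If ICMP really is filtered, monitoring can usually fall back to a TCP-level check against a port the pod already exposes. A minimal sketch with placeholder host/port values:
# TCP "ping" against an exposed port (e.g. the mapped SSH or HTTP port)
nc -zv -w 5 <pod-ip> <exposed-port>
# Or measure HTTP reachability/latency if the pod serves a web endpoint
curl -o /dev/null -s -w "%{time_total}s\n" http://<pod-ip>:<exposed-port>/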

My pod has randomly crashed several times today, and I received emails about RunPod issues.

Today, my pod has crashed a few times, to the point where I'm receiving emails from RunPod about the issues. How can I fix this?
Solution:
@rethinkstudios#001 apt-get install google-perftools...
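
For reference, the suggested fix presumably amounts to installing google-perftools and preloading tcmalloc before launching the app. A sketch, assuming the usual Ubuntu library path (it may differ on your image) and a hypothetical entry point:
apt-get update && apt-get install -y google-perftools
# Preload tcmalloc for the process that keeps crashing (path may vary by distro)
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
python launch.py   # hypothetical entry point, replace with your own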

Can't access JupyterLab

I can't access JupyterLab. I can still use the SD web UI, but I can't access my data. Is there some way I can recover my workspace?
No description

This is the third time with no support for this issue; I have lost all of my credits and time.

I urge you to stay away from the RunPod system: one day you will no longer have access to all the data you have put into the Secure and Community Cloud. I urge you not to use it, because there is no solution. I'm going to write this to all our communities that use it. Thank you for not even replying to messages or issues.
No description

Spend limit

Hi, I'm here for the first time. How can I raise the $30 per hour limit on my account?...

Will 2 GPUs fine-tune 2 times faster than 1 GPU on axolotl?

Will 2 GPUs fine-tune 2 times faster than 1 GPU on axolotl?
Solution:
It seems
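
In practice the speed-up is rarely a clean 2x: you only get close when both GPUs are actually used in parallel (e.g. data-parallel via accelerate) and the run isn't bottlenecked by data loading or gradient sync. A minimal sketch of a two-GPU axolotl launch, assuming a config file named config.yml:
# Launch axolotl across both GPUs with accelerate (data-parallel)
accelerate launch --num_processes 2 -m axolotl.cli.train config.yml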

Very slow download via JupyterLab

Hey, I need to transfer rather large files from my pod to my local machine. I am unsure how to set up SFTP (maybe that's faster?). Restarting doesn't fix the issue. What else can I try?
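
Pulling the files over the pod's SSH connection with scp or rsync is usually much faster and more robust than downloading through the JupyterLab UI. A sketch with placeholder address, port, and paths; use the connection details shown on your pod's Connect panel:
# Copy a file from the pod to the local machine over the pod's exposed SSH port
scp -P <ssh-port> -i ~/.ssh/id_ed25519 root@<pod-ip>:/workspace/big_file.tar .
# rsync can resume interrupted transfers and show progress
rsync -avP -e "ssh -p <ssh-port> -i ~/.ssh/id_ed25519" root@<pod-ip>:/workspace/data/ ./data/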

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory

Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally. I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem....
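
The bus error typically comes from PyTorch DataLoader workers exchanging tensors through /dev/shm, which is fixed-size inside the container. A sketch of what to check and try; the file_system sharing strategy is the workaround referenced in the linked PyTorch docs:
# See how much shared memory the container actually has
df -h /dev/shm
# Confirm the sharing strategy can be switched in this environment
python -c "import torch.multiprocessing as mp; mp.set_sharing_strategy('file_system'); print(mp.get_sharing_strategy())"
# In the training script, call set_sharing_strategy('file_system') before creating DataLoaders,
# or reduce num_workers / batch size so the workers fit in the available shm.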

SSH Connection Refused

I'm using template runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 with 6xH100s. I added my public key to bash -c 'apt update;DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;mkdir -p ~/.ssh;cd $_;chmod 700 ~/.ssh;echo "$PUBLIC_KEY" >> authorized_keys;chmod 700 authorized_keys;service ssh start;sleep infinity' (of course I replaced $PUBLIC_KEY with mine), logged into the machine using the web terminal, and checked that authorized_keys is correct. Yet I get "connection refused" when trying to connect. This is not the first RunPod pod I've set up (I used A100s and A40s before and both worked fine, but this is the first time with H100s)....
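
Since the web terminal still works, it's worth confirming from inside the pod that sshd is actually running and listening; the start command only launches it once, so if it exited it won't come back on its own. A sketch (ss comes from iproute2 and may need installing on some images):
# From the web terminal: is sshd running and listening on port 22?
service ssh status
ss -tlnp | grep :22
# Check that the key landed where sshd expects it, with sane permissions
cat ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
# Restart the daemon after any changes
service ssh restart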

Unable to connect to JupyterLab

It seems like JupyterLab has crashed on my pod after a job that had been running for around 2 days. This is unfortunate. Is there any way I can restart JupyterLab so that I can resume training? Is it also possible that my process may still be running despite JupyterLab having crashed?
No description
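
Processes started from a notebook can keep running even after the JupyterLab server itself dies. From the web terminal or SSH you can check for the training process and relaunch JupyterLab. A sketch; the port and flags are assumptions, so match them to how your pod originally launches it:
# Is the training process still alive?
ps aux | grep -i python
# Relaunch JupyterLab in the background so it survives the terminal session
nohup jupyter lab --allow-root --ip=0.0.0.0 --port=8888 --no-browser > /workspace/jupyter.log 2>&1 &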

Web terminal keeps closing connection for no reason

I have an on-demand GPU pod deployed, and I'm running a shell script that trains a model through the web terminal. Like clockwork, roughly every 1h40m the web terminal dies with the message "Connection closed", for seemingly no reason. This is very frustrating, as I'm paying for on-demand specifically because I want to be able to leave it training for a long period unattended. What can be done to fix this?
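
The web terminal session itself isn't suited to long unattended runs; running the script inside tmux (or under nohup) keeps the training alive even when the terminal connection drops. A sketch, assuming a hypothetical train.sh:
apt-get update && apt-get install -y tmux   # skip if tmux is already in the image
tmux new -s train        # start a named session
bash train.sh            # run the training inside it
# Detach with Ctrl-b then d; later, reattach from any new web terminal with:
tmux attach -t train
# Alternative without tmux:
nohup bash train.sh > train.log 2>&1 &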

No module named 'axolotl.cli'

I get No module named 'axolotl.cli'
No description
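
This usually means axolotl isn't installed into the Python environment you're launching from. A sketch of the usual fix, assuming the repo is (or gets) cloned under /workspace; the repo URL reflects the OpenAccess-AI-Collective location:
cd /workspace
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git   # skip if already cloned
cd axolotl
pip install -e .
# Verify the module is importable before launching training
python -c "import axolotl.cli; print('ok')"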

"We have detected a critical error on this machine which may affect some pods." Can't backup data

During a training run with 8xH100, I started seeing strange "Directory not found" errors in my Jupyter notebook that could not be dismissed (they kept popping up). Although my training run continued and completed, I wasn't able to copy the data off of the volume disk because the modals blocked every operation. I looked into the deployment and saw the error "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." Unfortunately, nothing I've tried to get my data back has worked: reconnecting to the notebook, the Web Terminal, SSH (both options), and even stopping and starting the pod all fail. ...
No description

Operation not permitted - Sudo access missing

Hi, I am currently trying to install python3-venv on my RunPod instance. However, I am getting a bunch of sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted messages, and ultimately the install finishes with ModuleNotFoundError: No module named 'apt_pkg', and the package was not installed. If I try sudo -v it shows:
sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
...
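
RunPod pods normally run as root inside the container, so sudo isn't needed at all; the setrlimit warnings come from sudo itself hitting container limits. A sketch of installing the package directly as root:
# You are already root inside the container, so drop sudo entirely
apt-get update
apt-get install -y python3-venv
# Then create the virtual environment as usual
python3 -m venv /workspace/venv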

Download Mixtral from HuggingFace

How can I download this model to my pod?
No description
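
One way to pull the weights straight into the pod is the huggingface_hub CLI. A sketch; the repo id shown is an assumption (the Mixtral instruct checkpoint), so swap in the exact model you need and add a token if it's gated:
pip install -U "huggingface_hub[cli]"
# Download the whole repo into the persistent volume
huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1 --local-dir /workspace/mixtral --local-dir-use-symlinks False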

Is there a way to run more than 1 image in a pod?

I would like to add a monitoring sidecar container running inside a pod, alongside the app container. Is there a way to do this?
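
A RunPod pod runs a single container, so there is no true sidecar; the usual workaround is to start the monitoring agent as a second process inside the same image, e.g. from the container start command or an entrypoint script. A sketch with hypothetical agent and app paths:
#!/bin/bash
# start.sh - launch the monitoring agent alongside the main app in one container
/opt/monitoring-agent --config /etc/agent.yml &   # hypothetical agent, runs in the background
exec python /app/main.py                          # hypothetical main app, keeps the foreground/PID 1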

Slow model loading on some instances

I am using ComfyUI, and some pod instances take an extremely long time to load a model. I am using A100 and H100. For testing, I tried to load a simple diffusers pipeline on the same pods, and it also loads very slowly. I have tried different torch versions and different CUDA versions too...
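
Since a bare diffusers pipeline is also slow on the affected pods, disk read throughput is a more likely culprit than the GPU. A quick way to compare instances; the checkpoint path here is hypothetical:
# Measure raw sequential read speed of the checkpoint from the volume
dd if=/workspace/models/model.safetensors of=/dev/null bs=1M status=progress
# Compare the container disk vs. the network volume if the model lives on the latter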

ulimit increase?

I have a pod that runs a binary which tries to set the ulimit but fails. Is there any way I can increase it?
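
Inside the container you can only raise soft limits up to the hard limit the host set for the pod; anything above that has to be changed on the Docker/host side. A sketch of what you can check and try yourself:
# Current soft and hard limits for open files
ulimit -Sn
ulimit -Hn
# Raise the soft limit up to the hard limit for this shell and its children
ulimit -n "$(ulimit -Hn)"
# If the hard limit itself is too low, that needs to be raised on the host/template side (RunPod support)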