Bad file descriptor
ModuleNotFoundError: No module named 'diskcache'
pip install diskcache
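A common reason this error persists after installing is that pip installed the package into a different Python environment than the one running the code. The following is a minimal sketch for checking that, not a confirmed diagnosis for this thread; the helper names `module_available` and `pip_install_command` are hypothetical and were chosen for illustration.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be imported by the current interpreter."""
    return importlib.util.find_spec(name) is not None


def pip_install_command(package: str) -> list[str]:
    """Build a pip command tied to the interpreter that is running this script."""
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    if module_available("diskcache"):
        print("diskcache is importable from", sys.executable)
    else:
        # Assumption: installing with the same interpreter fixes the mismatch.
        print("diskcache is missing; try:", " ".join(pip_install_command("diskcache")))
```

Running the printed command uses `python -m pip`, so the package lands in the same environment that failed to import it.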
Blocking ICMP?
My pod has randomly crashed several times today, and I received emails about RunPod issues.
Can't access JupyterLab

This is the third time, and there has been no support for this issue. I lost all of my credits and time.

Will 2 GPUs fine-tune 2 times faster than 1 GPU on axolotl?
Very slow download via JupyterLab
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally.
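Since the error message itself points at shared memory, one quick thing to check is how much space the container's shared-memory mount has. This is only a diagnostic sketch: it assumes the mount is at `/dev/shm` (the usual Linux location), and the helper name `shm_free_gib` is hypothetical.

```python
import shutil
from pathlib import Path


def shm_free_gib(path: str = "/dev/shm"):
    """Return free space on the given mount in GiB, or None if the path is absent."""
    if not Path(path).exists():
        return None
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    free = shm_free_gib()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm free space: {free:.2f} GiB")
```

If the reported size is small, lowering the PyTorch `DataLoader` `num_workers` value (0 loads data in the main process) may work around the error, at the cost of slower data loading.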
I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem....
SSH Connection Refused
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
with 6xH100s. I added my public key to
bash -c 'apt update;DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;mkdir -p ~/.ssh;cd $_;chmod 700 ~/.ssh;echo "$PUBLIC_KEY" >> authorized_keys;chmod 700 authorized_keys;service ssh start;sleep infinity'
(of course I replaced $PUBLIC_KEY
with mine), logged into the machine using the web terminal, and checked that the authorized_keys file is correct. Yet I get connection refused when trying to connect. This is not the first pod I have set up on RunPod (A100s and A40s both worked fine before, but this is my first time with H100s)....
Unable to connect to JupyterLab

Web terminal keeps closing connection for no reason
"We have detected a critical error on this machine which may affect some pods." Can't backup data

Operation not permitted - Sudo access missing
python3-venv
on my runpod instance.
However, I am getting a bunch of sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
messages, and the install ultimately finishes with ModuleNotFoundError: No module named 'apt_pkg'
Python was not installed, however. If I try sudo -v
it shows:
sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
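The message suggests that sudo tried to change the open-file limit and the container did not allow it. As a hedged diagnostic sketch (it only reads the limits and changes nothing), the current values can be inspected from Python with the standard `resource` module; the helper names `nofile_limits` and `describe` are hypothetical.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value: int) -> str:
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"RLIMIT_NOFILE soft={describe(soft)} hard={describe(hard)}")
```

If the web terminal already runs as root, running the install command without sudo may avoid the warning, though that is an assumption about this pod rather than something the post confirms.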
Is there a way to run more than 1 image in a pod?
Slow model loading on some instances
ulimit increase?