RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


How to tell how much storage is being used in a pod? (including network drive)

I tried df -h, but it seems to show the whole filesystem.
```
(base) root@f3165c77df52:/workspace# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G  8.9G   22G  30% /
tmpfs            64M     0   64M   0% /dev
```
...
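For a quick picture, one option is to check each mount point separately; a sketch assuming the network volume is mounted at /workspace (adjust the path if yours differs):
```
# Container disk (the overlay filesystem the pod itself runs on)
df -h /

# Network volume, assuming it is mounted at /workspace
df -h /workspace

# What your own files actually occupy on the network volume
du -sh /workspace
```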

Can't see training progress after reset

Hello, I started a new training run in a notebook and then my computer restarted. After restarting, I signed in to my RunPod account and opened the training instance, but I can't see any progress anymore. It shows GPU memory in use, but how can I see the training progress?
Solution:
Jupyter notebooks do not save cell output when the browser tab is closed. The job keeps running on the kernel, though, and once it finishes it should update the cell.
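If the run has to survive a browser disconnect, one common workaround is to launch it outside the notebook and log to a file; a sketch, where train.py stands in for your own script:
```
# Start training detached from the browser session, logging to disk
nohup python train.py > /workspace/train.log 2>&1 &

# Re-attach to the progress output at any time, even after the tab was closed
tail -f /workspace/train.log
```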

Maintenance - only a Community Cloud issue?

Hey there! I just started a new pod and noticed this maintenance window. Is this only a thing on community cloud or also on secure cloud?...

SDK GPU naming specification

When I am setting up a pod using the SDK, how specific does the GPU name have to be? Is there a list of valid names?
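One way to get the exact identifiers is to ask the API for its GPU types; a sketch assuming the documented GraphQL gpuTypes query and a RUNPOD_API_KEY environment variable holding your API key:
```
# List GPU type IDs the API accepts (e.g. "NVIDIA A40"), with display names and VRAM
curl --silent --request POST \
  --url "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
  --header 'Content-Type: application/json' \
  --data '{"query": "query GpuTypes { gpuTypes { id displayName memoryInGb } }"}'
```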

How to get a general idea of max volume size on Secure Cloud?

I have been able to deploy 2TB drives, but what is the standard here? How much storage is there generally per server, so I can estimate what I should expect to be able to get?

Template pytorch-1.13.1 lists CUDA 11.7.1 but is actually CUDA 11.8?

I tried running a model that requires pytorch-1.13.1 and CUDA 11.7, but it said the CUDA version doesn't match (the pod is actually on 11.8). The mismatch check happens in the deepspeed package. I tried starting up a new pod with the same template, ran nvcc --version, and it said the pod was on CUDA version 11.8. Is this normal or an error? I can't seem to run my model because of the CUDA version mismatch. For reference, I'm using an A40....
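A quick way to see what the pod and PyTorch each report, plus the DeepSpeed switch for skipping that check; the DS_SKIP_CUDA_CHECK variable is an assumption about your DeepSpeed version, so verify it is honoured before relying on it:
```
# CUDA toolkit installed in the container image
nvcc --version

# CUDA version PyTorch was built against (this is what DeepSpeed compares)
python -c "import torch; print(torch.version.cuda)"

# Ask DeepSpeed to skip its CUDA-version mismatch check (use with care)
export DS_SKIP_CUDA_CHECK=1
```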

Can't connect to SFTP

Hi, I can't access SFTP. On my previous pod I could do it, and I just swapped the IP and the port, but now it doesn't work. Is there a problem on RunPod's side?
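For reference, connecting over the pod's public IP usually looks like this; note that sftp takes the port with a capital -P, unlike ssh (the IP and port are placeholders):
```
# SFTP to a pod over its exposed public IP and TCP port (placeholders)
sftp -P <port> -i ~/.ssh/id_ed25519 root@<pod-public-ip>
```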

Unable to SSH into my pod with the public key already on the RunPod server

I am unable to ssh into the pod when using the command from RunPod's site:
```
name .ssh % ssh root@194.68.245.27 -p 22138 -i ~/.ssh/id_ed25519
ssh: connect to host 194.68.245.27 port 22138: Operation timed out
```
...
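When it times out like this, it can help to first confirm the exposed TCP port is reachable at all and then rerun ssh with verbose output to see where it stalls:
```
# Is the pod's exposed TCP port reachable from your machine?
nc -zv 194.68.245.27 22138

# Verbose ssh shows exactly where the connection attempt gets stuck
ssh -v -p 22138 -i ~/.ssh/id_ed25519 root@194.68.245.27
```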

Python modules missing when pod is starting

When starting a comfy-ui pod after some downtime, I get a lot of messages of the kind:
```
Import times for custom nodes:
0.0 seconds: /workspace/ComfyUI/custom_nodes/websocket_image_save.py
```
...
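A common cause is that packages installed outside /workspace are lost when the pod is reset, since only the volume persists; one workaround (a sketch) is to keep a virtual environment on the volume and install into that:
```
# Keep the Python environment on the persistent volume so installs survive resets
python -m venv /workspace/venv
source /workspace/venv/bin/activate

# Reinstall ComfyUI's requirements into that environment
pip install -r /workspace/ComfyUI/requirements.txt
```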

Unable to connect to Pod

Since last Friday I have been unable to connect to my pod. It worked fine on Thursday, and now whenever I send the following command it returns {"detail":"Not Found"}:
```
curl https://6ppno5hzfbrl76-8000.proxy.runpod.net/v1/model
```
Am I missing something? I even get this error when launching the web terminal - is my model not loaded? I used a pre-built template that should download the model from HuggingFace...
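From the pod's web terminal, one way to narrow it down is to check whether anything is listening on the port at all and to hit the server locally, bypassing the proxy; /v1/models is the usual OpenAI-compatible route and is an assumption here:
```
# Is anything listening on port 8000 inside the pod?
ss -ltnp | grep 8000 || echo "nothing listening on port 8000"

# Query the server locally, bypassing the RunPod proxy entirely
curl http://localhost:8000/v1/models
```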

I am having trouble finding the location of the model file when trying to use ComfyUI.

I have edited the 'extra_model_paths.yaml' file, but I still can't seem to find it.
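To find where the model files actually ended up before touching extra_model_paths.yaml again, a plain filesystem search is often quickest (a sketch; adjust the extensions to your model format):
```
# Search the persistent volume for model weights by common extensions
find /workspace -type f \( -name '*.safetensors' -o -name '*.ckpt' \) 2>/dev/null
```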

Turn on Confidential Computing

Hi! I created a pod using an H100 and tried to do some tests with Confidential Computing, but it turns out CC is in fact disabled. Is it possible to turn it on? It's absolutely necessary for me to have CC on. Here is documentation on how to turn CC on: https://docs.nvidia.com/confidential-computing-deployment-guide.pdf...

"SSH Public Keys" in account settings are completely ignored

Hello, I am trying to access the env variables, as well as the standard PUBLIC_KEY variable that I specify for my pod, from my Python app. However, they are only set when I connect over SSH via the proxy server. The proxy is extremely slow and does not allow scp to be run through it. When I try to connect directly (via the public IP), ~/.ssh/authorized_keys is not configured at all with the public key I set in the settings. The env vars that I pass during pod creation are also missing.

Two problems:
- Why isn't the ~/.ssh/authorized_keys file created and populated with my public key from the account settings?
- Why are the env variables missing when I connect directly via the public IP to my instance?

I assume the proxy has some .bashrc which is activated when I connect through it, but why aren't the env vars set with the -e parameter in the docker run command for the pod?...
Solution:
In your running pod, xargs -0 -L1 -a /proc/1/environ will list the environment variables given to the process that is launched on container start. If a PUBLIC_KEY was passed to your pod, it will be there. If that process is a bash shell and doesn't export those variables when starting other processes, it will be the only process that knows about your PUBLIC_KEY.
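A sketch of that diagnosis plus a manual fix: if the key really is present on PID 1, it can be pulled out of /proc/1/environ and installed by hand so direct (public-IP) SSH works:
```
# List the environment received by PID 1, the container's start process
xargs -0 -L1 -a /proc/1/environ

# Extract PUBLIC_KEY and install it for direct SSH over the public IP
PUBLIC_KEY=$(xargs -0 -L1 -a /proc/1/environ | grep '^PUBLIC_KEY=' | cut -d= -f2-)
mkdir -p ~/.ssh
echo "$PUBLIC_KEY" >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
```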

Is there an instance type that cannot be taken from you even if you stop the pod?

I'd like to have the comfort of knowing I can spin up whenever I want to, without worrying whether my GPU has been taken from me while I wasn't using it.
Solution:
No, you would either need to run the pod 24/7 or use network storage.

Kill a pod from the inside?

Last weekend I started a community pod for a large workload and went to bed once it confirmed it was starting the work properly. Unfortunately the pod was on a very slow connection to my cloud storage, so it spent about 14 of its 16 hours of run time just downloading the job files… I've only just realised this after noticing how much faster things went on other runs and analysing my cloud egress logs. I've rewritten my code to report current download speeds so I can kill pods by hand, but is there any way to do it from a running Python app? Ideally, if it detected slow disk or downloads it would kill itself, so that at least I'd know. My alternative is to have it send me a Discord message, but that's not as useful!...
Solution:
Thanks, I can see that, combined with some pre-set environment variables (https://docs.runpod.io/pods/references/environment-variables), the podStop command is what I'd need - I didn't realise that the pod knew who it was (so to speak).
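A minimal sketch of that, assuming runpodctl is available inside the pod and the RUNPOD_POD_ID variable from the environment-variables reference above is set:
```
# Stop this pod from inside itself, using its own pod ID from the environment
runpodctl stop pod "$RUNPOD_POD_ID"

# Or remove the pod entirely instead of just stopping it
runpodctl remove pod "$RUNPOD_POD_ID"
```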

Performance A100-SXM4-40GB vs A100-SXM4-80GB

Hello! I have one GPU: NVIDIA A100-SXM4-40GB on Google Colab Pro. I have one GPU: NVIDIA A100-SXM4-80GB on RunPod. My notebook successfully fine-tunes Whisper-Small on Google Colab (40GB) with batch size 32....
Solution:
It could be different things, like the CUDA version, Python version, etc.
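A quick way to compare the two environments before blaming the hardware is to run the same version checks on Colab and on the pod (a sketch):
```
# Software versions that most often explain speed differences between setups
python --version
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
nvidia-smi --query-gpu=name,driver_version --format=csv
```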

API problem

After having been away from RunPod for a couple of weeks, I am greeted by a fancy new GUI! But also a problem (see attached). I tried several different community pods but get the same result. It seems to happen as soon as I try to change the template. Please advise...
Solution:
It's a false positive.

Why are there no indicators for file transfer operations? Am I supposed to guess when they're done?

Why are there no indicators for file transfer operations? Am I supposed to guess when they're done? No file transfer indicators, no zip extraction progress reporting, nothing. Am I supposed to just magically guess when file operations on RunPod are done?!
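Until the UI shows progress, it can be tracked from the terminal; a sketch using rsync for copies and a simple size poll for long extractions (the paths are placeholders):
```
# Copy with a live progress readout instead of a silent cp
rsync -ah --info=progress2 /workspace/source/ /workspace/destination/

# For an extraction you can't instrument, watch the output directory grow
watch -n 5 du -sh /workspace/destination
```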

Data didn't persist

Which folder should I put my files in if I want them to persist across deployments? I started my pod from Storage > Select existing disk > Deploy and left my files in /workspace. Now that folder is not listed over SSH (dir), or with VS Code...
Solution:
Oh, it's /workspace for pods.

Tailscale on Pod

Hello, all. I need to set up the Tailscale VPN in a pod in order to allow access to our DB. The issue is that /dev/net/tun is not available, and using a SOCKS5 proxy as described in this article (https://tailscale.com/kb/1112/userspace-networking) is not an option for us. Are there any recommendations for how I can run Tailscale?...