RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


GPU seems to have stopped... logs don't show any errors, but there is no activity

The pod ID is sluqyzp1j6z48n. My network volume is attached to this location, but I keep having issues with the A6000....

Migrate pod volume to Network volume

Hello, I'd like to create a network volume to avoid the unavailable GPUs issue but I already installed some stuff in my current pod. Is there a migration option somewhere?

Unable to modify owner of network volume

Hey all, I'm attempting to create a network volume and mount it to /home inside the pod, attempting to create a user home dir. However, I am unable to change the owner away from root.

Can't run extensions in stable diffusion

For the last ten hours I have been trying to use any extension in the official Stable Diffusion pod, and I can't. They don't show up as tabs, but I do see them in the list of extensions 😦 Any help? 😩

Cuda not connecting to image provisioned for GPU

Started a community pod with 1 GPU (4090) using the RunPod PyTorch image/template (runpod/pytorch:2.4.0-py3.11-cuda12.4). Immediately after starting the pod, the GPU is unavailable even though nvidia-smi seems to see it. This happens about 20% of the time I start pods with this official container. No errors are thrown in the system or container logs.

root@5c367a0d4ea2:/# python -c "import torch; print(torch.cuda.is_available())"
/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0...
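When `torch.cuda.is_available()` is False even though nvidia-smi works, a quick stdlib check of the container's NVIDIA device nodes and runtime environment variables can help narrow down whether the container runtime passed the GPU through at all. This is a diagnostic sketch, not a RunPod tool; the key names in the returned dict are just illustrative.

```python
import os
import glob

def cuda_device_hints():
    """Collect container-side hints for why CUDA initialization might fail.

    Pure-stdlib sketch: checks for the NVIDIA device nodes the driver
    needs and the env vars the NVIDIA container runtime normally injects.
    """
    return {
        # /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm etc. must exist
        # inside the container for driver initialization to succeed
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # injected by the NVIDIA container runtime when a GPU is attached
        "visible_devices": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        # restricts which attached GPUs CUDA applications may use
        "cuda_visible": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }

print(cuda_device_hints())
```

An empty `device_nodes` list with nvidia-smi still working would point at a container runtime problem rather than a PyTorch one, which fits the intermittent 20% failure rate described above.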

Requests using RUNPOD_API_KEY fail with 403 unauthorized.

Hello, I'm experimenting with using RunPod for running a bunch of one-off jobs. According to the [pods environment variables] page, RUNPOD_API_KEY is an API key for making API calls scoped to the specific job. Basically, I want to terminate (or at least shut down) the pod once it is done with its task. However, when I make a call to the REST API, I get 403 Forbidden and an empty response body....

run commands remotely on my pod

Hi, I've been trying for an hour to run a bash command on my pod via Python. Nothing seems to work. I tried fabric and paramiko. runpodctl has this command:
$ runpodctl exec python /ru.py --pod_id <redacted>
Running remote Python shell...
Waiting for Pod to come online...
$ runpodctl exec python /ru.py --pod_id <redacted>
Running remote Python shell...
Waiting for Pod to come online...
But it just hangs there and does nothing...
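When runpodctl exec hangs, falling back to the pod's exposed SSH service from a plain subprocess call is one workaround. The host and port below are placeholders; the real values come from the pod's Connect panel (the public TCP mapping for sshd).

```python
def build_ssh_command(host: str, port: int, user: str = "root",
                      remote_cmd: str = "python /ru.py") -> list[str]:
    """Build a plain `ssh` invocation that runs a command on the pod.

    host/port are placeholders for the pod's public SSH mapping shown
    in the Connect panel, not real connection details.
    """
    return [
        "ssh",
        "-p", str(port),  # mapped public port for the pod's sshd
        "-o", "StrictHostKeyChecking=accept-new",
        f"{user}@{host}",
        remote_cmd,
    ]

cmd = build_ssh_command("203.0.113.7", 22042)
print(" ".join(cmd))
# subprocess.run(cmd) would execute it; left out here since the host is a placeholder
```

This assumes the pod image runs an SSH server and that a public key was added to the account, which the official RunPod templates support.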

Flux Gym

Hi, I'm running the FLUXGym template and all seems to be working fine but I realized there is no way to access the directory structure, just the web interface. How do I get to the files to do some maintenance of the volume?

Http bad gateway error

I'm getting this error when I click the HTTP service. Any clues as to why I have this error??...

LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log

I have been trying to keep my LLM finetuning process alive, unsuccessfully. I am using 4 V200 GPUs with PyTorch FSDP. The process tends to crash when saving checkpoints, BUT not always. I removed the checkpoints and now it's crashing in the middle of the training loop, somewhat randomly. This is what's in my nohup.out: {'loss': 0.1151, 'grad_norm': 2.4503021240234375, 'learning_rate': 5.616492701703402e-07, 'mean_token_accuracy': 0.9721812009811401, 'epoch': 3.42}...

2 GPUs but only one works

I have 2 x RTX A5000. I run a notebook and get a not-enough-memory error. When I go to the dashboard, only one GPU is used. How can I use both in the same notebook?
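A single notebook process only uses the GPUs it is explicitly told to spread work across; in PyTorch the usual fix is wrapping the model in `torch.nn.DataParallel(model)` (or DistributedDataParallel), which splits each batch between the devices. As a pure-Python sketch of that split arithmetic (no torch required, illustrative only):

```python
def split_batch(batch_size: int, num_gpus: int) -> list[int]:
    """Per-GPU chunk sizes the way data parallelism divides a batch:
    as evenly as possible, with earlier devices taking the remainder."""
    base, rem = divmod(batch_size, num_gpus)
    return [base + (1 if i < rem else 0) for i in range(num_gpus)]

print(split_batch(64, 2))  # -> [32, 32]
print(split_batch(7, 2))   # -> [4, 3]
```

The practical consequence: without such a wrapper, the whole batch lands on GPU 0, which matches the dashboard showing only one GPU in use and explains the out-of-memory error.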

deploy fail, can't get template, networking, could not resolve host github.com

I am new to RunPod. I had been using the same template successfully for two days. I was having OOM errors, so I went for a bigger machine in the same data center: I changed from an L40S to an H100 SXM, same data center (TX3). I also changed from a persistent network volume to temporary storage, since I always ended up having to recreate everything anyway. With the L40S I never had to do anything to set up networking. Anybody know why the bigger machine would be giving me a problem getting my template from GitHub? Like I sai...

I cannot do training, I got an out-of-memory error

Hello there, I have rented a pod with an H100 SXM GPU and 251 GB of RAM. I tried to train my model on images and their masks, but unfortunately it returns an out-of-memory error. Please help, I am very confused...

How to self terminate pod on crash

I’d like my script in my Docker image to auto-terminate its own RunPod instance if the script crashes out. Presumably this could be set up easily with a batch file: ‘Run script.py, run uploadLog.py...
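The two-step idea above (run the script, then upload logs / clean up on failure) can be sketched as a small Python wrapper. The terminate command shown is a stand-in: `runpodctl` does ship pod management subcommands, but the exact invocation should be checked against `runpodctl help` before relying on it.

```python
import subprocess
import sys

def run_and_cleanup(cmd: list[str], terminate_cmd: list[str]) -> int:
    """Run the workload command; if it exits nonzero, fire the
    terminate command (e.g. something like
    ["runpodctl", "remove", "pod", pod_id] -- verify the subcommand).
    Returns the workload's exit code either way."""
    rc = subprocess.run(cmd).returncode
    if rc != 0:
        # workload crashed: upload logs / terminate the instance here
        subprocess.run(terminate_cmd)
    return rc

# Demo with harmless stand-ins instead of script.py and runpodctl:
rc = run_and_cleanup(
    [sys.executable, "-c", "raise SystemExit(3)"],        # stand-in for script.py
    [sys.executable, "-c", "print('would terminate')"],   # stand-in for runpodctl
)
print(rc)  # -> 3
```

Running this as the container's entrypoint command keeps the crash handling inside the image, with no .bat file needed on a Linux pod.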

API endpoint

Do we get an API endpoint in a pod? I need to get an API endpoint so that I can host it in Streamlit.

Having to re-download all models

Hi, I’ve been using the Local Lab HunyuanVideo ComfyUI pod template, and every time I start it I have to download all the models again. I’m using a storage drive, and I thought that by doing so I wouldn’t have to download everything again. What could be the problem?

Error while deserializing header: HeaderTooSmall

Any idea what could be the cause? hearmeman/comfyui-flux-pulid:v2...

Trouble training sdxl lora with kohya

It seems as if it's getting stuck in the process. Anyone else having the same issues?

vLLM and multiple GPUs

Hi, I am trying to deploy a 3B LLM on RunPod with vLLM. I have tried different configurations (4 x L4, 2 x L40, etc.), but in all of them I get a CUDA memory error, as if the GPUs are not sharing memory. I have tried pipeline-parallel-size and tensor-parallel-size but I still get the same error.
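For context on what tensor parallelism actually buys here: vLLM's `tensor_parallel_size` shards the weights across GPUs, so per-GPU weight memory drops roughly by the TP factor, but every GPU still needs headroom for KV cache, activations, and CUDA context. A rough arithmetic sketch (fp16, illustrative numbers only):

```python
def per_gpu_weight_gb(params_billion: float, tp_size: int,
                      bytes_per_param: int = 2) -> float:
    """Approximate per-GPU memory for model weights under tensor
    parallelism (fp16 = 2 bytes/param). Ignores KV cache, activations,
    and CUDA context overhead, which vLLM also needs room for."""
    total_gb = params_billion * 1e9 * bytes_per_param / 1024**3
    return total_gb / tp_size

print(round(per_gpu_weight_gb(3, 2), 2))  # -> 2.79 GB/GPU for a 3B model on 2 GPUs
```

Since a 3B model's weights are only a few GB per GPU even without sharding, an OOM on 2 x L40 suggests the error is coming from KV-cache allocation or a too-large `max_model_len` rather than from GPUs "not sharing memory".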

You must remove this network volume from all pods before deleting it.

Why can't I delete my storage? It says "You must remove this network volume from all pods before deleting it.", but I don't have any pods running or any serverless endpoints running...