ERR_NVGPUCTRPERM when profiling CUDA kernels
I'm trying to profile CUDA kernels with NCU and I ran into this error, apparently due to a lack of permission:
"ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM"
The linked page says that when profiling kernels in containers (which is the case here with pods, right?), one has to launch the container with --cap-add=SYS_ADMIN, but I'm not sure this is possible with Runpod pods.
Have you found a workaround? Surely there is a way to profile kernels on container GPUs?...
SSH missing password
Hi, I am having trouble setting up my SSH connection; I always get asked for a password.
Solution:
Found the issue: when creating the key, the email has to be root@<ip>
Real Performance Comparison: H100 vs RTX 6000 Ada
Hi,
I’m somewhat confused, or perhaps misunderstanding something, about the performance of the H100 and RTX 6000 Ada GPUs during model training.
Lately, I’ve been working with both GPUs to train a model using 9 GB of training data and 8 GB of testing data. The model has 2.6M parameters....
Auto-exit on finish?
Hi all,
Is there a way to automagically close a pod after some code has finished running?
Thanks!...
I can't deploy a pod, it won't recognize my keys!
I have been at this all day. I got it to deploy once yesterday. Please help me.

Critical error suggests copying data from pod, but can't log onto pod
Hi!
My pod is displaying the following warning in the pods page:
We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime.
...
Someone is using my CUDA Memory?
Hi people, I get an error when I try to train my model:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.06 GiB. GPU 0 has a total capacty of 19.67 GiB of which 30.31 MiB is free. Process 2169311 has 19.64 GiB memory in use. Of the allocated memory 19.43 GiB is allocated by PyTorch, and 22.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I do not really see how this can happen: I just booted the server and the task is relatively simple, so how is my memory fully used?...
Zero GPU Available Issue
I am using Community Cloud, and now there are zero GPUs available for my pod. What are the chances that I get an available GPU if I wait? Or is there any other solution?
Problem With 0 GPU's
I had my pod fully configured, and now I can no longer start it with the GPU. It shows 0 GPUs available, even though I have always used one. I need help.
runpod id : 92li1oriuhdngp...
Unable to boot mi300x
Getting the following error:
error starting container: Error response from daemon: error gathering device information while adding custom device "/dev/dri/renderD136": no such file or directory
Pod ID: kptjoa8hkns744...
Logging in / Starting Pods problems
Logging in takes minutes of staring at the purple loading wheel.
When I do get in, I see and do the following: I see available pods. I select the network volume I want to use, the pods get filtered, and I still see available pods. I pick one, select the SDXL template and start it up. I get a "Failed to Fetch" error, after which every single pod (all filters set to any) is unavailable.
If I refresh the page, I'm looking at minutes of purple loading wheel again....
Mi300x HIP error: no ROCm-capable device is detected
I'm using the Mi300x and getting a
RuntimeError: HIP error: no ROCm-capable device is detected using RunPod Pytorch 2.4.0 ROCm 6.1 template, how can I resolve this?
Need Help Deploying Stable Diffusion on RunPod
Hi everyone, I’m having trouble setting up Stable Diffusion on RunPod. Here are the main issues I’ve encountered:
1. Model Download Problems:
Some models fail to download properly or give errors like:
“File Load Error: not UTF-8 encoded.”
2. Model Loading Issues:...
Best way to run offline
Hi Runpod community,
What is the best way to keep pods running while I am not connected to the internet?
Thanks!...
Solution:
Use tmux or any other terminal multiplexer.
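A minimal sketch of the tmux approach, in case it helps: the job is started inside a detached session, so it keeps running on the pod after the SSH connection drops. The session name `train` and the command `python train.py` are placeholders I made up for the example, not something from the thread.

```shell
# Placeholder session name; pick anything you like.
SESSION="train"

# Create a detached tmux session. It lives on the pod, not in your SSH connection.
tmux new-session -d -s "$SESSION"

# Type the job into that session (placeholder command) and press Enter.
tmux send-keys -t "$SESSION" 'python train.py' Enter

# Confirm the session exists.
tmux ls

# Later, from a new SSH connection, reattach to see the output:
#   tmux attach -t train
# Detach again with Ctrl-b followed by d.
```

Note that this only keeps the process alive across disconnects; the pod itself still has to stay running.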
CPUs not available
I am running AutoML on a 4090 pod. It worked last night, but now I get the error in the attached photo. The weird thing is that this persists even when I check the CPU count with multiprocessing.cpu_count() and try to run AutoML (autogluon) with only that many CPUs.

Can't connect via ssh: Runpod asking for password
When I try to connect via
ssh root@213.173.110.198 -p 17455 -i ~/.ssh/id_ed25519 I get asked for a password. Following the support page I tried generating a new key, but didn't have any luck.
Specifically, here's what I tried:
Create a new key:
ssh-keygen -t ed25519 -C "myemail@gmail.com"...
Solution:
Resolved! The issue is that I had to deploy my pod AFTER adding the new key.
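For anyone hitting the same password prompt, the order of operations from the solution can be sketched roughly as below. The key path is a hypothetical name chosen for this example, the empty passphrase is only there to keep the sketch non-interactive, and `<pod-ip>` and `<port>` are placeholders to be taken from the newly deployed pod.

```shell
# Hypothetical key path for this example; override KEY to use your own.
KEY="${KEY:-$HOME/.ssh/id_ed25519_runpod}"
mkdir -p "$(dirname "$KEY")"

# 1. Generate the key pair only if it does not exist yet.
#    -N "" sets an empty passphrase (for brevity; use a real one if you prefer).
[ -f "$KEY" ] || ssh-keygen -q -t ed25519 -C "myemail@gmail.com" -f "$KEY" -N ""

# 2. Print the public key and add it to the SSH keys in your account settings.
cat "$KEY.pub"

# 3. Deploy the pod only AFTER the key has been added, then connect:
#   ssh root@<pod-ip> -p <port> -i "$KEY"
```

A pod that was already deployed before step 2 kept asking for a password in this thread, which is why the deploy comes last here.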
Securing a POD with an API key
Are there any good resources or tutorials for walking a relative beginner through securing HTTPS API endpoints (port 11434) with an API key?
I have a pod running Ollama and serving API requests on port 11434. It is currently open, and anyone with access to the URL can use it. I haven't seen any malicious use, but I would like to secure it by requiring an API key to access the endpoint.
Thanks,...
Pod eternal image fetching
yf6hnl4zdwmvem - it's been fetching an image for 15 minutes
US-TX-3 - the region of the problematic host
Logs are just like:
```...
mi300x are unavailable
I've been using multiple mi300x on a single host for a while, and all of a sudden the resources have become unavailable. Did something happen, and will more resources become available?
