Runpod · 10mo ago
stevex

I'm seeing 93% GPU Memory Used even in a freshly restarted pod.

Not sure what to do about this. nvidia-smi shows there are no processes running, but when I try to run a job it shows "Process 1726743 has 42.25 GiB memory in use". How do I find and kill that?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 44.52 GiB of which 18.44 MiB is free. Process 1726743 has 42.25 GiB memory in use. Process 3814980 has 2.23 GiB memory in use. Of the allocated memory 1.77 GiB is allocated by PyTorch, and 53.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
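
(For anyone hitting the same wall, a minimal set of checks, assuming nvidia-smi and fuser are available in the pod image; on a shared GPU the PID the driver reports may belong to another container's namespace, so it won't show up locally:)

nvidia-smi --query-compute-apps=pid,used_memory --format=csv   # PIDs the driver says are holding GPU memory
fuser -v /dev/nvidia* 2>/dev/null                              # local processes that have the GPU device open
ps -ef | grep 1726743 | grep -v grep                           # does the reported PID exist inside this container?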
3 Replies
Unknown User · 10mo ago
[Message not public; sign in and join the server to view.]
stevex (OP) · 10mo ago
I tried most of that... the process ID it quoted doesn't show up in ps -ef (and the number is a bit unusual). If there were a process holding onto memory, restarting the pod should have cleared it anyway.
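
(An illustrative sketch of that check, assuming a standard /proc mount inside the pod; a very large PID with no local /proc entry usually means the process lives in another namespace, i.e. outside this container:)

ps -ef | grep 1726743 | grep -v grep || echo "not in this container's process table"
test -d /proc/1726743 && echo "/proc entry exists" || echo "no /proc/1726743 here"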
Unknown User · 10mo ago
[Message not public; sign in and join the server to view.]