Runpod16mo ago
fireice

Why "CUDA out of memory" Today ? Same image to generate portrait, yesterday is ok , today in not.

"delayTime": 133684, "error": "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; 18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF", "executionTime": 45263, "id": "ae1e4066-e2b7-43c1-8f37-3525bda03893-e1",
32 Replies
Marcus
Marcus16mo ago
Ask the developer of the application; it has nothing to do with RunPod.
Unknown User
Unknown User16mo ago
Message Not Public
fireice
fireiceOP16mo ago
I am the developer. When I use my AI app, I get CUDA out of memory. I haven't changed anything in the app.
Marcus
Marcus16mo ago
Then it needs a larger GPU, as nerdy said.
Encyrption
Encyrption16mo ago
It looks like you are trying to use a 24GB GPU when you need more VRAM. Try to run it on a 48GB GPU. If that is still not enough, then try to run it on an 80GB GPU.
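A quick way to confirm how much VRAM a worker actually has before moving to a bigger tier is to query the device from PyTorch. A minimal sketch, assuming PyTorch is already installed in the worker image:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total")
    print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
```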
fireice
fireiceOP16mo ago
OK, I see, I will test.
echoSplice
echoSplice16mo ago
I have exactly the same problem. We have changed nothing in our setup. Just today most image generation fails
(image attachment: no description)
echoSplice
echoSplice16mo ago
I have a second serverless endpoint running that uses the same template. That one is running fine.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
I have just realised this only happened on one specific worker: m07jdb658oetph. That's why not all of the generations failed and my other endpoint runs fine.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
I have not switched it back on. But I can give you the logs from the weekend when it happened
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
Sorry, not sure how I would get a stack trace. I just downloaded the logs directly from RunPod. This? { "endpointId": "6oe3safoiwidj3", "workerId": "m07jdb658oetph", "level": "info", "message": "Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.", "dt": "2024-08-03 18:27:11.64919904" }
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
we run stable diffusion with automatic1111
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
yes
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
We don't have that functionality in our code; they should all load the very same way. This specific worker had a 100% fail rate, though.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
I'll send you the start.sh and handler script.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
How? I don't see a file option in DM.
Unknown User
Unknown User16mo ago
Message Not Public
Marcus
Marcus16mo ago
It's an OOM issue. Why are you using SDP attention and not xformers?
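For reference, switching AUTOMATIC1111 from SDP attention to xformers is done via its launch flags. A minimal sketch of a Python launcher, assuming xformers is installed in the image and the webui is started through launch.py; the extra flags shown are common serverless choices, not ones confirmed in this thread:

```python
import subprocess

cmd = [
    "python", "launch.py",
    "--xformers",          # memory-efficient attention instead of --opt-sdp-attention
    "--api", "--nowebui",  # API-only mode, typical for a serverless worker (assumption)
    "--medvram",           # optional: trades speed for lower VRAM if OOM persists
]
subprocess.run(cmd, check=True)
```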
Unknown User
Unknown User16mo ago
Message Not Public
Marcus
Marcus16mo ago
yes
echoSplice
echoSplice16mo ago
I'll try this in a new deployment. I just thought it was odd that only this one worker failed.
Marcus
Marcus16mo ago
A1111 can fail intermittently with OOM errors depending on the request. I experienced random/intermittent OOM and had to upgrade from the 24GB to the 48GB GPU tier.
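When OOM is request-dependent like this, it can also help to catch it in the serverless handler so one bad request doesn't take the whole worker down. A minimal sketch using the RunPod Python SDK; generate_image is a hypothetical placeholder for the actual A1111 call:

```python
import runpod
import torch

def generate_image(job_input):
    # Hypothetical placeholder for the real A1111/API call.
    raise NotImplementedError

def handler(job):
    try:
        return {"image": generate_image(job["input"])}
    except torch.cuda.OutOfMemoryError as exc:
        torch.cuda.empty_cache()  # release cached blocks so the next request can retry
        return {"error": f"CUDA OOM on this request: {exc}"}

runpod.serverless.start({"handler": handler})
```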
echoSplice
echoSplice16mo ago
Thanks for that tip.
