Why "CUDA out of memory" Today ? Same image to generate portrait, yesterday is ok , today in not.
"delayTime": 133684,
"error": "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; 18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
"executionTime": 45263,
"id": "ae1e4066-e2b7-43c1-8f37-3525bda03893-e1",
Ask the developer of the application; it has nothing to do with RunPod.
Unknown User•16mo ago: Message Not Public
I am the developer. When I use my AI app, I get CUDA out of memory. I changed nothing in the app.
Then it needs a larger GPU as nerdy said.
It looks like you are trying to use a 24GB GPU when you need more VRAM. Try running it on a 48GB GPU; if that is still not enough, try an 80GB GPU.
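Before moving to a bigger GPU tier, it can be worth confirming how much VRAM a single generation actually needs. A small sketch using standard torch.cuda calls (the tag strings and where you call it are illustrative):

```python
# Sketch: log how much VRAM is free, allocated, and reserved around the point
# where generation runs, to see whether 24 GB is genuinely too small.
import torch

def log_vram(tag: str) -> None:
    free_b, total_b = torch.cuda.mem_get_info()        # bytes free/total on the current device
    allocated = torch.cuda.memory_allocated() / 2**30  # GiB held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # GiB cached by the allocator
    print(f"[{tag}] free={free_b / 2**30:.2f} GiB "
          f"total={total_b / 2**30:.2f} GiB "
          f"allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")

log_vram("before generation")
# ... run the image generation here ...
log_vram("after generation")
```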
OK, I see, I will test.
I have exactly the same problem. We have changed nothing in our setup. Just today, most image generations fail.

I have a second serverless endpoint running that uses the same template. That one is running fine.
Unknown User•16mo ago: Message Not Public
I have just realised this only happened on one specific worker:
m07jdb658oetph
That's why not all of the generations failed and my other endpoint runs fine.
Unknown User•16mo ago: Message Not Public
I have not switched it back on, but I can give you the logs from the weekend when it happened.
Unknown User•16mo ago: Message Not Public
Sorry, I'm not sure how I would get a stack trace. I just downloaded the logs directly from RunPod.
This?
{
  "endpointId": "6oe3safoiwidj3",
  "workerId": "m07jdb658oetph",
  "level": "info",
  "message": "Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.",
  "dt": "2024-08-03 18:27:11.64919904"
}
Unknown User•16mo ago: Message Not Public
We run Stable Diffusion with AUTOMATIC1111.
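For context, a serverless A1111 endpoint like this usually pairs a start script that launches the web UI with a small Python handler that forwards jobs to its local API. A minimal sketch, assuming the runpod Python SDK and A1111 started with --api on port 3000; the port, timeout, and payload fields are illustrative and not the thread's actual code:

```python
# Minimal sketch of a serverless handler in front of A1111 (illustrative only;
# the port, timeout, and payload fields are assumptions, not this endpoint's code).
import requests
import runpod

A1111_URL = "http://127.0.0.1:3000"  # assumes A1111 launched with --api on port 3000

def handler(event):
    job_input = event["input"]
    payload = {
        "prompt": job_input.get("prompt", ""),
        "steps": job_input.get("steps", 20),
        "width": job_input.get("width", 512),
        "height": job_input.get("height", 512),
    }
    resp = requests.post(f"{A1111_URL}/sdapi/v1/txt2img", json=payload, timeout=600)
    resp.raise_for_status()
    return {"images": resp.json()["images"]}  # base64-encoded PNGs

runpod.serverless.start({"handler": handler})
```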
Unknown User•16mo ago: Message Not Public
yes
Unknown User•16mo ago: Message Not Public
We don't have that functionality in our code.
They should all load in exactly the same way.
This specific worker had a 100% fail rate though.
Unknown User•16mo ago: Message Not Public
I'll send you the start.sh and handler script.
Unknown User•16mo ago: Message Not Public
How? I don't see a file option in the DM.
Unknown User•16mo ago: Message Not Public
It's an OOM issue. Why are you using SDP attention and not xformers?
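For reference, switching attention backends in A1111 is a launch-flag change. The flags below are A1111's documented command-line options; the wrapper script itself is only an illustration, not the thread's start.sh, and it assumes it runs from the A1111 repo root:

```python
# Illustrative launcher (not the thread's start.sh): switch A1111 from SDP
# attention to xformers' memory-efficient attention to lower peak VRAM.
import subprocess

COMMANDLINE_ARGS = [
    "--api",                  # expose the HTTP API the handler calls
    "--xformers",             # memory-efficient attention via the xformers package
    # "--opt-sdp-attention",  # the PyTorch SDP path this endpoint currently uses
    # "--medvram",            # optional: trades speed for lower VRAM use
]

subprocess.run(["python", "launch.py", *COMMANDLINE_ARGS], check=True)
```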
Unknown User•16mo ago: Message Not Public
yes
I'll try this in a new deployment. I just thought it was odd that only this one worker failed.
A1111 can fail intermittently with OOM errors depending on the request. I experienced random/intermittent OOM and had to upgrade from the 24GB to the 48GB GPU tier.
Thanks for that tip.
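One general note on intermittent OOM on a single worker: peak VRAM can creep up across requests if nothing releases the allocator's cache between jobs. A hedged sketch of the usual cleanup, using standard PyTorch calls rather than anything taken from this endpoint's code:

```python
# Sketch: release cached VRAM between requests so one large job does not
# starve the next one on the same worker (general PyTorch hygiene, not a
# fix specific to this endpoint).
import gc
import torch

def cleanup_after_job() -> None:
    gc.collect()              # drop Python references to finished tensors
    torch.cuda.empty_cache()  # return cached blocks to the driver
    torch.cuda.ipc_collect()  # reclaim memory from dead CUDA IPC handles

# Example placement inside a handler:
# result = run_generation(payload)
# cleanup_after_job()
```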