Runpod16mo ago
fireice

Why "CUDA out of memory" Today ? Same image to generate portrait, yesterday is ok , today in not.

"delayTime": 133684, "error": "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; 18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF", "executionTime": 45263, "id": "ae1e4066-e2b7-43c1-8f37-3525bda03893-e1",
32 Replies
Marcus
Marcus16mo ago
Ask the developer of the application; it has nothing to do with RunPod.
Unknown User
Unknown User16mo ago
Message Not Public
fireice
fireiceOP16mo ago
I am the developer. When I use my AI app, I get CUDA out of memory. I haven't changed anything in the app.
Marcus
Marcus16mo ago
Then it needs a larger GPU, as nerdy said.
Encyrption
Encyrption16mo ago
It looks like you are trying to use a 24GB GPU when you need more VRAM. Try to run it on a 48GB GPU. If that is still not enough, then try to run it on an 80GB GPU.
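A quick way to confirm how much VRAM a worker actually has before moving to a bigger tier is to query the device from PyTorch. A minimal sketch, assuming PyTorch is already installed in the worker image:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total")
    print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
```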
fireice
fireiceOP16mo ago
OK, I see, I will test.
echoSplice
echoSplice16mo ago
I have exactly the same problem. We have changed nothing in our setup. Just today most image generation fails
(image attachment: no description)
echoSplice
echoSplice16mo ago
I have a second serverless endpoint running that uses the same template. That one is running fine.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
I have just realised this only happened on one specific worker: m07jdb658oetph. That's why not all of the generations failed and my other endpoint runs fine.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
I have not switched it back on. But I can give you the logs from the weekend when it happened
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
Sorry, not sure how I would get a stack trace. I just downloaded the logs directly from RunPod. This? { "endpointId": "6oe3safoiwidj3", "workerId": "m07jdb658oetph", "level": "info", "message": "Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.", "dt": "2024-08-03 18:27:11.64919904" }
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
we run stable diffusion with automatic1111
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
yes
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
We don't have that functionality in our code; they should all load the very same way. This specific worker had a 100% fail rate, though.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
I'll send you the start.sh and handler script.
Unknown User
Unknown User16mo ago
Message Not Public
echoSplice
echoSplice16mo ago
How? I don't see a file option in DM.
Unknown User
Unknown User16mo ago
Message Not Public
Marcus
Marcus16mo ago
It's an OOM issue. Why are you using SDP attention and not xformers?
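For reference, switching AUTOMATIC1111 from SDP attention to xformers is done via its launch flags. A minimal sketch of a Python launcher, assuming xformers is installed in the image and the webui is started through launch.py; the extra flags shown are common serverless choices, not ones confirmed in this thread:

```python
import subprocess

cmd = [
    "python", "launch.py",
    "--xformers",          # memory-efficient attention instead of --opt-sdp-attention
    "--api", "--nowebui",  # API-only mode, typical for a serverless worker (assumption)
    "--medvram",           # optional: trades speed for lower VRAM if OOM persists
]
subprocess.run(cmd, check=True)
```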
Unknown User
Unknown User16mo ago
Message Not Public
Marcus
Marcus16mo ago
yes
echoSplice
echoSplice16mo ago
I'll try this in a new deployment. I just thought it was odd that only this one worker failed.
Marcus
Marcus16mo ago
A1111 can fail intermittently with OOM errors depending on the request. I experienced random/intermittent OOM and had to upgrade from the 24GB to the 48GB GPU tier.
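When OOM is request-dependent like this, it can also help to catch it in the serverless handler so one bad request doesn't take the whole worker down. A minimal sketch using the RunPod Python SDK; generate_image is a hypothetical placeholder for the actual A1111 call:

```python
import runpod
import torch

def generate_image(job_input):
    # Hypothetical placeholder for the real A1111/API call.
    raise NotImplementedError

def handler(job):
    try:
        return {"image": generate_image(job["input"])}
    except torch.cuda.OutOfMemoryError as exc:
        torch.cuda.empty_cache()  # release cached blocks so the next request can retry
        return {"error": f"CUDA OOM on this request: {exc}"}

runpod.serverless.start({"handler": handler})
```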
echoSplice
echoSplice16mo ago
Thanks for that tip.
