Runpod•7mo ago
dbtr

Serverless endpoint fails with Out Of Memory despite no changes

For several months I have been using the same endpoint code to generate 512x512 Stable Diffusion 1.5 images with Auto1111 (in other words, quite low requirements). The serverless endpoint is configured with 16 GB (the logs show more memory available, but the setup was 16 GB), and it receives very few requests. That's how I know the worker had just booted with a fresh start in the two test cases that failed.

Practically right after booting, when I try to begin inference, I get the following error:

A1111 Response: {'error': 'OutOfMemoryError', 'detail': '', 'body': '', 'errors': 'CUDA out of memory. Tried to allocate 146.00 MiB. GPU 0 has a total capacty of 19.70 GiB of which 10.38 MiB is free. Process 1790219 has 19.49 GiB memory in use. Process 3035077 has 194.00 MiB memory in use. Of the allocated memory 244.00 KiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'}

It says that some process is using around 20 GB of memory. This hasn't failed before, and I assume it's unlikely that my specific Stable Diffusion operation uses that much memory. Can anyone help me figure out where to start digging? Is it (at least theoretically) possible that some other process running on the same machine, but not from me, is using some shared memory here? Thanks!
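A minimal sketch of one way to start digging, assuming the handler runs PyTorch in the same container: log free vs. total VRAM when the worker boots, so the worker logs show whether memory is already taken before any inference runs (log_gpu_memory is a hypothetical helper name):

import torch

def log_gpu_memory(tag: str) -> None:
    # Print free/total VRAM so the worker logs show whether memory is already taken at boot.
    if torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print(f"[{tag}] GPU free: {free_bytes / 1e9:.2f} GB / total: {total_bytes / 1e9:.2f} GB")

# Call once at worker start, before the first inference request is handled.
log_gpu_memory("worker-start")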
17 Replies
dbtr
dbtrOP•7mo ago
I'd like to add that the same operation with the same image processed successfully a day later with no errors
Unknown User
Unknown User•7mo ago
Message Not Public
riverfog7
riverfog7•7mo ago
maybe that specific gpu errored and needs a reset
dbtr
dbtrOP•7mo ago
thank you guys! the original problem was with the worker id "3yo6ri2zzmuvmq" (i can provide a log). in fact, two calls on the same worker failed (the worker was idle/down in between for around 4-5 days), whereas other workers seemed to work in the meantime.

for lack of an alternative, i have since upgraded my setup from 16 GB to 24 GB. i now have a new failure with the new worker (yesterday when i tried it worked):

{'error': 'RuntimeError', 'detail': '', 'body': '', 'errors': 'CUDA error: misaligned address\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n'}

it's all a bit confusing and I don't really know where to start debugging 😕
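The error text itself suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of where that could be set, assuming it has to happen before CUDA is initialised (setting it in the endpoint template's environment variables should work just as well):

import os

# Must be set before CUDA is initialised (i.e. before importing torch / loading the model),
# so the reported stack trace points at the kernel that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch  # noqa: E402  (imported after the env var on purpose)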
Unknown User
Unknown User•7mo ago
Message Not Public
Eren
Eren•7mo ago
I can confirm that: roughly 1 out of 100 workers has problems running the code and ends up with OOM. Just kill the worker and move on.
dbtr
dbtrOP•7mo ago
Thanks @Eren - is there a way to kill the worker programmatically? Since the worker stays intact (even though we get OOM when invoking the Stable Diffusion A1111 API on the server side), subsequent requests will again use the same worker and again result in OOM. I would like to programmatically catch the OOM and force Runpod to terminate the worker / choose a different one.
Eren
Eren•7mo ago
Yes, you can catch the OOM exception and use the GraphQL API to kill the worker. OOM can occur for several reasons and doesn't necessarily mean the worker itself is bad, but yes, you can do that programmatically. I also strongly recommend clearing the torch CUDA cache and running the garbage collector periodically.
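For the "terminate the worker / choose a different one" part, one alternative to the GraphQL route is a handler-level sketch, assuming the endpoint uses the runpod Python SDK and that its refresh_worker return flag is available (run_a1111_inference is a hypothetical stand-in for the actual A1111 call):

import gc

import runpod
import torch

def handler(job):
    try:
        result = run_a1111_inference(job["input"])  # hypothetical call into the A1111 pipeline
        return {"output": result}
    except RuntimeError as e:
        if "out of memory" in str(e).lower():
            torch.cuda.empty_cache()
            gc.collect()
            # Report the failure and ask Runpod to stop this worker after the job,
            # so the next request lands on a fresh one.
            return {"error": str(e), "refresh_worker": True}
        raise

runpod.serverless.start({"handler": handler})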
Unknown User
Unknown User•7mo ago
Message Not Public
Eren
Eren•7mo ago
many ways, but it can be as simple as this:
import gc

import torch

# model / input_tensor: whatever pipeline object and input you already have loaded
try:
    output = model(input_tensor.to("cuda"))
except RuntimeError as e:
    if "out of memory" in str(e).lower():
        # Free cached allocations so the worker has a chance to recover
        print("Caught CUDA OOM – cleaning up")
        torch.cuda.empty_cache()
        gc.collect()
    else:
        raise
Unknown User
Unknown User•7mo ago
Message Not Public
Eren
Eren•7mo ago
if your client polls/checks the request execution result and it returns FAILED, the error key holds the raised "e" as a string, so another way might be to check it on the client side like this (a sketch follows the JSON below):
{
    "delayTime": 2222,
    "error": "bla-bla-bla out of memory i need help to fit 1 mb please bla-bla-bla",
    "executionTime": 2222,
    "id": "223j2kn2b3j23jbk2b",
    "status": "FAILED",
    "workerId": "2j3njkg8fdsmgk"
}
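A client-side sketch of that check, assuming the standard serverless /status route on api.runpod.ai; the API key, endpoint id, and job id are placeholders, and what you do with the bad worker (retry, GraphQL terminate) is left open:

import time

import requests

API_KEY = "..."                 # placeholder: your Runpod API key
ENDPOINT_ID = "your-endpoint"   # placeholder: your serverless endpoint id
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def wait_for_job(job_id: str, poll_seconds: float = 2.0) -> dict:
    # Poll the /status route until the job finishes, then return the final payload.
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        time.sleep(poll_seconds)

result = wait_for_job("223j2kn2b3j23jbk2b")
if result.get("status") == "FAILED" and "out of memory" in str(result.get("error", "")).lower():
    print(f"OOM on worker {result.get('workerId')} – retry the request and/or replace that worker")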
Unknown User
Unknown User•7mo ago
Message Not Public
Eren
Eren•7mo ago
Yeah, this mostly applies when using your own pipeline, wrapping it inside this logic.
Unknown User
Unknown User•7mo ago
Message Not Public
Eren
Eren•7mo ago
I don't have an A1111 deployment ready, but I assume it should return that "error" key shown above in the failed request's response; that could be read and the worker then killed.
Unknown User
Unknown User•7mo ago
Message Not Public
