RunPod•4w ago
dbtr

Serverless endpoint fails with Out Of Memory despite no changes

For several months I have been using the same endpoint code to generate Stable Diffusion 1.5 images at 512x512 with Auto1111 (in other words, quite low specs). I have a serverless endpoint configured with 16 GB (the logs show more memory available, but the setup was 16 GB). There are very few requests to the endpoint, which is how I know the worker was booting up from a fresh start in the two test cases that failed. Practically right after booting, when I try to begin inference, I get the following error:

A1111 Response: {'error': 'OutOfMemoryError', 'detail': '', 'body': '', 'errors': 'CUDA out of memory. Tried to allocate 146.00 MiB. GPU 0 has a total capacty of 19.70 GiB of which 10.38 MiB is free. Process 1790219 has 19.49 GiB memory in use. Process 3035077 has 194.00 MiB memory in use. Of the allocated memory 244.00 KiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'}

It says that a certain process is using around 20 GB of memory. This hasn't failed before, and it is unlikely that my specific Stable Diffusion operation uses this much memory. Can anyone help me figure out where to start digging? Is it (at least theoretically) possible that some other process running on the same machine, but not started by me, is using some of the shared memory here? Thanks!
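One practical first step, if you suspect another process already holds the memory: log the GPU state as soon as the worker boots, before A1111 loads the model. A minimal sketch, assuming torch and nvidia-smi are available inside the worker image:

    import subprocess
    import torch

    def log_gpu_state():
        # Free/total VRAM on the current device, in bytes (PyTorch >= 1.10)
        free, total = torch.cuda.mem_get_info()
        print(f"GPU free/total: {free / 1e9:.2f} / {total / 1e9:.2f} GB")
        # nvidia-smi lists which PIDs currently hold GPU memory
        print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)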
20 Replies
dbtr
dbtrOP•4w ago
I'd like to add that the same operation with the same image processed successfully a day later with no errors
Jason
Jason•4w ago
No, I don't think so. It's highly unlikely, since workers are usually divided into containers (one GPU per container, or a similar split), and each is limited to the amount of GPU memory configured. I think it's your code, but if you want, you can ask support to check by giving them your worker ID.
riverfog7
riverfog7•4w ago
maybe that specific gpu errored and needs a reset
dbtr
dbtrOP•4w ago
Thank you guys! The original problem was with worker ID "3yo6ri2zzmuvmq" (I can provide a log). In fact, two calls on the same worker failed (the worker was idle/down in between for around 4-5 days), whereas other workers seemed to work in the meantime.

I have since upgraded my setup from 16 GB to 24 GB, for lack of an alternative. I now have a new failure with the new worker (yesterday when I tried, it worked):

{'error': 'RuntimeError', 'detail': '', 'body': '', 'errors': 'CUDA error: misaligned address\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n'}

It's all a bit confusing and I don't know where to start debugging really 😕
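For the misaligned-address error, the message itself points at the first debugging step: make kernel launches synchronous so the stack trace shows the real failing call. A minimal sketch, assuming you can set the variable before torch/A1111 initializes CUDA (it can also go into the endpoint's environment settings):

    import os

    # Must be set before CUDA is initialized; it slows inference, so use it only for debugging
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"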
Jason
Jason•4w ago
Hmm, try passing this to ChatGPT - it might be able to help.
Eren
Eren•4w ago
I can confirm that; roughly 1 out of 100 workers has a problem running the code and ends up with OOM. Just kill the worker and move on.
dbtr
dbtrOP•4w ago
Thanks @Eren - is there a way to kill the worker programmatically? Since the worker is still intact (even though we get OOM when invoking the Stable Diffusion A1111 API on the server side), subsequent requests will again use the same worker, again resulting in OOM. I would like to programmatically catch the OOM and force RunPod to terminate the worker / choose a different one.
Eren
Eren•4w ago
Yes, you can catch the OOM exception and use the GraphQL API to kill the worker. OOM can occur for several reasons and doesn't necessarily mean the worker itself is bad, but yes, you can do that programmatically. I also strongly recommend implementing torch cache clearing and periodically running the garbage collector.
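A rough sketch of that GraphQL route, assuming the podTerminate mutation accepts the worker's pod ID and that RUNPOD_API_KEY / RUNPOD_POD_ID are present in the worker's environment (verify the mutation name against the current RunPod API docs):

    import os
    import requests

    def kill_current_worker():
        # Hypothetical helper: asks RunPod to terminate the pod/worker this code runs on
        query = 'mutation { podTerminate(input: {podId: "%s"}) }' % os.environ["RUNPOD_POD_ID"]
        resp = requests.post(
            "https://api.runpod.io/graphql",
            params={"api_key": os.environ["RUNPOD_API_KEY"]},
            json={"query": query},
            timeout=30,
        )
        resp.raise_for_status()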
Jason
Jason•4w ago
or the runpodctl tool
Jason
Jason•4w ago
You can run this command programmatically, filling in the pod ID with your worker ID: https://docs.runpod.io/runpodctl/reference/runpodctl_remove_pod
Jason
Jason•4w ago
Each worker has these variables (try testing it): https://docs.runpod.io/pods/references/environment-variables
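Putting those two hints together, a minimal sketch of removing the current worker from inside the handler, assuming runpodctl is installed in the image and has a valid API key configured:

    import os
    import subprocess

    def remove_current_worker():
        # RUNPOD_POD_ID is injected into each worker (see the environment variables page above)
        pod_id = os.environ["RUNPOD_POD_ID"]
        subprocess.run(["runpodctl", "remove", "pod", pod_id], check=True)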
Jason
Jason•4w ago
How can we catch OOM? I'm curious - is it from the request response of the A1111 API?
Eren
Eren•3w ago
There are many ways, but it can be as simple as this:
import gc
import torch

try:
    output = model(input_tensor.to("cuda"))
except RuntimeError as e:
    if "out of memory" in str(e).lower():
        print("Caught CUDA OOM – cleaning up")
        torch.cuda.empty_cache()   # release cached blocks back to the driver
        gc.collect()               # drop Python references that keep tensors alive
    else:
        raise
Jason
Jason•3w ago
I see, yeah, but this is using the A1111 API, which will not return errors like that in the API response.
Eren
Eren•3w ago
If it polls/checks the request execution result and the status comes back FAILED, the "error" key has the value - it also contains the raised "e" variable as a string. So another way might be something like this:
{
"delayTime": 2222,
"error": "bla-bla-bla out of memory i need help to fit 1 mb please bla-bla-bla",
"executionTime": 2222,
"id": "223j2kn2b3j23jbk2b",
"status": "FAILED",
"workerId": "2j3njkg8fdsmgk"
}
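A hedged sketch of that client-side check, assuming the standard /status route of the serverless REST API; endpoint_id and job_id are placeholders:

    import os
    import requests

    def check_job(endpoint_id: str, job_id: str) -> dict:
        # Poll the job status; on an OOM failure, note the worker ID so it can be
        # removed (e.g. with "runpodctl remove pod <workerId>") before retrying.
        resp = requests.get(
            f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}",
            headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
            timeout=30,
        )
        resp.raise_for_status()
        result = resp.json()
        if result.get("status") == "FAILED" and "out of memory" in str(result.get("error", "")).lower():
            print(f"OOM on worker {result.get('workerId')} - remove it before retrying")
        return result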
Jason
Jason•3w ago
That's when using another library, not when making requests to the A1111 API, isn't it?
Eren
Eren•3w ago
Yeah, this mostly applies when using your own pipeline, wrapping it inside this logic.
Jason
Jason•3w ago
Yeah... I wonder what happens if we use the A1111 API, like in the main discussion on this thread. Maybe we'd need to catch the error from the logs.
Eren
Eren•3w ago
I don't have an A1111 deployment ready, but I assume it should return the "error" key shown above in the failed request response - that could be read, and then the worker killed.
Jason
Jason•3w ago
Yeah, I'm not sure about that - might be.
