Serverless endpoint fails with Out Of Memory despite no changes
For several months I have been using the same endpoint code to generate 512x512 Stable Diffusion 1.5 images with Auto1111 (in other words, fairly low requirements). I have a serverless endpoint configured with 16 GB (the logs show more memory available, but the setup was 16 GB).
There are very few requests to the endpoint, which is how I know the worker was just booting up from a fresh start in the two test cases that failed.
Practically right after booting, when I try to start inference, I get the following error:
A1111 Response: {'error': 'OutOfMemoryError', 'detail': '', 'body': '', 'errors': 'CUDA out of memory. Tried to allocate 146.00 MiB. GPU 0 has a total capacty of 19.70 GiB of which 10.38 MiB is free. Process 1790219 has 19.49 GiB memory in use. Process 3035077 has 194.00 MiB memory in use. Of the allocated memory 244.00 KiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'}
It says that a certain process is using around 20 GB of memory. This has never failed before, and I assume it is unlikely that my specific Stable Diffusion operation uses that much memory.
Can anyone suggest where to start digging? Is it (at least theoretically) possible that some other process running on the same machine, not started by me, is using shared GPU memory here?
Thanks!
I'd like to add that the same operation with the same image processed successfully a day later with no errors
No, I don't think so; it's highly unlikely, since workloads are usually isolated per container (e.g. 1 GPU per container) and each worker is limited to the amount of GPU it was configured with.
I think it's your code, but if you want you can ask support to check using your worker ID.
maybe that specific gpu errored and needs a reset
Thank you guys! The original problem was with the worker ID "3yo6ri2zzmuvmq" (I can provide a log). In fact, two calls on the same worker (which was idle/down in between for around 4-5 days) failed, whereas other workers seemed to work fine in the meantime.
I have since upgraded my setup from 16 GB to 24 GB, for lack of anything better to try. I now have a new failure with the new worker (yesterday, when I tried, it worked):
{'error': 'RuntimeError', 'detail': '', 'body': '', 'errors': 'CUDA error: misaligned address\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n'}
It's all a bit confusing and I don't know where to start debugging, really 😕
Hmm, try passing this to ChatGPT, it might be able to help.
I can confirm that: 1 out of 100 workers has problems running the code and ends up with OOM. Just kill the worker and move on.
Thanks @Eren - is there a way to kill the worker programmatically? Since the worker stays intact (even though we get OOM when invoking the Stable Diffusion A1111 API on the server's side), subsequent requests will again use the same worker, again resulting in OOM. I would like to programmatically catch the OOM and force RunPod to terminate the worker / choose a different one.
Yes, you can catch the OOM exception and use the GraphQL API to kill the worker. OOM can occur for several reasons and doesn't necessarily mean the worker is a bad one, but yes, you can do that programmatically.
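A minimal sketch of that idea in Python, not a definitive implementation: it assumes the RunPod GraphQL endpoint at https://api.runpod.io/graphql authenticated via an api_key query parameter, a podTerminate mutation that accepts the worker's pod ID, and the RUNPOD_API_KEY / RUNPOD_POD_ID environment variables mentioned further down the thread; check the current GraphQL docs before relying on the exact mutation name.

```python
import os
import requests
import torch

# Assumed to be available inside the worker container.
RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
POD_ID = os.environ.get("RUNPOD_POD_ID", "")


def terminate_current_worker() -> None:
    """Ask RunPod (via GraphQL) to terminate this worker; mutation name is an assumption."""
    query = 'mutation { podTerminate(input: {podId: "%s"}) }' % POD_ID
    requests.post(
        "https://api.runpod.io/graphql",
        params={"api_key": RUNPOD_API_KEY},
        json={"query": query},
        timeout=30,
    )


def run_job_safely(run_inference):
    try:
        return run_inference()
    except torch.cuda.OutOfMemoryError:
        # The GPU on this worker is in a bad state: kill the worker so the next
        # request lands on a fresh one, then re-raise so the job is marked FAILED.
        terminate_current_worker()
        raise
```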
I also strongly recommend adding support for clearing the torch cache and periodically running the garbage collector.
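For reference, a small sketch of what that cleanup can look like (plain PyTorch and stdlib calls, nothing RunPod-specific):

```python
import gc
import torch


def free_gpu_memory() -> None:
    # Drop unreachable Python objects first so their tensors can actually be freed...
    gc.collect()
    # ...then hand cached CUDA blocks back to the driver so the next job
    # (or another process on the same GPU) can use them.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```

Calling something like this between jobs, or after a failed one, keeps a long-lived worker from slowly accumulating reserved-but-unused memory.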
or runpodctl tool
You can run this command programmatically, filling in the pod ID with your worker ID:
https://docs.runpod.io/runpodctl/reference/runpodctl_remove_pod
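A minimal sketch of doing that from inside a handler, assuming runpodctl is installed on the worker image and already configured with an API key, and using the RUNPOD_POD_ID variable from the environment-variables page linked just below:

```python
import os
import subprocess


def remove_current_worker() -> None:
    pod_id = os.environ["RUNPOD_POD_ID"]  # this worker's own ID
    # Equivalent to running `runpodctl remove pod <pod_id>` in a shell.
    subprocess.run(["runpodctl", "remove", "pod", pod_id], check=True)
```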
Each worker has these variables (try to test it): https://docs.runpod.io/pods/references/environment-variables
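For example, the handler can read its own worker ID (and the API key, if it is exposed) from those variables; the names below are taken from that page but are worth double-checking against your image:

```python
import os

# Identify the worker we are running on so it can be logged or terminated later.
pod_id = os.environ.get("RUNPOD_POD_ID")
api_key = os.environ.get("RUNPOD_API_KEY")
print(f"Running on worker {pod_id}")
```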
How can we catch OOM? I'm curious - is it from the request response of the A1111 API?
There are many ways, but it can be as simple as this:
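A rough sketch of the pattern, assuming the handler calls into PyTorch directly; run_pipeline here is a hypothetical placeholder for your own inference function:

```python
import torch


def handler(job):
    try:
        return run_pipeline(job["input"])  # hypothetical: your own inference call
    except torch.cuda.OutOfMemoryError as e:
        # Report the failure instead of crashing the worker loop.
        return {"error": "OutOfMemoryError", "detail": str(e)}
```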
I see, yeah, but this setup uses the A1111 API, which won't return errors like that through the API.
If it polls/checks the request execution result and the status comes back FAILED, the error key holds the value; it also contains the raised "e" exception as a string, so that might be another way.
like this:
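A sketch of that polling idea, assuming the serverless status endpoint at https://api.runpod.ai/v2/<endpoint_id>/status/<job_id>; the endpoint ID, header, and key names should be checked against the current API docs:

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]


def check_job(job_id: str) -> dict:
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    result = resp.json()
    if result.get("status") == "FAILED":
        # The 'error' key carries the stringified exception raised in the handler.
        print("Job failed:", result.get("error"))
    return result
```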
That's when using another library, not when making requests to the A1111 API, isn't it?
Yeah, this mostly applies when using your own pipeline, wrapped inside this kind of logic.
Yeah... I wonder what happens if we use the A1111 API, like in the main discussion on this thread.
Maybe you'd need to catch the error from the logs.
I don't have an A1111 deployment ready, but I assume it should return the "error" key shown above in the failed request response; that could be read and then used to kill the worker.
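Roughly what that check could look like in a handler that proxies to a local A1111 instance; the route and the 'error'/'errors' keys mirror the dict quoted at the top of the thread, and terminate_current_worker is the hypothetical helper sketched earlier:

```python
import requests

A1111_URL = "http://127.0.0.1:7860"  # assumed local A1111 instance started with --api


def txt2img_or_recycle(payload: dict) -> dict:
    resp = requests.post(f"{A1111_URL}/sdapi/v1/txt2img", json=payload, timeout=600)
    data = resp.json()
    # A failed call comes back with an 'error' key, as in the response quoted above.
    if isinstance(data, dict) and data.get("error") == "OutOfMemoryError":
        terminate_current_worker()  # hypothetical helper from the GraphQL sketch above
        raise RuntimeError(f"Worker recycled after OOM: {data.get('errors')}")
    return data
```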
Yeah, I'm not sure about that - might be.