Serverless endpoint fails with Out Of Memory despite no changes
For several months I have been using the same endpoint code to generate 512x512 Stable Diffusion 1.5 images with Auto1111 (in other words, fairly low requirements). I have a serverless endpoint configured with 16 GB (the logs show more memory available, but the setup was 16 GB).
There are very few requests to the endpoint, which is how I know the worker was just booting up from a fresh start in the two test cases that failed.
Practically right after booting, when I try to start inference, I get the following error:
A1111 Response: {'error': 'OutOfMemoryError', 'detail': '', 'body': '', 'errors': 'CUDA out of memory. Tried to allocate 146.00 MiB. GPU 0 has a total capacty of 19.70 GiB of which 10.38 MiB is free. Process 1790219 has 19.49 GiB memory in use. Process 3035077 has 194.00 MiB memory in use. Of the allocated memory 244.00 KiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'}
It says that a certain process is using around 20 GB of memory. This has never failed before, and I assume it is unlikely that my specific Stable Diffusion operation uses that much memory.
Can anyone suggest where to start digging? Is it (at least theoretically) possible that some other process running on the same machine, not started by me, is using shared GPU memory here?
Thanks!
I'd like to add that the same operation with the same image processed successfully a day later with no errors
No, I don't think so; it's highly unlikely, since workloads are usually isolated per container (e.g. 1 GPU per container) and each worker is limited to the amount of GPU it was configured with.
I think it's your code, but if you want you can ask support to check using your worker ID.
maybe that specific gpu errored and needs a reset
Thank you guys! The original problem was with the worker ID "3yo6ri2zzmuvmq" (I can provide a log). In fact, two calls on the same worker (which was idle/down in between for around 4-5 days) failed, whereas other workers seemed to work fine in the meantime.
I have since upgraded my setup from 16 GB to 24 GB, for lack of anything better to try. I now have a new failure with the new worker (yesterday, when I tried, it worked):
{'error': 'RuntimeError', 'detail': '', 'body': '', 'errors': 'CUDA error: misaligned address\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n'}
It's all a bit confusing and I don't know where to start debugging, really 😕
Hmm, try passing this to ChatGPT, it might be able to help.
I can confirm that: 1 out of 100 workers has problems running the code and ends up with OOM. Just kill the worker and move on.
Thanks @Eren - is there a way to kill the worker programmatically? Since the worker stays intact (even though we get OOM when invoking the Stable Diffusion A1111 API on the server's side), subsequent requests will again use the same worker, again resulting in OOM. I would like to programmatically catch the OOM and force RunPod to terminate the worker / choose a different one.
Yes, you can catch the OOM exception and use the GraphQL API to kill the worker. OOM can occur for several reasons and doesn't necessarily mean the worker is a bad one, but yes, you can do that programmatically.
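A minimal sketch of that idea in Python, not a definitive implementation: it assumes the RunPod GraphQL endpoint at https://api.runpod.io/graphql authenticated via an api_key query parameter, a podTerminate mutation that accepts the worker's pod ID, and the RUNPOD_API_KEY / RUNPOD_POD_ID environment variables mentioned further down the thread; check the current GraphQL docs before relying on the exact mutation name.

```python
import os
import requests
import torch

# Assumed to be available inside the worker container.
RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
POD_ID = os.environ.get("RUNPOD_POD_ID", "")


def terminate_current_worker() -> None:
    """Ask RunPod (via GraphQL) to terminate this worker; mutation name is an assumption."""
    query = 'mutation { podTerminate(input: {podId: "%s"}) }' % POD_ID
    requests.post(
        "https://api.runpod.io/graphql",
        params={"api_key": RUNPOD_API_KEY},
        json={"query": query},
        timeout=30,
    )


def run_job_safely(run_inference):
    try:
        return run_inference()
    except torch.cuda.OutOfMemoryError:
        # The GPU on this worker is in a bad state: kill the worker so the next
        # request lands on a fresh one, then re-raise so the job is marked FAILED.
        terminate_current_worker()
        raise
```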
I also strongly recommend adding support for clearing the torch cache and periodically running the garbage collector.
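For reference, a small sketch of what that cleanup can look like (plain PyTorch and stdlib calls, nothing RunPod-specific):

```python
import gc
import torch


def free_gpu_memory() -> None:
    # Drop unreachable Python objects first so their tensors can actually be freed...
    gc.collect()
    # ...then hand cached CUDA blocks back to the driver so the next job
    # (or another process on the same GPU) can use them.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```

Calling something like this between jobs, or after a failed one, keeps a long-lived worker from slowly accumulating reserved-but-unused memory.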
or runpodctl tool
You can run this command programmatically, filling in the pod ID with your worker ID:
https://docs.runpod.io/runpodctl/reference/runpodctl_remove_pod
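A minimal sketch of doing that from inside a handler, assuming runpodctl is installed on the worker image and already configured with an API key, and using the RUNPOD_POD_ID variable from the environment-variables page linked just below:

```python
import os
import subprocess


def remove_current_worker() -> None:
    pod_id = os.environ["RUNPOD_POD_ID"]  # this worker's own ID
    # Equivalent to running `runpodctl remove pod <pod_id>` in a shell.
    subprocess.run(["runpodctl", "remove", "pod", pod_id], check=True)
```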
Each worker has these variables (try to test it): https://docs.runpod.io/pods/references/environment-variables
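For example, the handler can read its own worker ID (and the API key, if it is exposed) from those variables; the names below are taken from that page but are worth double-checking against your image:

```python
import os

# Identify the worker we are running on so it can be logged or terminated later.
pod_id = os.environ.get("RUNPOD_POD_ID")
api_key = os.environ.get("RUNPOD_API_KEY")
print(f"Running on worker {pod_id}")
```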
How can we catch OOM? I'm curious - is it from the request response of the A1111 API?
There are many ways, but it can be as simple as this:
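A rough sketch of the pattern, assuming the handler calls into PyTorch directly; run_pipeline here is a hypothetical placeholder for your own inference function:

```python
import torch


def handler(job):
    try:
        return run_pipeline(job["input"])  # hypothetical: your own inference call
    except torch.cuda.OutOfMemoryError as e:
        # Report the failure instead of crashing the worker loop.
        return {"error": "OutOfMemoryError", "detail": str(e)}
```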
I see, yeah, but this setup uses the A1111 API, which won't return errors like that through the API.
If it polls/checks the request execution result and the status comes back FAILED, the error key holds the value; it also contains the raised "e" exception as a string, so that might be another way.
like this:
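A sketch of that polling idea, assuming the serverless status endpoint at https://api.runpod.ai/v2/<endpoint_id>/status/<job_id>; the endpoint ID, header, and key names should be checked against the current API docs:

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]


def check_job(job_id: str) -> dict:
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    result = resp.json()
    if result.get("status") == "FAILED":
        # The 'error' key carries the stringified exception raised in the handler.
        print("Job failed:", result.get("error"))
    return result
```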
That's when using another library, not when making requests to the A1111 API, isn't it?
Yeah, this mostly applies when using your own pipeline, wrapped inside this kind of logic.
Yeah... I wonder what happens if we use the A1111 API, like in the main discussion on this thread.
Maybe you'd need to catch the error from the logs.
I don't have an A1111 deployment ready, but I assume it should return the "error" key shown above in the failed request response; that could be read and then used to kill the worker.
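Roughly what that check could look like in a handler that proxies to a local A1111 instance; the route and the 'error'/'errors' keys mirror the dict quoted at the top of the thread, and terminate_current_worker is the hypothetical helper sketched earlier:

```python
import requests

A1111_URL = "http://127.0.0.1:7860"  # assumed local A1111 instance started with --api


def txt2img_or_recycle(payload: dict) -> dict:
    resp = requests.post(f"{A1111_URL}/sdapi/v1/txt2img", json=payload, timeout=600)
    data = resp.json()
    # A failed call comes back with an 'error' key, as in the response quoted above.
    if isinstance(data, dict) and data.get("error") == "OutOfMemoryError":
        terminate_current_worker()  # hypothetical helper from the GraphQL sketch above
        raise RuntimeError(f"Worker recycled after OOM: {data.get('errors')}")
    return data
```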
Yeah, I'm not sure about that - might be.