Created by dbtr on 4/6/2025 in #⚡|serverless
Serverless endpoint fails with Out Of Memory despite no changes
Thanks @Eren - is there a way to kill the worker programmatically? Since the worker stays intact (even though we get OOM when invoking the Stable Diffusion A1111 API on the server side), subsequent requests will be routed to the same worker and hit OOM again. I would like to programmatically catch the OOM and force RunPod to terminate the worker / choose a different one.
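For reference, here is a minimal sketch of one way this could look inside the serverless handler. It assumes the runpod Python SDK and a PyTorch-backed worker; run_a1111() is a hypothetical stand-in for the existing A1111 API call, and the refresh_worker flag should be checked against the current RunPod docs before relying on it.

```python
import sys
import runpod
import torch

def handler(job):
    try:
        # run_a1111 is a hypothetical helper wrapping the existing A1111 API call
        result = run_a1111(job["input"])
        return {"output": result}
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        # Ask RunPod to retire this worker after the job finishes, so the next
        # request is not routed back to it (refresh_worker is an SDK flag on the
        # handler's return value; confirm support in the current docs).
        return {"error": "CUDA out of memory", "refresh_worker": True}
    except RuntimeError as exc:
        if "out of memory" in str(exc).lower():
            # Fallback: hard-exit so the worker container is torn down entirely
            # and a fresh worker picks up the next request.
            sys.exit(1)
        raise

runpod.serverless.start({"handler": handler})
```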
dbtr
Thank you guys! The original problem was with the worker id "3yo6ri2zzmuvmq" (I can provide a log). In fact, two calls on the same worker failed (the worker was idle/down for around 4-5 days in between), whereas other workers seemed to work in the meantime. Since then, for lack of a better alternative, I have upgraded my setup from 16 GB to 24 GB GPUs. I now get a new failure on the new worker (when I tried yesterday, it worked):

{'error': 'RuntimeError', 'detail': '', 'body': '', 'errors': 'CUDA error: misaligned address\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n'}

It's all a bit confusing and I don't know where to start debugging, really 😕
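As the error text itself suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the reported stack trace points at the real failing call. A minimal sketch of how that could be wired into the worker, assuming a PyTorch-based image (report_gpu_state is an illustrative helper for logging, not part of A1111 or RunPod):

```python
# Must be set before CUDA is initialized, so put it at the very top of the
# handler module (or in the endpoint's environment variables / Dockerfile).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches -> accurate stack traces

import torch  # imported only after the env var is set

def report_gpu_state() -> dict:
    """Collect basic GPU memory stats to log alongside a failing request."""
    if not torch.cuda.is_available():
        return {"cuda": False}
    free, total = torch.cuda.mem_get_info()
    return {
        "cuda": True,
        "device": torch.cuda.get_device_name(0),
        "free_gb": round(free / 1024**3, 2),
        "total_gb": round(total / 1024**3, 2),
    }
```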
dbtr
I'd like to add that the same operation on the same image was processed successfully a day later, with no errors.