Björn 17mo ago

Higher-End GPU Workers Stop Prematurely

Hi. I am trying to run a serverless endpoint with the Omost model, which requires more VRAM. When I accidentally started it with a 20 GB GPU, everything worked fine apart from the expected CUDA OOM. Configured to use 80 GB VRAM in EU-RO-1, the endpoint is created, but the workers constantly end prematurely. Is there any way to figure out what is happening and why? The logs do not really seem to help me here.
6 Replies
yhlong00000 17mo ago
maybe post the log here?
Unknown User 17mo ago
(Message not public)
Björn (OP) 17mo ago
That's the whole code (sorry, I have no publicly accessible Git repo). I start it via the Serverless + Endpoint UI using a template: checking the two 80 GB GPU options, then selecting the template and our Network Volume in the EU-RO-1 region. The template's Container Image is runpod/base:0.6.1-cuda12.1.0, and its Container Start Command is bash -c ". /runpod-volume/b426d666/prod/venv/bin/activate && python -u /runpod-volume/b426d666/prod/omost/src/handler.py", with a 100 GB Container Disk.
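For context, a handler started that way is typically just a script that registers a handler function with the runpod SDK and then blocks, polling for jobs. The following is only a minimal sketch of that usual pattern, not the OP's actual handler.py; the Omost-specific logic is omitted.

```python
# Minimal sketch of a RunPod serverless handler (not the OP's code).
# Assumes the standard `runpod` Python SDK is installed in the venv.
import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to the endpoint.
    prompt = job["input"].get("prompt", "")
    # ... load/run the Omost pipeline here (omitted) ...
    return {"output": f"generated for: {prompt}"}

# Blocks and polls for jobs; this call is what keeps the worker alive.
runpod.serverless.start({"handler": handler})
```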
Unknown User 17mo ago
(Message not public)
Björn (OP) 17mo ago
The worker itself doesn't start, i.e., its box turns red and the tooltip says "Worker stopped prematurely". But this does not occur when running on 20 GB GPUs.
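One hedged debugging sketch for "stopped prematurely" cases like this: print unbuffered progress markers around the model load in handler.py, so the worker logs show how far startup got before the process died. load_omost() below is a hypothetical stand-in for whatever actually loads the model, and this assumes the worker's stdout/stderr is surfaced in the endpoint logs.

```python
# Debugging sketch (assumption: RunPod surfaces the worker's stdout/stderr
# in the endpoint logs). load_omost() is a hypothetical placeholder.
import sys
import traceback

def log(msg):
    print(msg, flush=True)  # flush so the line survives a sudden crash

try:
    log("handler.py: starting model load")
    model = load_omost()  # hypothetical: whatever loads Omost into VRAM
    log("handler.py: model loaded, registering handler")
except Exception:
    # Print the full traceback before the worker dies, then re-raise.
    traceback.print_exc(file=sys.stdout)
    sys.stdout.flush()
    raise
```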
Unknown User 17mo ago
(Message not public)
