Björn 17mo ago

Higher-End GPU Workers Stop Prematurely

Hi. I am trying to run a serverless endpoint with the Omost model, which requires more VRAM. When I accidentally started it with a 20 GB GPU, everything worked fine apart from the expected CUDA OOM. Configured to use 80 GB VRAM in EU-RO-1, the endpoint is created, but the workers constantly end prematurely. Is there any way to figure out what is happening and why? The logs do not really seem to help me here.
6 Replies
yhlong00000 17mo ago
maybe post the log here?
Unknown User 17mo ago
(Message not public)
Björn (OP) 17mo ago
That's the whole code (sorry, I have no publicly accessible Git repo). I start it via the Serverless + Endpoint UI using a template: checking the two 80 GB GPU options, then selecting the template and our Network Volume in the EU-RO-1 region. The template's Container Image is runpod/base:0.6.1-cuda12.1.0, and its Container Start Command is bash -c ". /runpod-volume/b426d666/prod/venv/bin/activate && python -u /runpod-volume/b426d666/prod/omost/src/handler.py", with a 100 GB Container Disk.
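For context, a handler started that way is typically just a script that registers a handler function with the runpod SDK and then blocks, polling for jobs. The following is only a minimal sketch of that usual pattern, not the OP's actual handler.py; the Omost-specific logic is omitted.

```python
# Minimal sketch of a RunPod serverless handler (not the OP's code).
# Assumes the standard `runpod` Python SDK is installed in the venv.
import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to the endpoint.
    prompt = job["input"].get("prompt", "")
    # ... load/run the Omost pipeline here (omitted) ...
    return {"output": f"generated for: {prompt}"}

# Blocks and polls for jobs; this call is what keeps the worker alive.
runpod.serverless.start({"handler": handler})
```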
Unknown User 17mo ago
(Message not public)
Björn (OP) 17mo ago
The worker itself doesn't start, i.e., its box turns red and the tooltip says "Worker stopped prematurely". But this does not occur when running on 20 GB GPUs.
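One hedged debugging sketch for "stopped prematurely" cases like this: print unbuffered progress markers around the model load in handler.py, so the worker logs show how far startup got before the process died. load_omost() below is a hypothetical stand-in for whatever actually loads the model, and this assumes the worker's stdout/stderr is surfaced in the endpoint logs.

```python
# Debugging sketch (assumption: RunPod surfaces the worker's stdout/stderr
# in the endpoint logs). load_omost() is a hypothetical placeholder.
import sys
import traceback

def log(msg):
    print(msg, flush=True)  # flush so the line survives a sudden crash

try:
    log("handler.py: starting model load")
    model = load_omost()  # hypothetical: whatever loads Omost into VRAM
    log("handler.py: model loaded, registering handler")
except Exception:
    # Print the full traceback before the worker dies, then re-raise.
    traceback.print_exc(file=sys.stdout)
    sys.stdout.flush()
    raise
```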
Unknown User 17mo ago
(Message not public)
