Serverless vLLM workers crash
Whenever I create a serverless vLLM endpoint (regardless of which model I use), the workers all end up crashing with the status "unhealthy". I checked the vLLM supported-models page and only use models that are supported. The last time I ran a serverless vLLM, I used meta-llama/Llama-3.1-70B with a valid Hugging Face token that has access to the model.

The result of running the default "Hello World" prompt on this endpoint is shown in the attached images. A worker has the status "running", but when you open its stats they are all at 0%, and there are no logs. The worker then becomes "unhealthy" and is moved to the Extra section. In this specific case, the last worker stayed "idle" and never picked up the request. I only let it sit for about 10 minutes, but it never picked up the request and started working.
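For reference, this is roughly how the "Hello World" test goes out (a minimal sketch with placeholder endpoint ID and API key, assuming the standard serverless `runsync` route and a plain prompt input):

```python
# Minimal sketch of the test request. Assumptions: the standard serverless
# runsync route and a plain "prompt" input; the endpoint ID and API key
# below are placeholders.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello World"}},
    timeout=120,
)
print(resp.status_code, resp.json())
```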
4 Replies
Can you check the logs?
No, when they crash, the logs disappear.
I mean, they will CUDA OOM.
A 70B model needs 140+ GB of VRAM in FP16, and you're giving vLLM only 96 GB.
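Rough math (weights only; KV cache, activations, and framework overhead come on top of this):

```python
# Back-of-the-envelope VRAM for the weights of a 70B model alone.
# Assumption: ignores KV cache, activations, and framework overhead.
PARAMS = 70e9  # Llama-3.1-70B parameter count

bytes_per_param = {"FP16/BF16": 2.0, "FP8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision:9s} ~{PARAMS * nbytes / 1e9:.0f} GB for weights")

# FP16/BF16 ~140 GB  -> does not fit in 96 GB
# FP8       ~70 GB
# INT4      ~35 GB
```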
Lower the precision to FP8 or INT4.
INT4 with a 32K context should work.
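Roughly the engine settings that implies, sketched with the plain vLLM API (the checkpoint name is illustrative; INT4 needs an AWQ/GPTQ pre-quantized checkpoint, and on the serverless template these map to the equivalent environment variables):

```python
# Sketch of quantized serving settings with the plain vLLM API.
# The checkpoint name is illustrative: INT4 requires a pre-quantized
# (AWQ/GPTQ) checkpoint rather than the FP16 weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/Llama-3.1-70B-Instruct-AWQ-INT4",  # illustrative name
    quantization="awq",           # or "gptq" / "fp8", matching the checkpoint
    max_model_len=32768,          # the 32K context mentioned above
    gpu_memory_utilization=0.95,  # leave a little headroom
)

out = llm.generate(["Hello World"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```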
DM me if it keeps crashing
Is it the default template?
Oh, an A40.
Yes, you might need more VRAM.
Select a different GPU, or more GPUs.
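A quick way to sanity-check how many GPUs a given precision needs (a rough sketch; the ~20% headroom for KV cache and overhead is an assumption, not a vLLM guarantee):

```python
# Rough sketch: GPUs needed per precision, using the weight sizes from the
# earlier estimate plus ~20% headroom for KV cache/overhead (an assumption).
import math

WEIGHTS_GB = {"FP16/BF16": 140, "FP8": 70, "INT4": 35}  # Llama-3.1-70B
HEADROOM = 1.2

def gpus_needed(precision: str, vram_per_gpu_gb: int) -> int:
    return math.ceil(WEIGHTS_GB[precision] * HEADROOM / vram_per_gpu_gb)

for p in WEIGHTS_GB:
    print(f"{p}: {gpus_needed(p, 48)}x 48 GB (A40), {gpus_needed(p, 80)}x 80 GB")
```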
I'd recommend discussing it here instead.