Stuck vLLM startup with 100% GPU utilization
Twice today I've deployed a new vLLM endpoint using the "Quick Deploy" "Serverless vLLM" option at https://www.runpod.io/console/serverless, only to have the worker get stuck after launching the vLLM process and before it starts downloading the weights. It never gets as far as downloading the HF model and loading it into vLLM.
* The model I've used is Qwen/Qwen2.5-72B-Instruct
* The problematic machines have all been A6000.
* Only a single worker configured with 4 x 48GB GPUs was set in the template configuration, in order to make the problem easier to track down (a single pod and a single machine).
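For context, a rough back-of-the-envelope check that this model should fit on the configuration above. This is only a sketch under stated assumptions: a nominal 72e9 parameter count, bf16 weights (2 bytes per parameter), and 48 GiB usable per GPU. It ignores KV cache and activation overhead, so it says nothing about why the worker hangs, only that the weights alone are not obviously too large.

```python
# Rough sizing check: do the model weights fit in aggregate GPU memory?
# All numbers below are assumptions for illustration, not measured values.

def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GiB (bf16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 2**30

NUM_PARAMS = 72e9    # nominal parameter count for a "72B" model (assumption)
NUM_GPUS = 4         # GPUs per worker, from the template configuration
GPU_MEM_GIB = 48     # nominal memory per GPU (assumption: 48 GiB usable)

needed = weights_gib(NUM_PARAMS)
available = NUM_GPUS * GPU_MEM_GIB

print(f"weights: ~{needed:.0f} GiB, available: {available} GiB")
print("weights fit" if needed < available else "weights do not fit")
```

By this estimate the weights leave some headroom across the four GPUs, so a plain out-of-memory condition at load time seems unlikely to be the cause.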
I currently have a worker stuck in this state, with the id: wxug1x04v59mxu
I'm going to terminate it since it just costs me money without providing any value, but if RunPod has the ability to check logs after the fact (e.g. some ELK stack or the like), I hope they can pinpoint the issue using that ID. If not, let me know and next time this happens I'll ping you so you can troubleshoot it live. Just let me know who to ping in that case.
Attached is the complete log from the worker.
9 Replies
@jojje
The thread has been escalated to Zendesk!
Hi, is there any update on this issue? I am seeing this quite consistently, although not always, on A40 GPUs when using vLLM (not serverless).

@Poddy
My issue was solved in https://discord.com/channels/912829806415085598/1359382997693894656/1359382997693894656
Sorry for the late reply. Been doing non-ML-related stuff of late. How can I check if the ticket has been resolved? I tried clicking the "Open Zendesk Ticket" button here in the thread, but just got a bot message: "You already have a ticket open ..., please wait for it to be resolved before opening a new one". So how do I find that ticket and read its content?
Nothing that I can find. It might have gone into spam and been auto-deleted. Weird that there's no way to access tickets from the same channel where one creates them.