Runpod · 11mo ago
PhysSci

50/50 success with running a standard vLLM template

So when I start vllm/vllm-openai:latest on 2xA100 or 4xA40, it only comes up about one in two or one in three times. I haven't noticed any logic behind it; it just fails sometimes. Here are the parameters I use for 2xA100, for instance: --host 0.0.0.0 --port 8000 --model meta-llama/Llama-3.3-70B-Instruct --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key key --max-model-len 16256 --tensor-parallel-size 2. I also have some logs.
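For anyone trying to reproduce: once the container logs show the server is up, the endpoint can be probed with any OpenAI-compatible client. A minimal sketch in Python (the pod address is a placeholder; the model name and API key mirror the flags above):

```python
# Probe the vLLM OpenAI-compatible server started with the flags above.
# "<pod-ip>" is a placeholder for the pod's public address/proxy URL.
from openai import OpenAI

client = OpenAI(base_url="http://<pod-ip>:8000/v1", api_key="key")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```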
9 Replies
Unknown User · 11mo ago (message not public)
yhlong00000 · 11mo ago
--max-model-len may be too large; try a smaller number.
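For a rough sense of why --max-model-len matters here: the KV cache scales linearly with it. A back-of-envelope estimate, assuming Llama-70B's published architecture (80 layers, 8 KV heads, head dim 128) and bf16:

```python
# Back-of-envelope KV-cache cost per full-length sequence for Llama-3.3-70B.
# Architecture numbers are from the public model config; treat as an estimate.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
max_model_len = 16256

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
per_seq_gib = bytes_per_token * max_model_len / 2**30
print(f"~{per_seq_gib:.1f} GiB of KV cache per full-length sequence")
# ~5 GiB per sequence on top of ~130 GiB of bf16 weights, which is tight
# on 2xA100-80GB even at --gpu-memory-utilization 0.95.
```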
PhysSci (OP) · 11mo ago
Why, then, can I often start it fine with these parameters? To use this in a commercial application I need consistency: either it works and I can use it, or it doesn't and I can troubleshoot it. I use Runpod in my job and almost always stick to it for prototyping, but I can't recommend it to my clients for production because such issues occur occasionally. That's a pity, because I like Runpod.
flash-singh · 11mo ago
Check the logs; it might be because the GPU or the container is OOMing. That's a common problem we see.
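For reference, free GPU memory can be checked from inside the pod before and during startup; a minimal sketch, assuming PyTorch with CUDA is available (it ships in the vLLM image):

```python
# Print free/total memory for each visible GPU to see how close
# the pod is to an out-of-memory condition during model load.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```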
PhysSci (OP) · 10mo ago
@flash-singh can you please explain what OOMing means and what I should look for in the logs? It seems the problem has gotten much worse, and now I can't start the pod at all.
flash-singh · 10mo ago
@yhlong00000 are you able to check? See if we also record it in the pod logs when they OOM.
yhlong00000 · 10mo ago
Hey, could you share your pod ID?
Unknown User · 10mo ago (message not public)
PhysSci (OP) · 9mo ago
@yhlong00000 Hi, sorry for not answering. By the time I read your message I had already deleted the pod. I finally got back to developing my app, so I started a pod once again on 4xA40, which worked fine before, and now I can't do it with that either. I am sure my settings are the same, because I use a Python script to launch it with the Runpod API. Pod ID: j7tmr0531ylfd6. I can keep it running for a while so you can check.
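For reference, the launch script is essentially the following. This is only a sketch: the create_pod parameter names (gpu_type_id, docker_args, ports) follow the Runpod Python SDK docs and may differ across SDK versions, and the values here are illustrative rather than the exact production settings:

```python
# Sketch of launching the pod via the Runpod Python SDK (parameter names
# and values are illustrative assumptions, not the exact original script).
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # assumption: key set inline, not via env

pod = runpod.create_pod(
    name="vllm-llama-70b",
    image_name="vllm/vllm-openai:latest",
    gpu_type_id="NVIDIA A40",
    gpu_count=4,
    ports="8000/http",
    docker_args=(
        "--host 0.0.0.0 --port 8000 "
        "--model meta-llama/Llama-3.3-70B-Instruct --dtype bfloat16 "
        "--enforce-eager --gpu-memory-utilization 0.95 "
        "--max-model-len 16256 --tensor-parallel-size 4"
    ),
)
print(pod["id"])
```

Here are the logs: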
