Too many failed requests
Hello. I've tried to run casperhansen/mixtral-instruct-awq (https://huggingface.co/casperhansen/mixtral-instruct-awq) on A100 80 GB and A100 SXM 80 GB GPUs, sending 10 requests per second using vLLM's benchmark script: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py.
However, most of the requests failed with an "Aborted request" log from vLLM. This issue didn't occur on another platform with the same GPU and the same code, so I'm not sure whether the problem is with vLLM or with RunPod's internal processing. Could anyone provide guidance on what the cause might be?
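For reference, the failing runs were launched roughly like this (a sketch, not the exact command from the post; flag names come from vLLM's benchmark_serving.py and can differ between versions):

```python
# Rough reproduction of the benchmark run described above (a sketch).
# Flag names follow vLLM's benchmarks/benchmark_serving.py; dataset
# arguments are omitted here because they vary by vLLM version.
import subprocess

subprocess.run([
    "python", "benchmarks/benchmark_serving.py",
    "--backend", "vllm",
    "--model", "casperhansen/mixtral-instruct-awq",
    "--request-rate", "10",  # 10 requests per second, as in the report
], check=True)
```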
Solution
Why are you using GPU Cloud for this? If you want to handle many concurrent requests, you need to use Serverless, not GPU Cloud. Check out worker-vllm:
https://github.com/runpod-workers/worker-vllm

worker-vllm is the RunPod worker template for serving large language model endpoints, powered by vLLM.
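Once a worker-vllm endpoint is deployed, requests go through RunPod's Serverless queue and workers scale with load instead of all requests hitting one pod directly. A minimal sketch of calling such an endpoint (the endpoint ID is a placeholder, and the exact "input" schema depends on your worker-vllm version, so check the repo's README):

```python
# Minimal sketch of calling a deployed worker-vllm Serverless endpoint.
# ENDPOINT_ID is a placeholder; the "input" payload shown here is an
# assumption -- see the worker-vllm README for the exact schema.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, Mixtral!"}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```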
