Too many failed requests
Hello. I've tried to run casperhansen/mixtral-instruct-awq (https://huggingface.co/casperhansen/mixtral-instruct-awq) on A100 80 GB and A100 SXM 80 GB GPUs, sending 10 requests per second using vLLM's benchmark script: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py.
However, most of the requests failed with an "Aborted request" log from vLLM. This issue didn't occur on another platform with the same GPU and the same code, so I'm not sure whether the problem is with vLLM or with RunPod's internal processing. Could anyone provide guidance on what the cause might be?
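For reference, the failing runs were launched roughly like this (a sketch, not the exact command from the post; flag names come from vLLM's benchmark_serving.py and can differ between versions):

```python
# Rough reproduction of the benchmark run described above (a sketch).
# Flag names follow vLLM's benchmarks/benchmark_serving.py; dataset
# arguments are omitted here because they vary by vLLM version.
import subprocess

subprocess.run([
    "python", "benchmarks/benchmark_serving.py",
    "--backend", "vllm",
    "--model", "casperhansen/mixtral-instruct-awq",
    "--request-rate", "10",  # 10 requests per second, as in the report
], check=True)
```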
Solution
Why are you using GPU Cloud for this? If you want to handle many concurrent requests, you need to use Serverless, not GPU Cloud. Check out worker-vllm:
https://github.com/runpod-workers/worker-vllm

worker-vllm is the RunPod worker template for serving large language model endpoints, powered by vLLM.
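Once a worker-vllm endpoint is deployed, requests go through RunPod's Serverless queue and workers scale with load instead of all requests hitting one pod directly. A minimal sketch of calling such an endpoint (the endpoint ID is a placeholder, and the exact "input" schema depends on your worker-vllm version, so check the repo's README):

```python
# Minimal sketch of calling a deployed worker-vllm Serverless endpoint.
# ENDPOINT_ID is a placeholder; the "input" payload shown here is an
# assumption -- see the worker-vllm README for the exact schema.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, Mixtral!"}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```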
