Setup: deploying the Qwen/Qwen2-7B model on a serverless endpoint
GPU: NVIDIA A40 48 GB
Environment variables:
MODEL_NAME=Qwen/Qwen2-7B
HF_TOKEN=xxx
ENABLE_LORA=True
LORA_MODULES={"name": "cn_writer", "path": "{huggingface_model_name}", "base_model_name": "Qwen/Qwen2-7B"}
MAX_LORA_RANK=64
MIN_BATCH_SIZE=384
ENABLE_PREFIX_CACHING=1
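For reference, assuming the serverless worker is a wrapper around vLLM that forwards these variables, the equivalent standalone vLLM launch would look roughly like the following (the LoRA adapter path is a placeholder taken from the config above, not a real repo):

```shell
# Sketch of the equivalent vLLM server launch, assuming a vLLM-based worker.
# HF_TOKEN authenticates against the Hugging Face Hub for gated/private weights.
export HF_TOKEN=xxx

vllm serve Qwen/Qwen2-7B \
  --enable-lora \
  --lora-modules cn_writer={huggingface_model_name} \
  --max-lora-rank 64 \
  --enable-prefix-caching
```

This is a config fragment for illustration only; the actual flags accepted depend on the serverless worker's version and how it maps environment variables to vLLM arguments.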
My problem:
Batch processing takes too long: a batch takes 3-4 times as long as a single request. How can I reduce the batch processing time?
My code is in the attachment
Observed behavior:
Processing a batch of 64 requests takes about 4 times as long as a single request.
What I expect: the time for a batch of 64 requests should be close to the time for a single request.
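Since the attached code is not visible here, one common cause worth checking: if the 64 requests are sent one after another, the server cannot pack them into the same continuous-batching step, and total time grows linearly. The sketch below (all names are hypothetical; `fake_request` stands in for an HTTP call to the endpoint, e.g. via `aiohttp` or the `openai` client) contrasts sequential submission with submitting all 64 requests concurrently so the server can batch them:

```python
import asyncio
import time

# Hypothetical per-request latency; stands in for one round trip
# to the serverless endpoint.
REQUEST_LATENCY_S = 0.1

async def fake_request(prompt: str) -> str:
    # Placeholder for: POST the prompt to the endpoint and await the reply.
    await asyncio.sleep(REQUEST_LATENCY_S)
    return f"completion for {prompt!r}"

async def run_sequential(prompts):
    # One request at a time: total wall time grows linearly with batch size.
    return [await fake_request(p) for p in prompts]

async def run_concurrent(prompts):
    # All requests in flight at once: a continuous-batching server can pack
    # them into the same forward passes, so wall time stays near one call.
    return await asyncio.gather(*(fake_request(p) for p in prompts))

def timed(coro):
    # Run a coroutine to completion and report its wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(64)]
    _, t_seq = timed(run_sequential(prompts))
    _, t_conc = timed(run_concurrent(prompts))
    print(f"sequential: {t_seq:.2f}s  concurrent: {t_conc:.2f}s")
```

With real requests the concurrent version will not be quite as fast as a single call (decode steps share GPU time across the batch), but it should be far closer to 1x than to 64x, provided the client issues all requests before awaiting any response.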