Not getting hundreds of req/sec when serving Llama 3 70B with the default vLLM serverless template