Runpod · 12mo ago
octopus

Settings to reduce delay time using sglang for 4bit quantized models?

I'm deploying a 4-bit AWQ quantized model: casperhansen/llama-3.3-70b-instruct-awq
The delay time for parallel requests increases exponentially when using the sglang template. What settings do I need to use to keep the delay time manageable?