Settings to reduce delay time using sglang for 4bit quantized models?

I'm deploying 4bit AWQ quantized model: casperhansen/llama-3.3-70b-instruct-awq
The delay time for parallel requests increases exponentially when using tsglang template. What settings I need to use to make sure the delay time is manageable?

Continue the conversation

Join the Discord to ask follow-up questions and connect with the community

Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

21,906 Members

Join

Settings to reduce delay time using sglang for 4bit quantized models?

Settings to reduce delay time using sglang for 4bit quantized models?

Continue the conversation

Runpod

Continue the conversation

Runpod

Similar Threads