What TTFT times should we be able to reach?

Of course this depends on input token count, hardware selection, etc. But for the life of me, I cannot get a TTFT (time to first token) under 2000 ms on serverless.
I'm using Llama 3.1 7B / Gemma / Mistral on 48 GB GPU workers.

For performance evaluation I use GuideLLM, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see a TTFT of 2000-2500 ms.

I should add that I'm running GuideLLM from a local Python script against the serverless endpoint, so the numbers include the network round trip. Has anyone observed quicker times?
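For anyone who wants to compare numbers, here is a minimal sketch of what a client-side TTFT measurement amounts to, assuming the endpoint streams its output chunk by chunk. The names `measure_ttft` and `fake_stream` are hypothetical and not part of GuideLLM; `fake_stream` just stands in for a real streaming request. Measured this way, TTFT covers everything between sending the request and seeing the first non-empty chunk, so network latency and any queueing on the worker are counted along with the model's prefill time.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Time one streamed request; return (ttft_ms, total_ms, n_chunks).

    start_stream is a hypothetical callable that sends the request and
    returns an iterable of text chunks as they arrive.
    """
    t0 = clock()  # the clock starts just before the request is sent
    ttft_ms: Optional[float] = None
    n_chunks = 0
    for chunk in start_stream():
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttft_ms is None:
            # first non-empty chunk: this is the time to first token
            ttft_ms = (clock() - t0) * 1000.0
        n_chunks += 1
    total_ms = (clock() - t0) * 1000.0
    return ttft_ms, total_ms, n_chunks


def fake_stream(delay_s: float = 0.05, n_tokens: int = 5) -> Iterator[str]:
    """Stand-in for a real endpoint: waits, then yields a few tokens."""
    time.sleep(delay_s)  # simulated network + queue + prefill delay
    for i in range(n_tokens):
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream)
    print(f"TTFT: {ttft:.0f} ms, total: {total:.0f} ms, chunks: {n}")
```

Running the same measurement from a machine close to the endpoint's region would be one way to tell how much of the 2000 ms is network and how much is the worker itself.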
