We're hitting a wall with high Time-to-First-Token (TTFT) latency in one of our streaming applications built on Workers AI. Looking for insight into whether this is expected for the model we're using.
The Problem: Our application experiences a delay of roughly 4.0 seconds before the LLM starts streaming any output.
Our Setup:
- Model: @cf/meta/llama-3.1-8b-instruct-fp8
- Worker Region: MIA (Miami)
- Prompt Size: approx. 1,144 input characters (RAG context)

Worker Performance:
- Total Wall Time: 4,474 ms
- Worker CPU Time: 16 ms

Conclusion: almost all of the time is spent waiting on the model to begin generation, not in our code.
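For context, this is roughly how we're isolating TTFT: a small helper (hypothetical, not from our production code) that times how long it takes the first chunk to arrive on any ReadableStream, such as the stream Workers AI returns when called with `stream: true`.

```javascript
// Minimal sketch: measure Time-to-First-Token by timing the first chunk
// of a streaming response. Works on any ReadableStream, e.g. the stream
// returned by env.AI.run(model, { prompt, stream: true }) in a Worker.
async function measureTTFT(stream) {
  const start = Date.now();
  const reader = stream.getReader();
  const { value, done } = await reader.read(); // blocks until the first chunk
  const ttftMs = Date.now() - start;
  reader.releaseLock();
  return { ttftMs, firstChunk: value, done };
}
```

In the Worker this would be called with something like `measureTTFT(await env.AI.run('@cf/meta/llama-3.1-8b-instruct-fp8', { prompt, stream: true }))`; since Worker CPU Time is only 16 ms, essentially all of the measured TTFT is model-side wait.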
Questions: Is a 4.0+ second TTFT common for the llama-3.1-8b-instruct model at this prompt length? Are there specific provisioning or configuration tips for reducing it, or is the *-fast variant the recommended way to bring TTFT down?
Any advice on further optimizing TTFT on Workers AI is appreciated!