High TTFT when streaming from Workers AI (llama-3.1-8b-instruct-fp8)
We're hitting a wall with high Time-to-First-Token (TTFT) latency on one of our streaming applications using Workers AI, and we'd appreciate some insight into whether this is expected for the model we're using.

The Problem:
Our application sees an approximately 4-second delay before the LLM starts streaming any output.

Our Setup:
Model: @cf/meta/llama-3.1-8b-instruct-fp8
Worker Region: MIA (Miami)
Prompt Size: approx. 1,144 input characters (RAG context)

Worker Performance:
Total Wall Time: 4,474 ms
Worker CPU Time: 16 ms
(Conclusion: Almost all time is spent waiting on the model to start generation, not our code.)
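For context, here is a minimal sketch of the streaming path and how the TTFT can be timed inside the Worker. The `AI` binding name and the placeholder prompt are illustrative, not our exact code:

```ts
// Minimal sketch: stream from Workers AI and log time-to-first-chunk (TTFT).
// Assumptions: an AI binding named `AI` in wrangler.toml, and a placeholder
// prompt standing in for our real ~1,100-character RAG context.
export interface Env {
  AI: Ai; // type from @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const start = Date.now();

    // With stream: true, run() resolves to a ReadableStream of SSE-encoded bytes.
    const stream = (await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fp8", {
      messages: [{ role: "user", content: "placeholder for the ~1,100-character RAG prompt" }],
      stream: true,
    })) as ReadableStream;

    // Tee the stream: one branch goes to the client, the other lets us log
    // when the first chunk arrives (our TTFT) without delaying the response.
    const [toClient, toTimer] = stream.tee();
    ctx.waitUntil(
      (async () => {
        const reader = toTimer.getReader();
        await reader.read(); // first chunk = first token(s) from the model
        console.log(`TTFT: ${Date.now() - start} ms`);
        await reader.cancel();
      })()
    );

    return new Response(toClient, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```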

Questions:
Is a 4.0+ second TTFT common for the llama-3.1-8b-instruct model at this prompt length?
Are there specific provisioning tips, or is switching to the *-fast variant the main recommendation for reducing this TTFT? (The A/B comparison we plan to run is sketched below.)
Any advice on further optimizing TTFT on Workers AI is appreciated!
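To answer the *-fast question for ourselves, this is the kind of A/B harness we're planning to run. The second model ID is an assumption about how the fast variant is named in the catalog, and the prompt is again a placeholder:

```ts
// Sketch of an A/B comparison of TTFT across the two model variants.
// The *-fast model ID below is an assumption about the catalog naming,
// and the prompt is a placeholder for our real RAG context.
export interface Env {
  AI: Ai;
}

const MODELS = [
  "@cf/meta/llama-3.1-8b-instruct-fp8",
  "@cf/meta/llama-3.1-8b-instruct-fast", // assumed ID for the *-fast variant
] as const;

async function measureTtft(env: Env, model: string, prompt: string): Promise<number> {
  const start = Date.now();
  const stream = (await env.AI.run(model as any, { // loosen typing for the sketch
    messages: [{ role: "user", content: prompt }],
    stream: true,
  })) as ReadableStream;

  // TTFT = time until the first streamed chunk arrives.
  const reader = stream.getReader();
  await reader.read();
  const ttft = Date.now() - start;
  await reader.cancel(); // only the first chunk matters for this measurement
  return ttft;
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const prompt = "placeholder for the ~1,100-character RAG prompt";
    const results: Record<string, number> = {};
    for (const model of MODELS) {
      results[model] = await measureTtft(env, model, prompt);
    }
    return Response.json(results);
  },
};
```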