We're hitting a wall with high Time-to-First-Token (TTFT) latency in one of our streaming applications built on Workers AI. Looking for insight into whether this is expected for the model we're using.
The Problem: Our application experiences a delay of roughly 4.0 seconds before the LLM starts streaming any output.
Our Setup:
- Model: @cf/meta/llama-3.1-8b-instruct-fp8
- Worker Region: MIA (Miami)
- Prompt Size: approx. 1,144 input characters (RAG context)

Worker Performance:
- Total Wall Time: 4,474 ms
- Worker CPU Time: 16 ms

Conclusion: almost all of the time is spent waiting on the model to begin generation, not in our code.
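For context, this is roughly how we're isolating TTFT: a small helper (hypothetical, not from our production code) that times how long it takes the first chunk to arrive on any ReadableStream, such as the stream Workers AI returns when called with `stream: true`.

```javascript
// Minimal sketch: measure Time-to-First-Token by timing the first chunk
// of a streaming response. Works on any ReadableStream, e.g. the stream
// returned by env.AI.run(model, { prompt, stream: true }) in a Worker.
async function measureTTFT(stream) {
  const start = Date.now();
  const reader = stream.getReader();
  const { value, done } = await reader.read(); // blocks until the first chunk
  const ttftMs = Date.now() - start;
  reader.releaseLock();
  return { ttftMs, firstChunk: value, done };
}
```

In the Worker this would be called with something like `measureTTFT(await env.AI.run('@cf/meta/llama-3.1-8b-instruct-fp8', { prompt, stream: true }))`; since Worker CPU Time is only 16 ms, essentially all of the measured TTFT is model-side wait.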
Questions: Is a 4.0+ second TTFT common for the llama-3.1-8b-instruct model at this prompt length? Are there specific provisioning or configuration tips for reducing it, or is the *-fast variant the recommended way to bring TTFT down?
Any advice on further optimizing TTFT on Workers AI is appreciated!