How is LLM streaming done for, say, 100-200 concurrent users so that IO is not blocked on every request sent to the LLM?
Is there a worker/queue pool where the workers handle requests to the LLM asynchronously, or something else? I'm kind of doing this myself right now and I'm curious how it's done at scale so that IO doesn't get blocked while streaming chunks from the LLM to the UI. Also, what's the most efficient way to send the chunks from the LLM to the UI? I'm currently using Redis to stream them (rough sketch below).
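To make the question concrete, here's roughly the shape of what I have now. This is a simplified sketch, not my actual code: FastAPI, redis-py's asyncio client, SSE, and an OpenAI-compatible streaming client are just assumptions for illustration. A background task streams tokens from the LLM into a Redis pub/sub channel, and the HTTP endpoint relays them to the browser.

```python
# Simplified sketch: one async producer task per request publishes LLM
# chunks to a Redis pub/sub channel; an SSE endpoint relays them to the UI.
# Assumes FastAPI, redis-py's asyncio client, and an OpenAI-compatible API.
import asyncio
import uuid

import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
r = redis.Redis()
llm = AsyncOpenAI()

END = "[DONE]"  # sentinel marking the end of a stream


async def produce(channel: str, prompt: str) -> None:
    """Stream chunks from the LLM and publish each one to Redis."""
    stream = await llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        if delta:
            await r.publish(channel, delta)
    await r.publish(channel, END)


@app.get("/chat")
async def chat(prompt: str) -> StreamingResponse:
    """Kick off the LLM call and relay chunks to the UI as SSE."""
    channel = f"chat:{uuid.uuid4()}"
    pubsub = r.pubsub()
    await pubsub.subscribe(channel)
    # Don't await the producer: it runs concurrently on the event loop,
    # so this request's IO never blocks other requests.
    asyncio.create_task(produce(channel, prompt))

    async def relay():
        async for msg in pubsub.listen():
            if msg["type"] != "message":
                continue
            text = msg["data"].decode()
            if text == END:
                break
            yield f"data: {text}\n\n"
        await pubsub.unsubscribe(channel)

    return StreamingResponse(relay(), media_type="text/event-stream")
```

Since the LLM call is just network IO, everything can sit on one event loop, but I'm not sure whether the Redis hop is worth it versus yielding straight from the LLM stream to the response, or whether a dedicated worker/queue layer makes more sense at this scale.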