How is LLM streaming done at scale for users?

How is LLM streaming handled for, say, 100-200 concurrent users so that IO isn't blocked on every request sent to the LLM? Is there a worker/queue pool where the workers handle LLM requests asynchronously, or something else? I'm building something like this myself and I'm curious how it's done at scale so that IO doesn't block while streaming chunks from the LLM to the UI. Also, what's the most efficient way to send the chunks from the LLM to the UI? I'm currently using Redis to stream them.
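For reference, the piece I already have looks roughly like this minimal sketch: as chunks come back from the LLM they get appended to a Redis stream keyed by conversation. The llm_stream() helper and the key names are just placeholders, not any particular library's API.

```python
# minimal sketch of "using Redis to stream them": relay LLM chunks into a Redis stream
# llm_stream() and the key names are placeholders, not a specific client API
import redis

r = redis.Redis(decode_responses=True)

def llm_stream(prompt: str):
    # stand-in for whatever client yields tokens/chunks from the LLM
    yield from ("Hello", ", ", "world")

def relay(conversation_id: str, prompt: str) -> None:
    for chunk in llm_stream(prompt):
        # append each chunk so a separate SSE/websocket route can read it back out
        r.xadd(f"chat:{conversation_id}", {"chunk": chunk})
    # mark the end of the stream so the reader knows when to stop
    r.xadd(f"chat:{conversation_id}", {"done": "1"})
```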
Temujin in the post unwrap era
Are we overthinking this? Have you tried websockets, either with a plain websocket package or socket.io? As for at-scale solutions, that depends on the system itself: how a company hosts and structures its apps determines a lot of it. Some use HTTP streaming, some go for websockets. Some may have a Kafka / SQS / Redis event stream that takes in events from an agent invocation and then emits the event to the right socket / connection. But serving 100-1000 users doesn't require that much infra overhead.
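The "emit to the right connection" part can be as small as this rough sketch. FastAPI websockets plus redis.asyncio pub/sub are picked just as an example here, and the channel naming is made up:

```python
# rough sketch: each websocket connection subscribes to its own channel,
# and whatever worker calls the LLM publishes chunks to that channel
import redis.asyncio as redis
from fastapi import FastAPI, WebSocket

app = FastAPI()
r = redis.Redis(decode_responses=True)

@app.websocket("/ws/{conversation_id}")
async def ws_endpoint(websocket: WebSocket, conversation_id: str):
    await websocket.accept()
    pubsub = r.pubsub()
    # the worker publishes each chunk to this per-conversation channel
    await pubsub.subscribe(f"chat:{conversation_id}")
    async for message in pubsub.listen():
        if message["type"] == "message":
            await websocket.send_text(message["data"])
```

Swap the pub/sub for a Kafka consumer or a Redis stream reader and the shape stays the same.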
slipsec (2mo ago)
Fall back to websockets if the browser doesn't support SSE
rre (OP, 2mo ago)
I'm under the impression that WS is just extra bloat; SSE is the way to go. I guess I'm asking if the flow is this:
- send the user prompt
- the backend gets it, puts it into a queue (?), and the same route opens up SSE
- the SSE handler subscribes to the event stream
- chunks come back ASAP from the event stream, depending on the worker situation

How the LLM orchestration is handled, and the different ways to do it, is what I'm most curious about. But I guess you explained it already: Kafka / SQS / Redis stream.
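In code, the flow I'm picturing is roughly this sketch: enqueue the job, open SSE on the same route, and relay chunks as the worker appends them. FastAPI plus redis.asyncio are only used as an example, and the stream/key names are illustrative, not anyone's actual setup:

```python
# sketch of the flow: enqueue the prompt, then stream chunks back over SSE
import uuid
import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
r = redis.Redis(decode_responses=True)

@app.post("/chat")
async def chat(prompt: str):
    job_id = str(uuid.uuid4())
    # enqueue the prompt for the worker pool to pick up
    await r.xadd("jobs", {"job_id": job_id, "prompt": prompt})

    async def event_stream():
        last_id = "0"
        while True:
            # block until the worker appends new chunks for this job
            resp = await r.xread({f"chat:{job_id}": last_id}, block=15000, count=10)
            if not resp:
                break  # timed out waiting for the worker; give up
            for _, entries in resp:
                for entry_id, fields in entries:
                    last_id = entry_id
                    if fields.get("done"):
                        return
                    yield f"data: {fields['chunk']}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Because everything awaits on the event loop, one process can hold many of these SSE responses open without blocking on any single LLM call.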
Temujin in the post unwrap era
Yes, something like that, or a combination of it. Other architectural decisions also shape how the messages actually get processed. For example, teams that choose REST over Lambda may not be able to do SSE, so they'd use other strategies. But I assume most of the big players have a more event-driven architecture to solve the "receive requests and stream to users" problem.
rre (OP, 2mo ago)
I have it all in a container at the moment, not using functions (Lambdas in AWS), and everything is in Azure. But yeah, I was just wondering how this is done (streaming to the UI) without blocking IO in enterprise-grade solutions; they must use Kafka etc. I'm actually using Redis streams now and have custom Python workers to do the jobs.
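Concretely, the workers look something like this rough sketch: a consumer group on a "jobs" Redis stream, where each worker claims one job at a time, streams the LLM output back, then acks. The stream/group names and llm_stream() are placeholders, not my exact code:

```python
# rough sketch of a worker in the pool: claim a job from the consumer group,
# relay the LLM chunks to the per-job stream, then acknowledge the job
import redis

r = redis.Redis(decode_responses=True)

def llm_stream(prompt: str):
    yield from ("partial ", "output")  # stand-in for the real streaming LLM call

def run_worker(worker_name: str) -> None:
    try:
        r.xgroup_create("jobs", "llm-workers", id="$", mkstream=True)
    except redis.ResponseError:
        pass  # consumer group already exists

    while True:
        # block up to 5s waiting for a new, unclaimed job
        resp = r.xreadgroup("llm-workers", worker_name, {"jobs": ">"}, count=1, block=5000)
        for _, entries in resp or []:
            for entry_id, job in entries:
                for chunk in llm_stream(job["prompt"]):
                    r.xadd(f"chat:{job['job_id']}", {"chunk": chunk})
                r.xadd(f"chat:{job['job_id']}", {"done": "1"})
                r.xack("jobs", "llm-workers", entry_id)
```

Running several of these processes gives you the worker pool, and the SSE route only ever reads from Redis, so it never blocks on the LLM itself.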
