How do LLM applications scale?
Sorry for sounding like a layman; the thing is, providers like OpenAI have something known as TPM (tokens per minute) limits. So how do people build applications around them when multiple users are using the app simultaneously?
Let's say I'm creating a feature like Deep Research, and due to the TPM limit I can only run one research process at a time. If I exceed the limit it throws a TPM error, which makes it nearly impossible for me to even test a pilot of my application.
Thanks!
34 Replies
they use the api https://platform.openai.com/docs/overview
Most providers have tier scaling, where the TPM limit increases at higher tiers
e.g. OpenAI tier 5 has 40,000,000 TPM for GPT-5 models via the API. 40M tokens per minute is a lot; in my experience, even with user concurrency you won't easily hit the TPM limits
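If you're stuck on a lower tier, you can also throttle yourself client-side so requests wait for the window to clear instead of getting rejected by the API. A minimal sketch of a tokens-per-minute budget tracker; the limit value and the per-request token estimate are just placeholders you'd tune to your actual account limits:
```python
import time
import threading
from collections import deque

class TPMThrottle:
    """Blocks callers until there is room under a tokens-per-minute budget."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window = deque()          # (timestamp, tokens) pairs from the last 60s
        self.lock = threading.Lock()

    def acquire(self, estimated_tokens: int) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop spend that is older than one minute.
                while self.window and now - self.window[0][0] > 60:
                    self.window.popleft()
                used = sum(t for _, t in self.window)
                if used + estimated_tokens <= self.limit:
                    self.window.append((now, estimated_tokens))
                    return
            time.sleep(1)  # Wait for the window to free up, then re-check.

# Usage (numbers are illustrative, check your own tier's limit):
# throttle = TPMThrottle(450_000)
# throttle.acquire(estimated_tokens=3_000)  # call this before each API request
```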
Which TPM limit do you have, and what concurrency are you dealing with?
Recently I've put together this architecture, which is fully scalable; take a look and see if it suits you

Yeah, so for now I'll just use multiple LLM providers instead of one, and that'll be my monkey patch.
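That multi-provider monkey patch is basically a fallback chain: try one provider, and on a rate-limit error move to the next. A rough sketch, assuming each provider is wrapped in a callable you write yourself (the wrapper names in the usage comment are hypothetical):
```python
import time

class RateLimited(Exception):
    """Raised by a provider wrapper when it hits its TPM/RPM limit."""

def call_with_fallback(providers, prompt, retries_per_provider=2):
    """Try each provider in order; move to the next one if it keeps rate-limiting."""
    last_error = None
    for call in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except RateLimited as err:
                last_error = err
                time.sleep(2 ** attempt)  # brief backoff before retrying this provider
    raise last_error or RuntimeError("no providers configured")

# Usage (call_openai / call_anthropic are hypothetical wrappers you'd write):
# answer = call_with_fallback([call_openai, call_anthropic], "summarize this repo")
```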
~ my current OpenAI tier --> 2
thanks for this; quite (very very) overwhelming. but definitely saved it
I need a job


here are some rough things we came up with lol
this thing will definitely not scale even for 10 concurrent users
yeah, so you are doing a basic RAG
and feeding it into an MCP server with agents
aahh kinda, you could say that
now the thing is, we're hitting the TPM limits of the LLM providers if we try to run like 2-3 evaluations together
and that's even when the agents work one by one
I think you're lacking an MCP server
but I could be wrong tho, I'm still learning this
here's how our current system works:
Cloudflare Pages (Next.js frontend) --> EC2 (Hono backend) --> EC2 (FastAPI server ~ does the RAG work)
High chance that could be the case; I'm fairly new to this too
I see, so this is horrible
first thing, you need to change the FastAPI server to something like Go or Rust
and also use queues
or change the backend to Go and then connect to FastAPI with a queue
if you link services together without a queue, it will never scale unless you get a gigantic computer
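The queue idea in practice: the backend only enqueues a job and returns, while a separate pool of workers drains the queue at whatever rate your TPM budget allows; you scale by running more workers. A minimal sketch with Redis as the broker; the queue name, payload shape, and run_research placeholder are assumptions:
```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
QUEUE = "research_jobs"  # hypothetical queue name

def enqueue_job(user_id: str, query: str) -> None:
    """Called by the API layer: accept the request and return immediately."""
    r.lpush(QUEUE, json.dumps({"user_id": user_id, "query": query}))

def worker_loop() -> None:
    """Run one or more of these processes; each pops jobs as capacity allows."""
    while True:
        _, raw = r.brpop(QUEUE)          # blocks until a job is available
        job = json.loads(raw)
        run_research(job["user_id"], job["query"])

def run_research(user_id: str, query: str) -> None:
    ...  # placeholder for your existing RAG/agent pipeline
```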
for agent orchestration we're using this: https://github.com/crewAIInc/crewAI
I doubt it's a Python-only lib
bro, you need logs and a smoke test
try adding logs to every piece of code you've got and then use performance 4 all to test
and you will see the pain points of your current system
and then you solve them one by one
and retest until it's flawless
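For the "add logs to every piece of code" step, a per-request timing middleware on the FastAPI server is a cheap first pass before reaching for a load-testing tool. A minimal sketch:
```python
import logging
import time

from fastapi import FastAPI, Request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag-service")

app = FastAPI()

@app.middleware("http")
async def log_timing(request: Request, call_next):
    """Log method, path, status code, and wall-clock latency for every request."""
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info("%s %s -> %s in %.1f ms",
                request.method, request.url.path, response.status_code, elapsed_ms)
    return response
```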
oh actually p4all is proprietary software, so I'm looking for an alternative for you
what do you think
hmm, already working on that part. But these LLM agents make my life hell; sometimes they work flawlessly and the next moment they mess up tool calls
I see, so you mean they mess up the arguments and you get a type error in a call?
Sometimes when I shift to a mini model, it ignores previous instructions. Like with the semantic search tool: it passes paragraphs of text as the input param, ignoring all the restrictions on input. Then on retry it realizes "ohh, I gave the tool incorrect input, let's fix that"
ok so you should try
a top-k distance check: get the distance between an ideal vectorized output and your actual output, and do a retry if the distance is greater than some value
that's called rerank, or re-retry
basically it's a hallucination problem, right?
wasn't reranking already done in your vectors by adding that KNN algo?
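The retry-on-distance idea sketched out: embed the model's output, compare it against an embedding of what a good output should look like, and retry when the cosine distance is too large. This is only a toy sketch; generate(), embed(), and the 0.35 threshold are all assumptions you'd swap for your own pieces:
```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def generate_with_check(generate, embed, reference_text, threshold=0.35, max_retries=3):
    """Call generate(); retry if the output drifts too far from the reference embedding."""
    ref_vec = embed(reference_text)
    for _ in range(max_retries):
        output = generate()
        if cosine_distance(embed(output), ref_vec) <= threshold:
            return output
    return output  # give up and return the last attempt

# generate / embed are hypothetical callables: generate() hits your model,
# embed(text) returns a vector from whatever embedding model you already use.
```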
yeah I guess but there are many ways you can do it
you're right; I need to make the models more deterministic and play around a bit with the top-k/top-p values
yeah that sounds fun xD
I mean you can do whatever you want, even using something like Zod to make the type checks
as long as it works
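Zod is the TypeScript-side option; since the agent/RAG side here is Python, the same idea with Pydantic would be to validate the tool arguments before running the tool and hand the validation error back to the agent for its retry. A sketch, with a made-up semantic-search schema:
```python
from pydantic import BaseModel, ValidationError, field_validator

class SemanticSearchArgs(BaseModel):
    """Schema for a (hypothetical) semantic search tool's arguments."""
    query: str
    top_k: int = 5

    @field_validator("query")
    @classmethod
    def query_must_be_short(cls, v: str) -> str:
        if len(v) > 300:
            raise ValueError("query must be a short phrase, not a paragraph")
        return v

def run_semantic_search(raw_args: dict) -> str:
    try:
        args = SemanticSearchArgs(**raw_args)
    except ValidationError as err:
        # Return the error to the agent so its retry can fix the arguments.
        return f"Invalid tool input: {err}"
    return do_search(args.query, args.top_k)

def do_search(query: str, top_k: int) -> str:
    ...  # placeholder for your existing retrieval call
```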
tbh that was one reason I didn't use LangChain/LangGraph this time.
I remember type-checking DeepSeek's JSON outputs, and it was truly a nightmare
all of this is new to everyone, so the bad news is that you're actually on your own
and the good news is that no one knows what we are doing
so you basically have to be creative and try different approaches until you're satisfied
I'd like to help but I don't have a paid model yet, so I'm trying to get a job so I can afford that
but I'm horrible at job interviews
some of the best lines I've heard on the internet today.
one thing I know: if you have specialized agents, you must have MCP
you seem like a fun dude; wish I could offer you a job if I weren't a bootstrapped college guy, and that too from a country with the lowest wages lol
hue br
gotta look into this;
the thing is, we're doing other work besides just the agents in the FastAPI server too
like ingesting the GitHub repo, chunking & embedding it into a vectorstore
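That ingest work (pull the repo, chunk, embed, store) can be treated as just another queued job so it stays out of the request path. A rough sketch of the chunk-and-embed part; the chunk size, overlap, file glob, and the embed/store callables are all assumptions:
```python
from pathlib import Path

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap; a starting point, not a tuned strategy."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def ingest_repo(repo_dir: str, embed, store) -> None:
    """Walk a cloned repo, chunk each source file, embed, and store the vectors."""
    for path in Path(repo_dir).rglob("*.py"):   # extend the glob to other file types
        text = path.read_text(errors="ignore")
        for i, chunk in enumerate(chunk_text(text)):
            vector = embed(chunk)                              # hypothetical embedding call
            store(id=f"{path}:{i}", vector=vector, text=chunk) # hypothetical vectorstore call
```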
no idea if all of that can be wrapped inside an MCP server
no, the MCP server is to share context between agents
that way you save memory, and it will help with hallucination
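To make the shared-context point concrete: instead of every agent carrying the whole history in its prompt, agents write to and read from a shared store and pull only what they need. This is just a toy sketch of the idea, not the MCP protocol itself:
```python
class SharedContext:
    """In-memory shared context; in production this would sit behind a service (e.g. an MCP server)."""

    def __init__(self):
        self._notes: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._notes[key] = value

    def get(self, keys: list[str]) -> str:
        """Return only the requested entries, so each agent's prompt stays small."""
        return "\n".join(f"{k}: {self._notes[k]}" for k in keys if k in self._notes)

# ctx = SharedContext()
# ctx.put("repo_summary", summary_from_ingest_agent)          # one agent writes
# prompt = f"{ctx.get(['repo_summary'])}\n\nTask: evaluate the test coverage."  # another reads
```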
I believe you need to separate your backend into microservices and have an event source to call them
this way you can have multiple instances
that will also help
basically the sketch I sent you earlier is about that: how to scale microservices
and then the hallucination part is something else
okay now I get what you were saying and your sketch starts to make sense
this shared context thing will also bring down the overall token consumption drastically, won't it?
yeah
that's the purpose
saving more space for the agent to ~think~, and that will also help with hallucination
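One concrete way to leave that room is to trim whatever context you pass in down to a fixed token budget before each call. A sketch using tiktoken for counting; the 4000-token budget is just illustrative:
```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_to_budget(context_blocks: list[str], budget_tokens: int = 4000) -> str:
    """Keep the most recent context blocks that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for block in reversed(context_blocks):        # newest blocks first
        n = len(enc.encode(block))
        if used + n > budget_tokens:
            break
        kept.append(block)
        used += n
    return "\n\n".join(reversed(kept))
```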
roger that, will update IF I get any success
sure thing bro, I will be trying something similar soon
also if I can be of assistance, don't hesitate to ask