How do LLM applications scale?

Sorry for sounding like a layman. The thing is, providers like OpenAI have something known as TPM (tokens per minute) limits. So how do people build applications around them when multiple users are using the app simultaneously? Let's say I'm creating a feature like Deep Research, and due to the TPM limit I can only run one research process at a time; if I exceed the limit, it throws a TPM error again, which makes it nearly impossible for me to even test a pilot of my application. Thanks!
34 Replies
peculiarnewbie
peculiarnewbieβ€’3mo ago
OpenAI Platform
Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.
GuillermoB
GuillermoBβ€’3mo ago
Most providers have tier scaling, where TPM increases at higher tiers, e.g. OpenAI tier 5 has 40,000,000 TPM for gpt-5 models via the API. 40M tokens per minute is a lot; in my experience, even with user concurrency you won't easily hit the TPM limits. Which TPM limit do you have, and what concurrency are you dealing with?
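(For illustration: the usual way to cope at lower tiers is to catch the provider's rate-limit error and retry with exponential backoff. A minimal sketch, assuming the openai Python SDK v1; the model name and retry numbers are just placeholders:)
```python
# Minimal sketch: retry on TPM/RPM rate-limit errors with exponential backoff.
# Assumes the openai Python SDK (v1.x); model name and retry counts are examples.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_backoff(messages, model="gpt-4o-mini", max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)  # wait, then retry with a longer delay
            delay *= 2
```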
roger that
roger thatβ€’3mo ago
Recently I've created this architecture, which is fully scalable; take a look and see if it suits you.
[image attachment: architecture sketch]
Arsh
ArshOPβ€’3mo ago
Yeah, so now I'll just use multiple LLM providers instead of one, and that'll be my monkey patch. My current OpenAI tier --> 2 😆. Thanks for this; it's quite (very, very) overwhelming, but I've definitely saved it.
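(For what it's worth, the multi-provider monkey patch usually amounts to trying providers in order and falling back when one is rate-limited. A minimal sketch; the provider wrapper functions are hypothetical stubs, not real SDK calls:)
```python
# Minimal sketch of falling back across providers when one hits its rate limit.
# call_openai / call_anthropic are hypothetical wrappers around each provider's SDK.
class RateLimited(Exception):
    """Raised by a provider wrapper when it hits a TPM/RPM limit."""

def call_openai(prompt: str) -> str:
    raise NotImplementedError  # e.g. wrap openai chat.completions.create(...)

def call_anthropic(prompt: str) -> str:
    raise NotImplementedError  # e.g. wrap anthropic messages.create(...)

PROVIDERS = [call_openai, call_anthropic]

def generate(prompt: str) -> str:
    last_err = None
    for provider in PROVIDERS:
        try:
            return provider(prompt)
        except RateLimited as err:  # each wrapper maps its SDK's rate-limit error to this
            last_err = err
    raise last_err  # every provider was rate limited
```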
roger that
roger thatβ€’3mo ago
I need a job
Arsh
ArshOPβ€’3mo ago
[image attachments: rough system sketches]
Arsh
ArshOPβ€’3mo ago
Here are some rough things we came up with, lol. This thing will definitely not scale even for 10 concurrent users 😭
roger that
roger thatβ€’3mo ago
Yeah, so you're doing basic RAG and feeding it into an MCP server with agents.
Arsh
ArshOPβ€’3mo ago
Ahh, kinda, you could say that. The thing is, we're running out of the LLM providers' TPM limits if we try to run 2-3 evaluations together, and that's with the agents working one by one.
roger that
roger thatβ€’3mo ago
I think you're lacking an MCP server, but I could be wrong; I'm still learning this.
Arsh
ArshOPβ€’3mo ago
Here's how our current system works: Cloudflare Pages (Next.js frontend) --> EC2 (Hono backend) --> EC2 (FastAPI server, which does the RAG work). High chance that could be the case; I'm fairly new to this too.
roger that
roger thatβ€’3mo ago
I see, so this is horrible. First thing: change the FastAPI server to something like Go or Rust and use queues, or change the backend to Go and then connect it to FastAPI with a queue. If you link services together without a queue, it will never scale unless you get a gigantic computer.
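(A minimal sketch of the queue idea, assuming a Redis instance sits between the API layer and the RAG worker; the queue name, job shape, and run_research are made up for illustration:)
```python
# Minimal sketch: decouple the API backend from the FastAPI/RAG worker with a queue.
# Assumes a Redis instance is available; queue name and job shape are made up.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_research_job(user_id: str, query: str) -> None:
    # Producer side (called by the API layer): push the job and return immediately.
    r.lpush("research_jobs", json.dumps({"user_id": user_id, "query": query}))

def worker_loop() -> None:
    # Consumer side (the RAG worker): pull jobs at its own pace, so bursts of
    # requests don't all hit the LLM provider's TPM limit at once.
    while True:
        _, raw = r.brpop("research_jobs")
        job = json.loads(raw)
        run_research(job)  # hypothetical function that does the actual RAG/agent work
```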
Arsh
ArshOPβ€’3mo ago
For agent orchestration we're using this: https://github.com/crewAIInc/crewAI. I doubt it's a Python-only lib.
GitHub
GitHub - crewAIInc/crewAI: Framework for orchestrating role-playing...
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks. - crewAIInc/crewAI
roger that
roger thatβ€’3mo ago
Bro, you need logs and a smoke test. Try adding logs to every piece of code you've got, then use Performance 4 All to test, and you'll see the pain points of your current system; then you solve them one by one and retest until it's flawless. Oh, actually P4All is proprietary software, so I'm looking for an alternative for you. What do you think?
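(One tool-agnostic way to start is plain timing logs around every pipeline step; a minimal sketch, where the decorated function is a hypothetical step:)
```python
# Minimal sketch: time every step and log it, so the slow pieces show up in the logs.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("perf")

def timed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            log.info("%s took %.3fs", fn.__name__, time.perf_counter() - start)
    return wrapper

@timed
def embed_chunks(chunks):  # hypothetical pipeline step to instrument
    ...
```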
Arsh
ArshOPβ€’3mo ago
Hmm, already working on that part. But these LLM agents make my life hell; sometimes they work flawlessly, and the next moment they mess up tool calls.
roger that
roger thatβ€’3mo ago
I see, so you mean they mess up the arguments and you get a type error in a call?
Arsh
ArshOPβ€’3mo ago
Sometimes when I shift to a mini model it ignores previous instructions. Like with the semantic search tool: it gives paragraphs of text as the input param, ignoring all the restrictions on input. Then on retry it realizes "ohh, I gave the tool incorrect input, let's fix that."
roger that
roger thatβ€’3mo ago
OK, so you should try playing with top-p/top-k, get the distance between an ideal vectorized output and your actual output, and do a retry if the distance is greater than some value; that's called rerank, or re-retry. Basically it's a hallucination problem, right?
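(Roughly, the check-and-retry idea could look like this; a minimal sketch where call_llm, embed, and the distance threshold are all assumptions, not anything from the stack above:)
```python
# Minimal sketch of "retry if the output is too far from an ideal vector".
# The embedding/LLM calls and the threshold are placeholders.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generate_with_check(prompt, ideal_vec, max_retries=3, threshold=0.35):
    for _ in range(max_retries):
        output = call_llm(prompt)   # hypothetical LLM call
        vec = embed(output)         # hypothetical embedding call
        if cosine_distance(vec, ideal_vec) <= threshold:
            return output           # close enough to the expected output
    return output                   # give up after max_retries
```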
Arsh
ArshOPβ€’3mo ago
Wasn't reranking already done on your vectors by adding that KNN algo?
roger that
roger thatβ€’3mo ago
yeah I guess but there are many ways you can do it
Arsh
ArshOPβ€’3mo ago
You're right; I need to make the models more deterministic, and play around a bit with the top-k/top-p values.
roger that
roger thatβ€’3mo ago
Yeah, that sounds fun xD. I mean, you can do whatever you want, even use something like Zod for the type checks, as long as it works.
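(On the Python side, Pydantic plays a role similar to Zod; a minimal sketch of validating tool-call arguments before the tool runs, where the schema fields and search function are hypothetical:)
```python
# Minimal sketch: validate tool-call arguments with Pydantic (a Python analogue of Zod)
# before running the tool, and surface a clean error the agent can retry on.
from pydantic import BaseModel, Field, ValidationError

class SemanticSearchArgs(BaseModel):
    query: str = Field(max_length=200)          # reject paragraph-sized inputs
    top_k: int = Field(default=5, ge=1, le=20)  # keep retrieval size bounded

def run_semantic_search(raw_args: dict):
    try:
        args = SemanticSearchArgs(**raw_args)
    except ValidationError as e:
        # Return the validation error to the agent so it can fix its input and retry.
        return {"error": str(e)}
    return search_vectorstore(args.query, args.top_k)  # hypothetical search function
```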
Arsh
ArshOPβ€’3mo ago
Tbh, that was one reason I didn't use LangChain/LangGraph this time. I remember type-checking DeepSeek's JSON outputs, and it was truly a nightmare.
roger that
roger thatβ€’3mo ago
All of this is new to everyone, so the bad news is that you're actually on your own, and the good news is that no one knows what we're doing, so you basically have to be creative and try different approaches until you're satisfied. I'd like to help, but I don't have a paid model yet, so I'm trying to get a job so I can afford one, but I'm horrible at job interviews.
Arsh
ArshOPβ€’3mo ago
Some of the best lines I've heard on the internet today. 😆
roger that
roger thatβ€’3mo ago
One thing I know: if you have specialized agents, you must have MCP.
Arsh
ArshOPβ€’3mo ago
You seem like a fun dude; I wish I could offer you a job, but I'm a bootstrapped college guy 😂, and that too from a country with the lowest wages lol.
roger that
roger thatβ€’3mo ago
hue br
Arsh
ArshOPβ€’3mo ago
Gotta look into this. The thing is, we're doing other work than just agents in the FastAPI server too, like ingesting the GitHub repo, then chunking and embedding it into a vector store. No idea if all of that can be wrapped inside an MCP server.
roger that
roger thatβ€’3mo ago
No, the MCP server is to share context between agents; this way you save memory, and that will help with hallucination. I believe you need to separate your backend into microservices and have an event source to call them; this way you can run multiple instances, which will also help. Basically the sketch I sent you earlier is about that, how to scale microservices. The hallucination part is something else.
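(As a plain illustration of the shared-context idea, not an MCP implementation: store big context once and pass lightweight keys between agents, inlining only what each step needs:)
```python
# Minimal sketch of shared context: agents exchange keys into a shared store
# instead of re-sending full text in every prompt, which cuts token usage.
context_store: dict[str, str] = {}

def put_context(key: str, text: str) -> str:
    context_store[key] = text
    return key  # agents pass this key around, not the full text

def build_prompt(task: str, keys: list[str]) -> str:
    # Only the context actually needed for this agent's step gets inlined.
    relevant = "\n\n".join(context_store[k] for k in keys)
    return f"{task}\n\nContext:\n{relevant}"
```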
Arsh
ArshOPβ€’3mo ago
Okay, now I get what you were saying, and your sketch starts to make sense. This shared-context thing will also bring down the overall token consumption drastically, won't it?
roger that
roger thatβ€’3mo ago
Yeah, that's the purpose: saving more space for the agent to ~think~, and that will also help with hallucination.
Arsh
ArshOPβ€’3mo ago
roger that, will update IF i get any success
roger that
roger thatβ€’3mo ago
Sure thing bro, I'll be trying something similar soon. Also, if I can be of assistance, don't hesitate to ask.
