How do LLM applications scale?
Sorry for sounding like a layman; the thing is, providers like OpenAI have something known as TPM (tokens per minute) limits. So how do people build applications around them when multiple users are using the app simultaneously?
Let's say I'm creating a feature like Deep Research, and due to the TPM limit I can only run one research process at a time. If I exceed the limit it throws a TPM error, which makes it nearly impossible for me to even test a pilot of my application.
Thanks!
34 Replies
they use the api https://platform.openai.com/docs/overview
Most providers have tier scaling, where the TPM limit increases at higher tiers
e.g. OpenAI tier 5 has 40,000,000 TPM for GPT-5 models via the API. 40M tokens per minute is a lot; in my experience, even with user concurrency you won't easily hit the TPM limits
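If you're stuck on a lower tier, you can also throttle yourself client-side so requests wait for the window to clear instead of getting rejected by the API. A minimal sketch of a tokens-per-minute budget tracker; the limit value and the per-request token estimate are just placeholders you'd tune to your actual account limits:
```python
import time
import threading
from collections import deque

class TPMThrottle:
    """Blocks callers until there is room under a tokens-per-minute budget."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window = deque()          # (timestamp, tokens) pairs from the last 60s
        self.lock = threading.Lock()

    def acquire(self, estimated_tokens: int) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop spend that is older than one minute.
                while self.window and now - self.window[0][0] > 60:
                    self.window.popleft()
                used = sum(t for _, t in self.window)
                if used + estimated_tokens <= self.limit:
                    self.window.append((now, estimated_tokens))
                    return
            time.sleep(1)  # Wait for the window to free up, then re-check.

# Usage (numbers are illustrative, check your own tier's limit):
# throttle = TPMThrottle(450_000)
# throttle.acquire(estimated_tokens=3_000)  # call this before each API request
```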
Which TPM limit do you have, and what concurrency are you dealing with?
Recently I've put together this architecture, which is fully scalable; take a look and see if it suits you

Yeah, so for now I'll just use multiple LLM providers instead of one, and that'll be my monkey patch.
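That multi-provider monkey patch is basically a fallback chain: try one provider, and on a rate-limit error move to the next. A rough sketch, assuming each provider is wrapped in a callable you write yourself (the wrapper names in the usage comment are hypothetical):
```python
import time

class RateLimited(Exception):
    """Raised by a provider wrapper when it hits its TPM/RPM limit."""

def call_with_fallback(providers, prompt, retries_per_provider=2):
    """Try each provider in order; move to the next one if it keeps rate-limiting."""
    last_error = None
    for call in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except RateLimited as err:
                last_error = err
                time.sleep(2 ** attempt)  # brief backoff before retrying this provider
    raise last_error or RuntimeError("no providers configured")

# Usage (call_openai / call_anthropic are hypothetical wrappers you'd write):
# answer = call_with_fallback([call_openai, call_anthropic], "summarize this repo")
```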
~ my current OpenAI tier --> 2
thanks for this; quite (very very) overwhelming. but definitely saved it
I need a job


here are some rough things we came up with lol
this thing will definitely not scale even for 10 concurrent users
yeah, so you are doing a basic RAG
and feeding it into an MCP server with agents
aahh kinda, you could say that
now the thing is, we're hitting the TPM limits of the LLM providers if we try to run like 2-3 evaluations together
and that's even when the agents work one by one
I think you're lacking an MCP server
but I could be wrong tho, I'm still learning this
here's how our current system works:
Cloudflare Pages (Next.js frontend) --> EC2 (Hono backend) --> EC2 (FastAPI server ~ does the RAG work)
High chance that could be the case; I'm fairly new to this too
I see, so this is horrible
first thing, you need to change the FastAPI server to something like Go or Rust
and also use queues
or change the backend to Go and then connect to FastAPI with a queue
if you link services together without a queue, it will never scale unless you get a gigantic computer
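The queue idea in practice: the backend only enqueues a job and returns, while a separate pool of workers drains the queue at whatever rate your TPM budget allows; you scale by running more workers. A minimal sketch with Redis as the broker; the queue name, payload shape, and run_research placeholder are assumptions:
```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
QUEUE = "research_jobs"  # hypothetical queue name

def enqueue_job(user_id: str, query: str) -> None:
    """Called by the API layer: accept the request and return immediately."""
    r.lpush(QUEUE, json.dumps({"user_id": user_id, "query": query}))

def worker_loop() -> None:
    """Run one or more of these processes; each pops jobs as capacity allows."""
    while True:
        _, raw = r.brpop(QUEUE)          # blocks until a job is available
        job = json.loads(raw)
        run_research(job["user_id"], job["query"])

def run_research(user_id: str, query: str) -> None:
    ...  # placeholder for your existing RAG/agent pipeline
```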
for agent orchestration we're using this: https://github.com/crewAIInc/crewAI
I doubt it's a Python-only lib
bro, you need logs and a smoke test
try adding logs to every piece of code you've got and then use performance 4 all to test
and you will see the pain points of your current system
and then you solve them one by one
and retest until it's flawless
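For the "add logs to every piece of code" step, a per-request timing middleware on the FastAPI server is a cheap first pass before reaching for a load-testing tool. A minimal sketch:
```python
import logging
import time

from fastapi import FastAPI, Request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag-service")

app = FastAPI()

@app.middleware("http")
async def log_timing(request: Request, call_next):
    """Log method, path, status code, and wall-clock latency for every request."""
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info("%s %s -> %s in %.1f ms",
                request.method, request.url.path, response.status_code, elapsed_ms)
    return response
```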
oh actually p4all is proprietary software, so I'm looking for an alternative for you
what do you think
hmm, already working on that part. But these LLM agents make my life hell; sometimes they work flawlessly and the next moment they mess up tool calls
I see, so you mean they mess up the arguments and you get a type error in a call?
Sometimes when I shift to a mini model, it ignores previous instructions. Like with the semantic search tool: it passes paragraphs of text as the input param, ignoring all the restrictions on input. Then on retry it realizes "ohh, I gave the tool incorrect input, let's fix that"
ok so you should try
a top-k distance check: get the distance between an ideal vectorized output and your actual output, and do a retry if the distance is greater than some value
that's called rerank, or re-retry
basically it's a hallucination problem, right?
wasn't reranking already done in your vectors by adding that KNN algo?
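The retry-on-distance idea sketched out: embed the model's output, compare it against an embedding of what a good output should look like, and retry when the cosine distance is too large. This is only a toy sketch; generate(), embed(), and the 0.35 threshold are all assumptions you'd swap for your own pieces:
```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def generate_with_check(generate, embed, reference_text, threshold=0.35, max_retries=3):
    """Call generate(); retry if the output drifts too far from the reference embedding."""
    ref_vec = embed(reference_text)
    for _ in range(max_retries):
        output = generate()
        if cosine_distance(embed(output), ref_vec) <= threshold:
            return output
    return output  # give up and return the last attempt

# generate / embed are hypothetical callables: generate() hits your model,
# embed(text) returns a vector from whatever embedding model you already use.
```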
yeah I guess but there are many ways you can do it
you're right; I need to make the models more deterministic and play around a bit with the top-k/top-p values
yeah that sounds fun xD
I mean you can do whatever you want, even using something like Zod to make the type checks
as long as it works
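Zod is the TypeScript-side option; since the agent/RAG side here is Python, the same idea with Pydantic would be to validate the tool arguments before running the tool and hand the validation error back to the agent for its retry. A sketch, with a made-up semantic-search schema:
```python
from pydantic import BaseModel, ValidationError, field_validator

class SemanticSearchArgs(BaseModel):
    """Schema for a (hypothetical) semantic search tool's arguments."""
    query: str
    top_k: int = 5

    @field_validator("query")
    @classmethod
    def query_must_be_short(cls, v: str) -> str:
        if len(v) > 300:
            raise ValueError("query must be a short phrase, not a paragraph")
        return v

def run_semantic_search(raw_args: dict) -> str:
    try:
        args = SemanticSearchArgs(**raw_args)
    except ValidationError as err:
        # Return the error to the agent so its retry can fix the arguments.
        return f"Invalid tool input: {err}"
    return do_search(args.query, args.top_k)

def do_search(query: str, top_k: int) -> str:
    ...  # placeholder for your existing retrieval call
```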
tbh that was one reason I didn't use LangChain/LangGraph this time.
I remember type-checking DeepSeek's JSON outputs, and it was truly a nightmare
all of this is new to everyone, so the bad news is that you're actually on your own
and the good news is that no one knows what we are doing
so you basically have to be creative and try different approaches until you're satisfied
I'd like to help but I don't have a paid model yet, so I'm trying to get a job so I can afford that
but I'm horrible at job interviews
some of the best lines I've heard on the internet today.
one thing I know: if you have specialized agents, you must have MCP
you seem like a fun dude; wish I could offer you a job if I weren't a bootstrapped college guy, and that too from a country with the lowest wages lol
hue br
gotta look into this;
the thing is, we're doing other work besides just the agents in the FastAPI server too
like ingesting the GitHub repo, chunking & embedding it into a vectorstore
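That ingest work (pull the repo, chunk, embed, store) can be treated as just another queued job so it stays out of the request path. A rough sketch of the chunk-and-embed part; the chunk size, overlap, file glob, and the embed/store callables are all assumptions:
```python
from pathlib import Path

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap; a starting point, not a tuned strategy."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def ingest_repo(repo_dir: str, embed, store) -> None:
    """Walk a cloned repo, chunk each source file, embed, and store the vectors."""
    for path in Path(repo_dir).rglob("*.py"):   # extend the glob to other file types
        text = path.read_text(errors="ignore")
        for i, chunk in enumerate(chunk_text(text)):
            vector = embed(chunk)                              # hypothetical embedding call
            store(id=f"{path}:{i}", vector=vector, text=chunk) # hypothetical vectorstore call
```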
no idea if all of that can be wrapped inside an MCP server
no, the MCP server is to share context between agents
that way you save memory, and it will help with hallucination
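To make the shared-context point concrete: instead of every agent carrying the whole history in its prompt, agents write to and read from a shared store and pull only what they need. This is just a toy sketch of the idea, not the MCP protocol itself:
```python
class SharedContext:
    """In-memory shared context; in production this would sit behind a service (e.g. an MCP server)."""

    def __init__(self):
        self._notes: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._notes[key] = value

    def get(self, keys: list[str]) -> str:
        """Return only the requested entries, so each agent's prompt stays small."""
        return "\n".join(f"{k}: {self._notes[k]}" for k in keys if k in self._notes)

# ctx = SharedContext()
# ctx.put("repo_summary", summary_from_ingest_agent)          # one agent writes
# prompt = f"{ctx.get(['repo_summary'])}\n\nTask: evaluate the test coverage."  # another reads
```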
I believe you need to separate your backend into microservices and have an event source to call them
this way you can have multiple instances
that will also help
basically the sketch I sent you earlier is about that: how to scale microservices
and then the hallucination part is something else
okay now I get what you were saying and your sketch starts to make sense
this shared context thing will also bring down the overall token consumption drastically, won't it?
yeah
that's the purpose
saving more space for the agent to ~think~, and that will also help with hallucination
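One concrete way to leave that room is to trim whatever context you pass in down to a fixed token budget before each call. A sketch using tiktoken for counting; the 4000-token budget is just illustrative:
```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_to_budget(context_blocks: list[str], budget_tokens: int = 4000) -> str:
    """Keep the most recent context blocks that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for block in reversed(context_blocks):        # newest blocks first
        n = len(enc.encode(block))
        if used + n > budget_tokens:
            break
        kept.append(block)
        used += n
    return "\n\n".join(reversed(kept))
```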
roger that, will update IF I get any success
sure thing bro, I will be trying something similar soon
also if I can be of assistance, don't hesitate to ask