Memory (RAM) issues
I have a workflow where one step enriches every page on a website (sometimes 600 or more) with wider business context information, followed by a foreach step that handles each page of the website and examines its HTML/attributes.
With a very large site, I got this error in the Mastra logs (attached).
I have a few questions:
1) Is there any way around this (apart from "buy a server with higher spec")?
I'm wondering especially from the perspective of workflow architecture/orchestration.
2) If I do plan to load in the entire HTML of a given web page (along with business context) and then process it, are there any best practices that I might not be doing?
The Workflow is basically:
(MAX_CONCURRENCY is 1 and it still does this)
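For reference, a rough sketch of that shape in TypeScript, assuming Mastra's createWorkflow/createStep API with zod schemas and a foreach concurrency option (step names, schemas, and bodies are placeholders of mine; this deliberately shows the problematic pattern where all page HTML flows through step outputs):

```ts
import { createWorkflow, createStep } from "@mastra/core/workflows";
import { z } from "zod";

// Step 1: enrich every page with wider business context (can be 600+ pages).
const enrichPages = createStep({
  id: "enrich-pages",
  inputSchema: z.object({ siteUrl: z.string() }),
  outputSchema: z.array(z.object({ url: z.string(), html: z.string() })),
  execute: async ({ inputData }) => {
    // ...crawl inputData.siteUrl and attach business context to each page...
    return []; // placeholder
  },
});

// Step 2: runs once per page via foreach; examines the HTML/attributes.
const examinePage = createStep({
  id: "examine-page",
  inputSchema: z.object({ url: z.string(), html: z.string() }),
  outputSchema: z.object({ url: z.string(), findings: z.array(z.string()) }),
  execute: async ({ inputData }) => {
    // ...inspect inputData.html...
    return { url: inputData.url, findings: [] };
  },
});

export const siteWorkflow = createWorkflow({
  id: "site-audit",
  inputSchema: z.object({ siteUrl: z.string() }),
  outputSchema: z.array(
    z.object({ url: z.string(), findings: z.array(z.string()) })
  ),
})
  .then(enrichPages)
  .foreach(examinePage, { concurrency: 1 }) // MAX_CONCURRENCY = 1
  .commit();
```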
The key part from the attached file:
📝 Created GitHub issue: https://github.com/mastra-ai/mastra/issues/7559
At the moment there might not be any solution. We are adding an event system to our workflows so we can run each step on different hardware; this is in the works, but we're still in the < 1.0 range, so we haven't really looked into memory and performance that much yet.
Understood, okay. Do you have any practical tips (aside from vertical and horizontal hardware scaling) that might mitigate the issue, e.g. "Use X or Y construct"?
I'm already avoiding some of these issues by batching and using foreach, but I don't know how it can always be avoided.
Any advice would be hugely appreciated.
Try not to put any large objects inside the step outputs/inputs.
Okay!
I am curious:
In a foreach, does every iteration accumulate memory usage? I.e., even with minimal object sizes, if the collection is large enough, is there still the potential for this error (from my original post)? Or is it avoidable in theory?
@joneslloyd maybe you could try using the filesystem: load files when you need them, unload them when done, and then just pass the URIs around.
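To illustrate that suggestion, a minimal sketch assuming a Node environment and Mastra's createStep API (the temp-dir layout, step name, and schemas are my own assumptions): the fetch step spills each page's HTML to disk immediately and emits only small { url, htmlPath } records, so the step output stays tiny.

```ts
import { createStep } from "@mastra/core/workflows";
import { mkdtemp, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { z } from "zod";

// Fetch each page, write the HTML to disk right away, and pass URIs around
// instead of the HTML itself.
const fetchPagesToDisk = createStep({
  id: "fetch-pages-to-disk",
  inputSchema: z.object({ pageUrls: z.array(z.string()) }),
  outputSchema: z.array(z.object({ url: z.string(), htmlPath: z.string() })),
  execute: async ({ inputData }) => {
    const dir = await mkdtemp(join(tmpdir(), "site-html-"));
    const records: { url: string; htmlPath: string }[] = [];
    for (const [i, url] of inputData.pageUrls.entries()) {
      const res = await fetch(url);
      const html = await res.text(); // only one page's HTML in memory at a time
      const htmlPath = join(dir, `page-${i}.html`);
      await writeFile(htmlPath, html); // spill to disk, drop the string
      records.push({ url, htmlPath });
    }
    return records;
  },
});
```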
Thanks for getting back to me. I have a couple of thoughts:
1) It's a bit tricky because my Mastra workflows intentionally don't do CRUD (they only emit events), so dynamically loading files from the FS might not be viable.
2) Even if they did do CRUD and could read from the FS (or an S3 bucket), wouldn't this have the same effect as the foreach memory issue? As the loop progressed through the items, each large web page would end up in memory, the same as in my current solution.
I think what you need to do is keep the large items out of the inputs/outputs of your workflow steps, and only keep the results of whatever processing you're doing. You'll also save on database storage, since workflow snapshots end up in the database.
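A sketch of the matching per-page step, combining both suggestions (again hypothetical; the title-length extraction is a stand-in for real processing): it takes only a path as input, loads the HTML inside execute, and returns just a small derived result, so the big string can be garbage-collected as soon as the step returns and never lands in a workflow snapshot.

```ts
import { createStep } from "@mastra/core/workflows";
import { readFile } from "node:fs/promises";
import { z } from "zod";

const examinePageFromDisk = createStep({
  id: "examine-page-from-disk",
  inputSchema: z.object({ url: z.string(), htmlPath: z.string() }),
  // Output only the small derived result, never the raw HTML.
  outputSchema: z.object({ url: z.string(), titleLength: z.number() }),
  execute: async ({ inputData }) => {
    const html = await readFile(inputData.htmlPath, "utf8"); // loaded per iteration
    const title = /<title>(.*?)<\/title>/is.exec(html)?.[1] ?? "";
    // html goes out of scope here, so it can be GC'd before the next iteration
    return { url: inputData.url, titleLength: title.length };
  },
});
```

With this shape, what should accumulate across 600 foreach iterations is only the small per-page records, not 600 HTML documents.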
What about cases where the large item (the full HTML of a given web page) is necessary for the workflow step?
(And in this specific instance, it needs to be done for a large percentage of a website's pages.)
I'm guessing at some point there'll be better ways to handle large datasets in workflows, but for now you'll need a beefy server with enough RAM to hold all that data.
Is there no other solution at this stage?