Crawlee uses 500GB of storage
My problem is that after tens of thousands of crawls, I've stored hundreds of GB of user profile data in the temp directory. How can I prevent Crawlee from storing so much data?
-----
Use case
I'm looking to repeatedly crawl the same page with a different user-agent & proxy each time.
This means that I have retireBrowserAfterPageCount set to 1.
I'll make ~100k website requests over a month
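Roughly, the setup looks like the sketch below (simplified; the proxy URLs and target URL are placeholders, not my real ones):

```ts
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxies; in reality there is a rotating pool.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        // Retire every browser after a single page, so each request gets a
        // fresh browser with a new user-agent/fingerprint and proxy session.
        retireBrowserAfterPageCount: 1,
    },
    async requestHandler({ page }) {
        // ... extract whatever is needed from the page
    },
});

await crawler.run(['https://example.com/target-page']);
```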
19 Replies
Hello @bret_pat I just created a new project from the Crawlee template, ran it with
npm run start
waited till it produced 300 dataset results, and then ran it again. It removed the previous results and started the scraping again. This is the default behaviour for unnamed storages. Is this also your case? Which storages do you use, and how is your scraper implemented?
crude-lavenderOP•2y ago
No, what you see is just the result of the crawler.
Profiles are something else the crawler also creates, and each one is ~16MB.
Let me clarify: what you described happens to me as well, but it isn't my complaint.
So, how are you saving the profiles?
crude-lavenderOP•2y ago
I don’t. They are saved automatically if you look at the documentation.
And it might be something from puppeteer
Not sure what you are talking about right now.
crude-lavenderOP•2y ago
Sorry, I was on mobile and couldn't respond well.
crude-lavenderOP•2y ago
The user data that I was referring to is this folder; it's different from the results that you mentioned earlier.
https://crawlee.dev/api/browser-crawler/interface/BrowserLaunchContext#userDataDir
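For context, that directory holds the per-browser Chrome profile. It can be pointed at an explicit location through the crawler's launchContext (a rough sketch, assuming PuppeteerCrawler; the path is a placeholder, and as I mention further down, setting a custom dir caused its own problems for me):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Placeholder path: makes the profile location explicit instead of
        // letting Puppeteer create a fresh dir under the OS temp directory.
        userDataDir: './storage/chrome-profile',
    },
    async requestHandler({ page }) {
        // ...
    },
});
```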
crude-lavenderOP•2y ago
Then the specific issue that I'm having is identical to this GitHub issue.
crude-lavenderOP•2y ago
GitHub
How can i disable puppeteer dev profile. · Issue #1791 · puppeteer/...
Environment: Puppeteer version: 0.11.0 Platform / OS version: AWS Lambda Node.js version: 6.10 Puppeteer Arguments puppeteer.launch({ args: ['--disable-gpu', '--no-sandbox', '--...
crude-lavenderOP•2y ago
TL;DR: Crawlee is creating a new user profile for every browser that I create and never cleans them up. Since each profile is a few MB, the total grew to 500GB.
crude-lavenderOP•2y ago
I eventually got an error identical to this person's,
https://github.com/puppeteer/puppeteer/issues/1791#issuecomment-1202133493
because 500GB worth of data was stored in a "temporary" folder.
crude-lavenderOP•2y ago
Although I'm having this problem on my desktop and not on AWS
Let me know if there are any other questions.
I've created a shell script that deletes that directory every couple of hours and put it on a cron.
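In Node terms, the cleanup amounts to something like this sketch. It assumes the profiles land in the OS temp dir under names like puppeteer_dev_chrome_profile-* (which is what the linked Puppeteer issue describes); adjust the prefix to whatever actually shows up on disk. Skipping directories that are still locked by a running browser avoids the "files in use" crash mentioned below.

```ts
import { readdir, rm } from 'node:fs/promises';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Assumption: Puppeteer writes throwaway profiles into the OS temp dir with
// this prefix; change it to match what you see on your machine.
const PROFILE_PREFIX = 'puppeteer_dev_chrome_profile-';

async function cleanProfiles(): Promise<void> {
    const entries = await readdir(tmpdir());
    for (const entry of entries) {
        if (!entry.startsWith(PROFILE_PREFIX)) continue;
        try {
            // recursive + force mirrors `rm -rf`.
            await rm(join(tmpdir(), entry), { recursive: true, force: true });
        } catch {
            // Profile is still held open by a live browser; it will be
            // removed on the next scheduled run.
        }
    }
}

await cleanProfiles();
```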
Ah I see. Hmm, generally the crawlers are not intended to run in a persistent environment for a long time. Since this looks more like a Puppeteer-related issue, I'm not sure if we can do much about it in Crawlee 🤔
crude-lavenderOP•2y ago
I've been getting Puppeteer crawler errors saying that it failed to launch the browser process when I set a custom user directory.
Are you setting a custom user dir, or did you just delete the original temp dir?
The original
crude-lavenderOP•2y ago
What code are you using to delete? I'm using fs-extra and it's crashing because some temp files are in use
Shell script with bash