Crawlee uses 500GB of storage

My problem is that after tens of thousands of crawls, I've stored hundreds of GB of user profile data in the temp directory. How can I prevent Crawlee from storing so much data?

Use case: I'm looking to repeatedly crawl the same page with a different user-agent and proxy each time, so I have retireBrowserAfterPageCount set to 1. I'll make ~100k website requests over a month.
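For reference, a minimal sketch of the kind of setup described above (assuming Crawlee's PuppeteerCrawler; the proxy URLs and target URL are placeholders, not the poster's actual configuration):

```ts
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Rotate through a pool of proxies so each request can go out through a different one.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000', // placeholder
        'http://proxy2.example.com:8000', // placeholder
    ],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        // Retire the browser after a single page, so every request starts
        // from a fresh browser instance (and hence a fresh profile).
        retireBrowserAfterPageCount: 1,
    },
    async requestHandler({ page, request }) {
        // Process the page here, e.g. push scraped data to the default dataset.
        console.log(`Crawled ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com/target-page']);
```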
Pepa J
Pepa J2y ago
Hello @bret_pat, I just created a new project from the Crawlee template, ran it with npm run start, waited until it produced 300 dataset results, and then ran it again. It removed the previous results and started the scraping again. This is the default behaviour for unnamed storages. Is this also your case? What storages do you use, and how is your scraper implemented?
crude-lavender
crude-lavenderOP2y ago
No, what you're seeing is just the crawler's results. Profiles are something else the crawler also creates, and each one is ~16 MB. Let me clarify: what you described happens to me as well, but that isn't my complaint.
Pepa J
Pepa J2y ago
So, how are you saving the profiles?
crude-lavender
crude-lavenderOP2y ago
I don't. They are saved automatically, per the documentation. It might be something from Puppeteer.
Pepa J
Pepa J2y ago
Not sure what you are talking about right now.
crude-lavender
crude-lavenderOP2y ago
Sorry, I was on mobile and couldn't respond well.
crude-lavender
crude-lavenderOP2y ago
The user data I was referring to is this folder, which is different from the results you mentioned earlier: https://crawlee.dev/api/browser-crawler/interface/BrowserLaunchContext#userDataDir
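For anyone following along, that linked option is set through the crawler's launch context. A minimal sketch of pointing it at a known location, so you can at least see where the profile data goes (assuming PuppeteerCrawler; the path is illustrative):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // BrowserLaunchContext.userDataDir: where Chromium keeps the profile.
        // If this is not set, each launched browser gets its own throwaway
        // profile directory under the OS temp folder.
        userDataDir: './storage/chrome-profile', // illustrative path
    },
    async requestHandler({ page, request }) {
        console.log(`Crawled ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
```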
crude-lavender
crude-lavenderOP2y ago
The specific issue I'm having is identical to this GitHub issue:
crude-lavender
crude-lavenderOP2y ago
GitHub: How can i disable puppeteer dev profile. · Issue #1791 · puppeteer/puppeteer (https://github.com/puppeteer/puppeteer/issues/1791)
crude-lavender
crude-lavenderOP2y ago
TL;DR: Crawlee is creating a new user profile for every browser that I create and never cleans them up. Since each profile is a few MB, it grew to 500 GB.
crude-lavender
crude-lavenderOP2y ago
I eventually got the same error as this person (https://github.com/puppeteer/puppeteer/issues/1791#issuecomment-1202133493), because 500 GB of data was stored in a "temporary" folder.
crude-lavender
crude-lavenderOP2y ago
Although I'm having this problem on my desktop, not on AWS. Let me know if there are any other questions.
NeoNomade
NeoNomade2y ago
I've created a shell script that deletes that directory every couple of hours and put it on a cron.
Pepa J
Pepa J2y ago
Ah, I see. Hmm, generally the crawlers are not intended to run in a persistent environment for a long time. Since this looks more like a Puppeteer-related issue, I'm not sure we can do much about it in Crawlee 🤔
crude-lavender
crude-lavenderOP2y ago
I've been getting Puppeteer crawler errors saying that it failed to launch the browser process when I set a custom user directory. Are you setting a custom user dir, or just deleting the original temp dir?
NeoNomade
NeoNomade2y ago
The original
crude-lavender
crude-lavenderOP2y ago
What code are you using to delete it? I'm using fs-extra, and it's crashing because some temp files are in use.
NeoNomade
NeoNomade2y ago
Shell script with bash
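As an alternative to a bash script on cron, here is a hedged sketch of the same cleanup done from Node. It assumes the leftover profiles live under the OS temp directory with a name starting with "puppeteer_dev", as described in the linked Puppeteer issue; adjust the prefix and location if yours differ. The built-in retry options of fs.rm help when some files are momentarily still in use:

```ts
import { readdir, rm } from 'node:fs/promises';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Delete leftover Puppeteer profile directories from the OS temp folder.
// The 'puppeteer_dev' prefix is an assumption based on
// https://github.com/puppeteer/puppeteer/issues/1791.
async function cleanProfiles(): Promise<void> {
    const entries = await readdir(tmpdir(), { withFileTypes: true });
    for (const entry of entries) {
        if (entry.isDirectory() && entry.name.startsWith('puppeteer_dev')) {
            // force + retries: tolerate files that are briefly locked or in use.
            await rm(join(tmpdir(), entry.name), {
                recursive: true,
                force: true,
                maxRetries: 5,
                retryDelay: 500,
            });
        }
    }
}

// Run every couple of hours, mirroring the cron approach above.
setInterval(() => cleanProfiles().catch(console.error), 2 * 60 * 60 * 1000);
```

Note that a profile directory still held open by a running browser may fail to delete even with retries, so it is safest to run this between crawls or simply tolerate the occasional failure and catch it on the next pass.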
