Crawlee uses 500GB of storage
My problem is that after tens of thousands of crawls, I've stored hundreds of GB of user profile data in the temp directory. How can I prevent Crawlee from storing so much data?
-----
Use case
I'm looking to repeatedly crawl the same page with a different user-agent & proxy each time.
This means that I have retireBrowserAfterPageCount set to 1.
I'll make ~100k website requests over a month
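Roughly, the setup looks like the sketch below (simplified; the proxy URLs and target URL are placeholders, not my real ones):

```ts
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxies; in reality there is a rotating pool.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        // Retire every browser after a single page, so each request gets a
        // fresh browser with a new user-agent/fingerprint and proxy session.
        retireBrowserAfterPageCount: 1,
    },
    async requestHandler({ page }) {
        // ... extract whatever is needed from the page
    },
});

await crawler.run(['https://example.com/target-page']);
```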
19 Replies
Hello @bret_pat I just created a new project from the Crawlee template, ran it with
npm run start
waited till it produced 300 dataset results, and then ran it again. It removed the previous results and started the scraping again. This is the default behaviour for unnamed storages. Is this also your case? Which storages do you use, and how is your scraper implemented?
crude-lavenderOP•2y ago
No, what you see is just the result of the crawler.
Profiles are something else the crawler also creates, and each one is ~16MB.
Let me clarify: what you described happens to me as well, but it isn't my complaint.
So, how are you saving the profiles?
crude-lavenderOP•2y ago
I don’t. They are saved automatically if you look at the documentation.
And it might be something from puppeteer
Not sure what you are talking about right now.
crude-lavenderOP•2y ago
Sorry, I was on mobile and couldn't respond well.
crude-lavenderOP•2y ago
The user data that I was referring to is this folder; it's different from the results that you mentioned earlier.
https://crawlee.dev/api/browser-crawler/interface/BrowserLaunchContext#userDataDir
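For context, that directory holds the per-browser Chrome profile. It can be pointed at an explicit location through the crawler's launchContext (a rough sketch, assuming PuppeteerCrawler; the path is a placeholder, and as I mention further down, setting a custom dir caused its own problems for me):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Placeholder path: makes the profile location explicit instead of
        // letting Puppeteer create a fresh dir under the OS temp directory.
        userDataDir: './storage/chrome-profile',
    },
    async requestHandler({ page }) {
        // ...
    },
});
```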
crude-lavenderOP•2y ago
Then the specific issue that I'm having is identical to this GitHub issue.
crude-lavenderOP•2y ago
GitHub
How can i disable puppeteer dev profile. · Issue #1791 · puppeteer/...
Environment: Puppeteer version: 0.11.0 Platform / OS version: AWS Lambda Node.js version: 6.10 Puppeteer Arguments puppeteer.launch({ args: ['--disable-gpu', '--no-sandbox', '--...
crude-lavenderOP•2y ago
TL;DR: Crawlee is creating a new user profile for every browser that I create and never cleans them up. Since each profile is a few MB, the total grew to 500GB.
crude-lavenderOP•2y ago
I eventually got an error identical to this person's,
https://github.com/puppeteer/puppeteer/issues/1791#issuecomment-1202133493
because 500GB worth of data was stored in a "temporary" folder.
crude-lavenderOP•2y ago
Although I'm having this problem on my desktop and not on AWS
Let me know if there are any other questions.
I've created a shell script that deletes that directory every couple of hours and put it on a cron.
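In Node terms, the cleanup amounts to something like this sketch. It assumes the profiles land in the OS temp dir under names like puppeteer_dev_chrome_profile-* (which is what the linked Puppeteer issue describes); adjust the prefix to whatever actually shows up on disk. Skipping directories that are still locked by a running browser avoids the "files in use" crash mentioned below.

```ts
import { readdir, rm } from 'node:fs/promises';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Assumption: Puppeteer writes throwaway profiles into the OS temp dir with
// this prefix; change it to match what you see on your machine.
const PROFILE_PREFIX = 'puppeteer_dev_chrome_profile-';

async function cleanProfiles(): Promise<void> {
    const entries = await readdir(tmpdir());
    for (const entry of entries) {
        if (!entry.startsWith(PROFILE_PREFIX)) continue;
        try {
            // recursive + force mirrors `rm -rf`.
            await rm(join(tmpdir(), entry), { recursive: true, force: true });
        } catch {
            // Profile is still held open by a live browser; it will be
            // removed on the next scheduled run.
        }
    }
}

await cleanProfiles();
```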
Ah I see. Hmm, generally the crawlers are not intended to run in a persistent environment for a long time. Since this looks more like a Puppeteer-related issue, I'm not sure if we can do much about it in Crawlee 🤔
crude-lavenderOP•2y ago
I've been getting Puppeteer crawler errors saying that it failed to launch the browser process when I set a custom user directory.
Are you setting a custom user dir, or did you just delete the original temp dir?
The original
crude-lavenderOP•2y ago
What code are you using to delete? I'm using fs-extra and it's crashing because some temp files are in use
Shell script with bash