What optimizations work for you?

I'm attempting to use Crawlee and Puppeteer to crawl between 15 and 30 million URLs. I'm not rich, but I also can't wait forever for the crawl to finish, so I've spent some time over the last few days hunting for optimizations that might make my crawler faster. This is more challenging than usual when you're crawling a laundry list of unknown sites.

First, here's some of the code I'm working with at this point. To get this running you just:

npm install crawlee puppeteer-extra puppeteer-extra-plugin-stealth @sparticuz/chromium puppeteer-core

I'm then working off the default TypeScript Puppeteer template, selected after running:

npx crawlee create your-project-name

In the next reply I'll post some of the code for setting up my crawler with as many optimizations as I've found useful. Running out of allowed characters...
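For copy-paste convenience, those two setup commands again as a block (with the stealth plugin's full package name, matching the import used in the code below):

```sh
# install the crawler dependencies named above
npm install crawlee puppeteer-extra puppeteer-extra-plugin-stealth @sparticuz/chromium puppeteer-core

# scaffold a project from the default TypeScript Puppeteer template
npx crawlee create your-project-name
```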
absent-sapphire (OP) · 3y ago
```ts
import { log, Dataset, PuppeteerCrawler, Configuration, utils, RequestQueue } from 'crawlee';
import { router } from './routes.js';
import puppeteerExtra from 'puppeteer-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';
import chromium from '@sparticuz/chromium';

// Note: allDBKeys, baseUrl, company, this_db_key and errorCount come from
// parts of my code that aren't shown here.

// Get the global configuration
const config = Configuration.getGlobalConfig();
config.set('purgeOnStart', false);
config.set('maxUsedCpuRatio', 0.9);
config.set('availableMemoryRatio', 0.7);
config.set('memoryMbytes', 30720);
config.set('chromeExecutablePath', await chromium.executablePath());
config.set('defaultBrowserPath', await chromium.executablePath());

log.setOptions({ prefix: allDBKeys[0] });
log.options.logger.setOptions({ skipTime: false });

puppeteerExtra.use(stealthPlugin());

// Open a named request queue (keyed by allDBKeys[0])
const requestQueue = await RequestQueue.open(allDBKeys[0]);

const crawler = new PuppeteerCrawler({
    // route requests through the handlers in routes.js (imported above)
    requestHandler: router,
    launchContext: {
        launcher: puppeteerExtra,
        launchOptions: {
            executablePath: await chromium.executablePath(),
            // this resolves to headless: "new"
            headless: chromium.headless,
            ignoreHTTPSErrors: true,
            args: [
                '--disable-features=InterestFeedContentSuggestions',
                '--no-default-browser-check',
                '--mute-audio',
                '--disable-features=Translate',
                '--disable-client-side-phishing-detection',
                '--disable-background-timer-throttling',
                '--disable-backgrounding-occluded-windows',
                '--disable-features=CalculateNativeWinOcclusion',
                '--disable-renderer-backgrounding',
                '--disable-features=AutoExpandDetailsElement',
                '--use-fake-device-for-media-stream',
                '--use-fake-ui-for-media-stream',
                '--deny-permission-prompts',
                '--disable-notifications',
                '--disable-background-networking',
                '--block-new-web-contents',
                // originally written as 'site-per-process'; Chromium flags need the leading dashes
                '--site-per-process',
            ],
        },
        useIncognitoPages: true,
    },
    useSessionPool: true,
}, config);

// I pull in some data and add it to the queue like this
await requestQueue.addRequests([{
    url: baseUrl,
    userData: {
        id: company.id,
        this_db_key: this_db_key,
        page_type: 'home',
        domain: company.properties.domain_unique,
    },
}]);

// then add the queue to the crawler
crawler.requestQueue = requestQueue;

// ugly infinite-loop code to deal with the fact that the crawler keeps dying on me
while (true) {
    try {
        await crawler.run();
    } catch {
        log.error(`Error count: ${errorCount} caused crawler to end`);
    }
}
```
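A slightly less ugly version of that restart loop that I might switch to; just a sketch, and MAX_RESTARTS plus the 5-second pause are arbitrary values, not anything Crawlee prescribes:

```ts
import { log } from 'crawlee';
import { setTimeout as sleep } from 'node:timers/promises';

// Sketch only: restart the crawler a bounded number of times and log the
// actual error instead of discarding it. MAX_RESTARTS and the pause length
// are arbitrary placeholders.
const MAX_RESTARTS = 50;
for (let restarts = 0; restarts < MAX_RESTARTS; restarts++) {
    try {
        await crawler.run();
        break; // finished normally, no need to restart
    } catch (err) {
        log.error(`Crawler died (restart ${restarts + 1}/${MAX_RESTARTS}): ${(err as Error).stack}`);
        await sleep(5_000); // brief pause before restarting
    }
}
```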
absent-sapphire (OP) · 3y ago
Performance: with this setup I'm getting about 78 finished requests per minute on an 8-vCPU Google Cloud VM with 32 GB of RAM. More machine specs:

Machine type: e2-standard-8
CPU platform: Intel Broadwell
Architecture: x86/64
Display device: Enabled (to use screen capturing and recording tools)
GPUs: None

The disk is just a standard persistent disk, not an SSD, so writes might be a tiny bit slow, and it's using IPv4.

Some observations: I think the config statements are being applied, because it doesn't purge on start, for example. But when the logs print the state, they don't show the customized CPU ratio or memory ratio:

PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":16,"desiredConcurrency":17,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.556},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}

@sparticuz/chromium (https://www.npmjs.com/package/@sparticuz/chromium) is a fork of https://github.com/alixaxel/chrome-aws-lambda that seems to be a bit faster.

The arg flags have made a significant performance difference. I've left out '--single-process': though it let concurrency climb past 120, it seemed to prevent any of the requests from actually finishing and resulted in a ton of "target is already closed" errors (the target being the browser tab). '--site-per-process' seems to be helping a lot. I can't afford to run '--no-sandbox' for security reasons, and some other args require '--no-sandbox' or root, but the list above makes a big difference for me. @NeoNomade suggested I try incognito tabs and useSessionPool: true.
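Not from my actual code, but since the global Configuration ratios don't show up in that AutoscaledPool state line, concurrency can also be capped directly in the crawler options. A minimal sketch; the numbers are placeholders, not tuned recommendations:

```ts
import { PuppeteerCrawler } from 'crawlee';
import { router } from './routes.js';

// Sketch: bounding the autoscaled pool on the crawler itself instead of
// relying only on the global Configuration ratios.
const crawler = new PuppeteerCrawler({
    requestHandler: router,
    minConcurrency: 8,        // placeholder
    maxConcurrency: 32,       // placeholder
    maxRequestsPerMinute: 200, // placeholder; also useful when targets rate-limit
});

await crawler.run(['https://example.com']);
```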
sunny-green · 3y ago
Are you sure that scraping your target sites isn't viable with the JSDOM or Cheerio scrapers? With browser-based scraping you're forced to waste something like 90% of your RAM and CPU on the browser engine, so the way to really improve performance is to find a no-browser approach that works.
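Roughly what the no-browser route looks like with CheerioCrawler; this is only a minimal sketch with made-up selectors and URL, and it only works when the text you need is present in the raw HTML:

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

// Sketch: HTTP-only crawling, no browser engine, so far less CPU/RAM per request.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        // Placeholder selectors for illustration.
        const title = $('title').text();
        const bodyText = $('body').text();
        await Dataset.pushData({ url: request.url, title, bodyText });
        await enqueueLinks(); // follow same-hostname links by default
    },
});

await crawler.run(['https://example.com']);
```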
ratty-blush · 3y ago
Sorry, I'm the OP but this is the Discord account on my phone. I'm not sure, and that's kind of the issue: I can't be sure. I could pull data from third parties on which sites use modern frameworks with client-side rendering like React/Vue etc., but that would still be imperfect in and of itself. In addition, the sites on the list will change often. Even if I knew what each site was using as of a month ago, that's going to change over time, so a browser is really the only safe bet in my mind; otherwise I would just use Common Crawl data. Generally I'm only pulling about 3 pages per site, so the number of sites is just too large to know those details about each one.
sunny-green · 3y ago
In general, getting data from the raw HTML or from a website's API is the better strategy (https://crawlee.dev/docs/introduction/real-world-project#drawing-a-plan) because of performance. The visual interface usually changes even more often than the internal data because of the human factor: a designer might come up with a new layout or restyle the UI on top of the existing backend, so relying on it isn't any safer.
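Where a site exposes an internal JSON API, hitting it directly is the cheapest option of all. A sketch using got-scraping (the HTTP client Crawlee uses under the hood); the endpoint URL is invented purely for illustration:

```ts
import { gotScraping } from 'got-scraping';

// Sketch: fetch an internal JSON endpoint directly instead of rendering the page.
// The URL below is a made-up example.
const { body } = await gotScraping({
    url: 'https://example.com/api/products?page=1',
    responseType: 'json',
});
console.log(body);
```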
ratty-blush · 2y ago
Sure, but I'm attempting to build a search engine off text data. I've already pulled the structured data for the engine, the hard data I want users to be able to formally filter by, and for that I used the route you're suggesting. But this part has to be done via a browser; even Google does it this way.

My log files got massive... like 870 GB+ for one log.txt file. It's printing out a ridiculous number of integers that are somehow associated with the request_queue. How can I turn this off? I've also established that once my request_queue gets into the 1-million-in-length territory, the Puppeteer crawler consistently crashes quickly.
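Two things I'm considering trying, though I don't know yet whether they address the integer spam: lowering the log level so only warnings and errors get written, and feeding the URL list into the queue in smaller batches instead of one giant call. A sketch; the queue name, `urls` variable, and batch size are placeholders:

```ts
import { log, LogLevel, RequestQueue } from 'crawlee';

// Only write warnings and errors to the log.
log.setLevel(LogLevel.WARNING);

// Sketch: add a huge URL list to the queue in chunks rather than all at once.
const urls: string[] = ['https://example.com']; // stand-in for the real URL source
const BATCH_SIZE = 1000; // arbitrary placeholder
const requestQueue = await RequestQueue.open('my-queue');
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    await requestQueue.addRequests(
        urls.slice(i, i + BATCH_SIZE).map((url) => ({ url })),
    );
}
```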
