Saving bandwidth with PlaywrightCrawler: blocking googletagmanager, google-analytics, etc.

I already block images as described in [1], and this helps to save some bandwidth. Next step: looking at the statistics in my proxy service, I see a significant number of requests like these:
https://www.googletagmanager.com/gtag/js?id=...
https://connect.facebook.net/en_US/fbevents.js
https://www.google-analytics.com/analytics.js
https://fonts.googleapis.com/css?family=Lato
Can somebody show me example code that blocks these domains (or better, all domains from a given list)? I assume it should be something in PlaywrightCrawler.preNavigationHooks, right?

Prerequisites: PlaywrightCrawler with Firefox as the launcher (Chrome-specific hacks probably would not work). I'm not good at writing JavaScript from scratch, so I need some help.

[1] https://discord.com/channels/801163717915574323/1060986956961546320
1 Reply
quickest-silver (3y ago)
There is playwrightUtils.blockRequests (https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests), but it is only available in Chromium. For Firefox you need to use Playwright routing (https://playwright.dev/docs/api/class-page#page-route), which is less optimized: enabling routing disables the HTTP cache, and that can backfire when the goal is to save bandwidth.
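A minimal sketch of the routing approach, for illustration only. The names BLOCKED_DOMAINS, isBlockedUrl and blockDomainsHook are made up for this example and are not part of Crawlee or Playwright; the only library calls assumed are Playwright's page.route(), route.abort(), route.continue() and request.url(). The hook aborts any request whose hostname is on the list (or is a subdomain of a listed entry) and lets everything else through.

```javascript
// Hypothetical block list; adjust it to the hosts seen in your proxy statistics.
const BLOCKED_DOMAINS = [
    'googletagmanager.com',
    'google-analytics.com',
    'connect.facebook.net',
    'fonts.googleapis.com',
];

// Returns true if the URL's hostname equals a blocked domain
// or is a subdomain of one (e.g. www.google-analytics.com).
function isBlockedUrl(url, blockedDomains = BLOCKED_DOMAINS) {
    let hostname;
    try {
        hostname = new URL(url).hostname;
    } catch {
        return false; // not a parseable absolute URL, let it through
    }
    return blockedDomains.some(
        (domain) => hostname === domain || hostname.endsWith(`.${domain}`),
    );
}

// Pre-navigation hook: registers a Playwright route before the page navigates.
// Every request goes through the handler, which aborts blocked hosts.
async function blockDomainsHook({ page }) {
    await page.route('**/*', (route, request) =>
        (isBlockedUrl(request.url()) ? route.abort() : route.continue()),
    );
}

// Usage sketch (needs the `crawlee` and `playwright` packages installed):
//
// const { PlaywrightCrawler } = require('crawlee');
// const { firefox } = require('playwright');
//
// const crawler = new PlaywrightCrawler({
//     launchContext: { launcher: firefox },
//     preNavigationHooks: [blockDomainsHook],
//     requestHandler: async ({ page, request }) => { /* ... */ },
// });
```

Matching on the parsed hostname rather than on a substring of the URL avoids blocking unrelated pages that merely mention one of these domains in their path or query string. Because routing disables the cache, it is worth comparing proxy traffic before and after to confirm that this actually reduces bandwidth for your target sites.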
