Saving bandwidth using PlaywrightCrawler: blocking googletagmanager, google-analytics, etc.
I already block images as described in [1], and this helps to save some bandwidth.
Next step: looking at the statistics in my proxy service, I see a significant number of requests to domains such as googletagmanager and google-analytics.
Can somebody show me an example of code blocking these domains? (better: to block all domains from a given list)
I assume it should be something in PlaywrightCrawler.preNavigationHooks, right?
Prerequisites: PlaywrightCrawler, Firefox as launcher (Chrome-specific hacks probably would not work)
(I'm not good at writing JavaScript from scratch, so I need some help)
[1] https://discord.com/channels/801163717915574323/1060986956961546320
1 Reply
quickest-silver•3y ago
Crawlee has https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests but it is only available in Chromium.
For Firefox you need to use Playwright routing instead, which is less optimized: it disables the cache, and that can backfire.
https://playwright.dev/docs/api/class-page#page-route
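A minimal sketch of the routing approach for Firefox, assuming the block list is matched against the request hostname. `BLOCKED_DOMAINS`, `isBlockedUrl` and `blockListedDomains` are hypothetical names chosen for illustration; only `page.route` and `route.abort` come from Playwright. Keep in mind the caveat above: enabling routing disables the cache, so it is worth checking in the proxy statistics that this actually reduces traffic.

```javascript
// Hypothetical block list: replace with the domains seen in your proxy statistics.
const BLOCKED_DOMAINS = ['googletagmanager.com', 'google-analytics.com'];

// Returns true when the URL's hostname is a listed domain or a subdomain of one.
function isBlockedUrl(url, domains = BLOCKED_DOMAINS) {
    let hostname;
    try {
        hostname = new URL(url).hostname;
    } catch {
        return false; // not a parseable absolute URL, so let it through
    }
    return domains.some((d) => hostname === d || hostname.endsWith(`.${d}`));
}

// Pre-navigation hook: registers a Playwright route that aborts matching requests.
// page.route accepts a predicate that receives a URL object.
async function blockListedDomains({ page }) {
    await page.route(
        (url) => isBlockedUrl(url.href),
        (route) => route.abort(),
    );
}
```

To use it, pass the hook to the crawler as `preNavigationHooks: [blockListedDomains]`, alongside `launchContext: { launcher: firefox }` with `firefox` imported from `playwright`. Matching on the hostname rather than on a substring of the full URL avoids blocking pages that merely mention a listed domain in their query string.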