Disable images in Playwright

How can I disable downloading images, videos, and other media globally for my scraper?
22 Replies
grumpy-cyan
grumpy-cyan•3y ago
You can create an array of resourceTypes that you'd like to block.
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];
Then, in your crawler's preNavigationHooks, add this function:
async ({ page }) => {
    await page.route('**/*', (route) => {
        if (BLOCKED.includes(route.request().resourceType())) return route.abort();
        return route.continue();
    });
};
fair-rose
fair-roseOP•3y ago
Thanks I will try that
grumpy-cyan
grumpy-cyan•3y ago
You can also check out this article https://scrapingant.com/blog/block-requests-playwright
Block resources with Playwright | ScrapingAnt Blog
This article will show you how to intercept and block requests with Playwright using the request interception API. Learn how to block images, CSS and Javascript loading.
fair-rose
fair-roseOP•3y ago
Thanks
fair-rose
fair-roseOP•3y ago
I have this in my main.ts file:
[screenshot of main.ts attachment]
fair-rose
fair-roseOP•3y ago
It does not work yet. Can you spot an error?
fair-rose
fair-roseOP•3y ago
I inject it here:
[screenshot attachment]
grumpy-cyan
grumpy-cyan•3y ago
Just add the function directly into the crawler
const playwrightCrawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: playwrightRouter,
    requestQueue: playwrightRequestQueue,
    headless: true,
    launchContext: {
        launcher: firefox,
    },
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                if (BLOCKED_RESOURCES.includes(route.request().resourceType())) {
                    return route.abort();
                }

                return route.continue();
            });
        },
    ],
    autoscaledPoolOptions: {
        desiredConcurrency: 6,
    },
    navigationTimeoutSecs: 45,
    requestHandlerTimeoutSecs: PLACE_ID_REQUESTS_CHUNK_SIZE * 15,
    maxRequestRetries: 4,
    // ! development only
    // maxRequestsPerCrawl: 1,
});
Here's one of my crawlers using the preNavigationHook
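For reference, the snippet above relies on a few values defined elsewhere in that project. A minimal sketch of what they could look like (the names appear in the snippet, but the concrete values and imports here are assumptions, not code from the thread):

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Resource types to abort before download (assumed to mirror the BLOCKED list above).
const BLOCKED_RESOURCES = ['image', 'stylesheet', 'media', 'font', 'other'];

// Hypothetical batch size, used only to scale requestHandlerTimeoutSecs.
const PLACE_ID_REQUESTS_CHUNK_SIZE = 10;

// proxyConfiguration, playwrightRouter, and playwrightRequestQueue come from the rest of the project.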
fair-rose
fair-roseOP•3y ago
Thanks, it works. However, I don't get why I consume so much bandwidth.
fair-rose
fair-roseOP•3y ago
Is it possible to see all the requests made for each URL, e.g. https://dk.trustpilot.com/review/www.diba.dk?
fair-rose
fair-roseOP•3y ago
That way I can inspect and see which requests are unnecessary in Playwright, or do I need to use Chrome DevTools for that?
grumpy-cyan
grumpy-cyan•3y ago
The reason is that request interception disables the cache in Playwright, so you are downloading everything every single time.
grumpy-cyan
grumpy-cyan•3y ago
Cache responses in Puppeteer · Apify — Why and how to cache responses in memory using Puppeteer
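The article above targets Puppeteer. A rough Playwright equivalent of the same idea, caching static responses in memory via the route API, might look like this (a sketch under assumptions, not code from the thread or the article):

// Shared in-memory cache, keyed by URL (illustrative only).
const responseCache = new Map();

// preNavigationHook:
async ({ page }) => {
    await page.route('**/*', async (route) => {
        const url = route.request().url();
        const cached = responseCache.get(url);
        if (cached) {
            // Serve repeat requests from memory instead of re-downloading them.
            return route.fulfill({
                status: cached.status,
                contentType: cached.contentType,
                body: cached.body,
            });
        }

        const response = await route.fetch();
        if (['stylesheet', 'script', 'font'].includes(route.request().resourceType())) {
            responseCache.set(url, {
                status: response.status(),
                contentType: response.headers()['content-type'],
                body: await response.body(),
            });
        }

        // Fulfill with the live response for everything else.
        return route.fulfill({ response });
    });
};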
grumpy-cyan
grumpy-cyan•3y ago
It is possible to see them all! Just add this function to your preNavigationHooks:
async ({ page }) => {
    page.on('request', (req) => console.log(req));
};
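If the raw request objects are too noisy, a variation (again just a sketch, not from the thread) can log the URL, resource type, and approximate transfer size taken from the Content-Length header:

async ({ page }) => {
    page.on('response', async (response) => {
        const request = response.request();
        // Content-Length is not always present (e.g. chunked responses), hence the fallback.
        const size = (await response.allHeaders())['content-length'] ?? '?';
        console.log(`${request.resourceType()} ${size} B ${request.url()}`);
    });
};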
grumpy-cyan
grumpy-cyan•3y ago
All of this stuff is covered in our Playwright/Puppeteer course in the academy: https://developers.apify.com/academy/puppeteer-playwright
Apify
Puppeteer & Playwright · Apify Developers
Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.
fair-rose
fair-roseOP•3y ago
Thanks. I have this, but I cannot get access to the URL. I presume it's because I need to await it, but I can't use await there:
[screenshot attachment]
grumpy-cyan
grumpy-cyan•3y ago
req.url() is a function and does not need to be awaited.
page.on('request', (req) => console.log(req.url()));
fair-rose
fair-roseOP•3y ago
Thanks. I missed the ().
grumpy-cyan
grumpy-cyan•3y ago
I agree that it should be a getter instead of a function. req.url makes much more sense than req.url().
fair-rose
fair-roseOP•3y ago
Yeah, but it is a small issue. Amazing how much bandwidth is saved by the cache:
98 requests / 1.6 MB without cache
96 requests / 54 KB with cache
Is there a better option to not download unnecessary files than manually intercepting requests?
grumpy-cyan
grumpy-cyan•3y ago
Nope, sadly.
