Disable images in Playwright

How can I disable downloading images, videos, and other media globally for my scraper?
22 Replies
grumpy-cyan
grumpy-cyan•3y ago
You can create an array of resourceTypes that you'd like to block.
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];
Then, in your crawler's preNavigationHooks, add this function:
async ({ page }) => {
    await page.route('**/*', (route) => {
        if (BLOCKED.includes(route.request().resourceType())) return route.abort();
        return route.continue();
    });
};
fair-rose
fair-roseOP•3y ago
Thanks I will try that
grumpy-cyan
grumpy-cyan•3y ago
You can also check out this article https://scrapingant.com/blog/block-requests-playwright
Block resources with Playwright | ScrapingAnt Blog
This article will show you how to intercept and block requests with Playwright using the request interception API. Learn how to block images, CSS and Javascript loading.
fair-rose
fair-roseOP•3y ago
Thanks
fair-rose
fair-roseOP•3y ago
I have this in my main.ts file:
[screenshot of main.ts attachment]
fair-rose
fair-roseOP•3y ago
It does not work yet. Can you spot an error?
fair-rose
fair-roseOP•3y ago
I inject it here:
[screenshot attachment]
grumpy-cyan
grumpy-cyan•3y ago
Just add the function directly into the crawler
const playwrightCrawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: playwrightRouter,
    requestQueue: playwrightRequestQueue,
    headless: true,
    launchContext: {
        launcher: firefox,
    },
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                if (BLOCKED_RESOURCES.includes(route.request().resourceType())) {
                    return route.abort();
                }

                return route.continue();
            });
        },
    ],
    autoscaledPoolOptions: {
        desiredConcurrency: 6,
    },
    navigationTimeoutSecs: 45,
    requestHandlerTimeoutSecs: PLACE_ID_REQUESTS_CHUNK_SIZE * 15,
    maxRequestRetries: 4,
    // ! development only
    // maxRequestsPerCrawl: 1,
});
Here's one of my crawlers using the preNavigationHook
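For reference, the snippet above relies on a few values defined elsewhere in that project. A minimal sketch of what they could look like (the names appear in the snippet, but the concrete values and imports here are assumptions, not code from the thread):

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Resource types to abort before download (assumed to mirror the BLOCKED list above).
const BLOCKED_RESOURCES = ['image', 'stylesheet', 'media', 'font', 'other'];

// Hypothetical batch size, used only to scale requestHandlerTimeoutSecs.
const PLACE_ID_REQUESTS_CHUNK_SIZE = 10;

// proxyConfiguration, playwrightRouter, and playwrightRequestQueue come from the rest of the project.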
fair-rose
fair-roseOP•3y ago
Thanks, it works. However, I don't get why I consume so much bandwidth.
fair-rose
fair-roseOP•3y ago
Is it possible to see all the requests made for each URL, e.g. https://dk.trustpilot.com/review/www.diba.dk?
fair-rose
fair-roseOP•3y ago
That way I can inspect and see which requests are unnecessary in Playwright, or do I need to use Chrome DevTools for that?
grumpy-cyan
grumpy-cyan•3y ago
The reason is that request interception disables the cache in Playwright, so you are downloading everything every single time.
grumpy-cyan
grumpy-cyan•3y ago
Cache responses in Puppeteer · Apify — Why and how to cache responses in memory using Puppeteer
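The article above targets Puppeteer. A rough Playwright equivalent of the same idea, caching static responses in memory via the route API, might look like this (a sketch under assumptions, not code from the thread or the article):

// Shared in-memory cache, keyed by URL (illustrative only).
const responseCache = new Map();

// preNavigationHook:
async ({ page }) => {
    await page.route('**/*', async (route) => {
        const url = route.request().url();
        const cached = responseCache.get(url);
        if (cached) {
            // Serve repeat requests from memory instead of re-downloading them.
            return route.fulfill({
                status: cached.status,
                contentType: cached.contentType,
                body: cached.body,
            });
        }

        const response = await route.fetch();
        if (['stylesheet', 'script', 'font'].includes(route.request().resourceType())) {
            responseCache.set(url, {
                status: response.status(),
                contentType: response.headers()['content-type'],
                body: await response.body(),
            });
        }

        // Fulfill with the live response for everything else.
        return route.fulfill({ response });
    });
};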
grumpy-cyan
grumpy-cyan•3y ago
It is possible to see them all! Just add this function to your preNavigationHooks:
async ({ page }) => {
    page.on('request', (req) => console.log(req));
};
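If the raw request objects are too noisy, a variation (again just a sketch, not from the thread) can log the URL, resource type, and approximate transfer size taken from the Content-Length header:

async ({ page }) => {
    page.on('response', async (response) => {
        const request = response.request();
        // Content-Length is not always present (e.g. chunked responses), hence the fallback.
        const size = (await response.allHeaders())['content-length'] ?? '?';
        console.log(`${request.resourceType()} ${size} B ${request.url()}`);
    });
};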
grumpy-cyan
grumpy-cyan•3y ago
All of this stuff is covered in our Playwright/Puppeteer course in the academy: https://developers.apify.com/academy/puppeteer-playwright
Apify
Puppeteer & Playwright · Apify Developers
Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.
fair-rose
fair-roseOP•3y ago
Thanks. I have this, but I cannot get access to the URL. I presume it's because I need to await it, but I can't use await there:
[screenshot attachment]
grumpy-cyan
grumpy-cyan•3y ago
req.url() is a function and does not need to be awaited.
page.on('request', (req) => console.log(req.url()));
fair-rose
fair-roseOP•3y ago
Thanks. I missed the ().
grumpy-cyan
grumpy-cyan•3y ago
I agree that it should be a getter instead of a function. req.url makes much more sense than req.url().
fair-rose
fair-roseOP•3y ago
Yeah, but it is a small issue. Amazing how much bandwidth is saved by the cache:
98 requests / 1.6 MB without cache
96 requests / 54 KB with cache
Is there a better option to not download unnecessary files than manually intercepting requests?
grumpy-cyan
grumpy-cyan•3y ago
Nope, sadly.
