How to avoid requesting some static resources?

When crawling with Playwright or Puppeteer, a lot of static assets (e.g. js, css, png, jpg) are loaded. Is it possible to request static resources only the first time, and then reuse the cached data on later crawls without making a request?
4 Replies
harsh-harlequin (3y ago)
You can create an array of resource types that you'd like to block. Example for Playwright:
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];
Then, within the preNavigationHooks of your crawler, add this function:
async ({ page }) => {
    await page.route('**/*', (route) => {
        if (BLOCKED.includes(route.request().resourceType())) return route.abort();
        return route.continue();
    });
};
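Since the question also mentions Puppeteer, here is a rough sketch of what the same idea could look like there. `blockResources` is a made-up helper name, and this has not been checked against Crawlee's own interception helpers, which may conflict with enabling interception manually.

```javascript
// Hypothetical helper: a Puppeteer counterpart of the Playwright hook above.
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];

async function blockResources(page, blocked = BLOCKED) {
    // Puppeteer only lets you abort/continue requests once interception is enabled.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        // Abort blocked resource types, let everything else through.
        if (blocked.includes(request.resourceType())) return request.abort();
        return request.continue();
    });
}
```

It could then be called from a hook, for example `async ({ page }) => { await blockResources(page); }`.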
Or you can try to use Crawlee util functions (also in preNavigationHooks option): https://crawlee.dev/api/3.0/playwright-crawler/namespace/playwrightUtils#blockRequests https://crawlee.dev/api/3.0/puppeteer-crawler/namespace/puppeteerUtils#blockRequests
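A minimal sketch of how the linked `blockRequests` utility might be wired in. `makeBlockHook` and the pattern list are illustrative assumptions; the utils object is passed in so the same hook can take either `playwrightUtils` or `puppeteerUtils`. As far as I know, these helpers work through the Chrome DevTools Protocol rather than request interception, so with Playwright they should only have an effect in Chromium.

```javascript
// Illustrative URL patterns; adjust to the assets you want to skip.
const URL_PATTERNS = ['.css', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff'];

// Hypothetical factory: `utils` is expected to be Crawlee's playwrightUtils
// or puppeteerUtils, both of which expose blockRequests(page, options).
function makeBlockHook(utils, urlPatterns = URL_PATTERNS) {
    return async ({ page }) => {
        await utils.blockRequests(page, { urlPatterns });
    };
}
```

With Crawlee installed, this would be used as `preNavigationHooks: [makeBlockHook(playwrightUtils)]` in the crawler options.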
ratty-blush (OP, 3y ago)
@Oleg V. thanks for the help ❤️. Is it possible to request static resources only the first time, and then serve the cached data for later responses without making a request?
absent-sapphire (3y ago)
I think that is what most browsers do by default (including when run in Crawlee). For the specific cases where they don't, you can maintain your own cache object and intercept requests to serve them the cached responses.
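One way the "own cache object" idea could look with Playwright is sketched below. Note that the Playwright docs state that enabling routing with `page.route` disables the HTTP cache, which is one reason a manual cache can be useful. `createCachingHandler` and `CACHED_TYPES` are made-up names, the cache is a plain in-memory `Map` that lives only as long as the process, and headers are replayed exactly as received, which may need adjusting for some sites. It relies on `route.fetch()`, which needs a reasonably recent Playwright version.

```javascript
// Hypothetical list of resource types worth caching.
const CACHED_TYPES = ['script', 'stylesheet', 'image', 'font'];

// Returns a route handler that serves repeat GET requests from `cache`.
function createCachingHandler(cache = new Map(), cachedTypes = CACHED_TYPES) {
    return async (route) => {
        const request = route.request();
        const cacheable = request.method() === 'GET'
            && cachedTypes.includes(request.resourceType());
        if (!cacheable) return route.continue();

        const url = request.url();
        const hit = cache.get(url);
        // Cache hit: answer from memory, no network request is made.
        if (hit) return route.fulfill(hit);

        // Cache miss: fetch once over the network and remember the result.
        const response = await route.fetch();
        const entry = {
            status: response.status(),
            headers: response.headers(),
            body: await response.body(),
        };
        // Only successful responses are stored for reuse.
        if (response.ok()) cache.set(url, entry);
        return route.fulfill(entry);
    };
}
```

In a crawler this might be registered in preNavigationHooks as `async ({ page }) => { await page.route('**/*', createCachingHandler(sharedCache)); }`, where `sharedCache` is a `Map` created once outside the hook so that all pages share it.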
