How to avoid requesting some static resources?
When crawling with Playwright or Puppeteer, a lot of static assets (e.g. js, css, png, jpg) are loaded.
Is it possible to request static resources only the first time, and reuse the cached data on subsequent crawls without making a request?
harsh-harlequin•3y ago
You can create an array of resourceTypes that you'd like to block.
Example for Playwright:
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];
Then, in your crawler's preNavigationHooks, add a function that aborts requests whose resource type is in that list.
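A minimal sketch of such a hook, assuming Playwright's page.route API. The name blockResourcesHook is hypothetical, and this is not necessarily the function the poster had in mind:

```javascript
// Resource types to drop (same list as in the reply above).
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];

// Hypothetical hook name; it receives the crawling context, which exposes `page`.
async function blockResourcesHook({ page }) {
    await page.route('**/*', (route) => {
        // Abort requests whose type is on the block list, let the rest through.
        if (BLOCKED.includes(route.request().resourceType())) {
            return route.abort();
        }
        return route.continue();
    });
}

// Usage (assumes Crawlee 3.x is installed):
// const { PlaywrightCrawler } = require('crawlee');
// const crawler = new PlaywrightCrawler({
//     preNavigationHooks: [blockResourcesHook],
//     requestHandler: async ({ page }) => { /* ... */ },
// });
```

Note that 'script' is not in the list, so js files would still be loaded; add it only if the pages render without their scripts.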
Or you can use the Crawlee utility functions (also in the preNavigationHooks option):
https://crawlee.dev/api/3.0/playwright-crawler/namespace/playwrightUtils#blockRequests
https://crawlee.dev/api/3.0/puppeteer-crawler/namespace/puppeteerUtils#blockRequests
ratty-blush OP•3y ago
@Oleg V. thanks for the help ❤️. Is it possible to request static resources only the first time, and use the cached data for later responses without making a request?
absent-sapphire•3y ago
I think that is what most browsers do by default (including when run in Crawlee). If it doesn't happen in your specific case, you can maintain your own cache object and intercept requests to serve them cached responses.
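A rough sketch of the "own cache object" idea, assuming Playwright 1.29+ (for route.fetch) and an in-memory Map keyed by URL. The names cache, CACHEABLE and cacheStaticHook are hypothetical, and the cache here ignores Cache-Control headers and lives only as long as the process. Keep in mind that, per the Playwright docs, enabling routing disables the browser's own HTTP cache, so a hook like this replaces it rather than complementing it:

```javascript
// In-memory cache shared by every page in this process; key = request URL.
const cache = new Map();
// Resource types treated as static (an assumption; adjust to your site).
const CACHEABLE = ['script', 'stylesheet', 'image', 'font'];

// Hypothetical hook name; pass it in the crawler's preNavigationHooks array.
async function cacheStaticHook({ page }) {
    await page.route('**/*', async (route) => {
        const request = route.request();
        // Only GET requests for static assets go through the cache.
        if (request.method() !== 'GET' || !CACHEABLE.includes(request.resourceType())) {
            return route.continue();
        }
        const url = request.url();
        let entry = cache.get(url);
        if (!entry) {
            // First time: perform the real request and remember the result.
            const response = await route.fetch();
            entry = {
                status: response.status(),
                headers: response.headers(),
                body: await response.body(),
            };
            // Store only successful responses so errors are retried next time.
            if (response.ok()) cache.set(url, entry);
        }
        // Answer from memory; no network request is made on a cache hit.
        return route.fulfill(entry);
    });
}
```

To reuse the assets across separate crawler runs, the Map would have to be persisted somewhere (for example to disk), which this sketch does not do.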