Share cache between multiple Crawlee instances
I am using Crawlee with Chromium Playwright to scrape information about products from various retailers. For some of the information I need to extract, I have to run a headless browser to be able to interact with the page.
I noticed that for one of my targets a lot of network transfer goes to scripts (js, json, css) that are the same for all products. So if I scrape a long list of products, these resources get cached and their impact on the overall transferred data size is small. On the other hand, if I scrape only a few pages per session, all of these resources have to be downloaded again, because the cache starts out empty for every Playwright session / context.
Does anyone have an idea how I could reuse the same cache in Playwright / Crawlee between two or more runs of my script?
2 Replies
absent-sapphire•3y ago
I don't have the exact answer but 2 ideas:
1. Implement a cache yourself by storing cached resources in an in-memory object and then serving them to the browser via request interception (e.g. page.on('request')). This is probably not a great solution.
2. Point the browser launcher at a cached instance. Basically, you would have just one window. It stores data in a user data dir, but I'm not sure whether pointing there will reuse the cache. Some googling will probably help.
frail-apricot•3y ago
I am also looking into caching everything in my runs during development to speed up debugging.
This is a good start but I'll need to implement file storage.
https://docs.apify.com/academy/node-js/caching-responses-in-puppeteer
I'm mostly interested in caching the html/js/api requests so I can replay the exact same run when debugging.
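For the file-storage part, a minimal disk-backed cache could look like this (class name, layout, and hashing scheme are my own; the linked Apify article only covers the in-memory variant):

```typescript
import { createHash } from 'node:crypto';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

// Stores each cached response as two files keyed by a sha256 of the URL:
// <hash>.body holds the raw bytes, <hash>.json holds status + headers.
export class FileCache {
  constructor(private dir: string) {
    mkdirSync(dir, { recursive: true });
  }
  private key(url: string): string {
    return createHash('sha256').update(url).digest('hex');
  }
  set(url: string, entry: { status: number; headers: Record<string, string>; body: Buffer }): void {
    const k = this.key(url);
    writeFileSync(join(this.dir, `${k}.body`), entry.body);
    writeFileSync(join(this.dir, `${k}.json`), JSON.stringify({ status: entry.status, headers: entry.headers }));
  }
  get(url: string): { status: number; headers: Record<string, string>; body: Buffer } | undefined {
    const k = this.key(url);
    const metaPath = join(this.dir, `${k}.json`);
    if (!existsSync(metaPath)) return undefined;
    const meta = JSON.parse(readFileSync(metaPath, 'utf8'));
    return { ...meta, body: readFileSync(join(this.dir, `${k}.body`)) };
  }
}
```

Plugging `get`/`set` into the request-interception handler instead of a Map would then let a later run replay the same responses for debugging.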