Caching requests for development and testing
Hi,
I'm wondering what people are doing (if anything) to record and replay requests while building scrapers. A lot of scraper building is trial and error, making sure you have the right selectors, JSON paths, etc., so I end up running my code a fair few times. Ideally I'd cache the initial request to each endpoint and replay it when it's requested again, just for development, so I'm not continually hitting the website (both for politeness and to reduce the chances of triggering any anti-bot provisions).
Thinking back to my Ruby days, there was a package called VCR which would do this if you instantiated it before your HTTP requests, with ways to invalidate the cache. In JS there's Netflix's Polly, which I'm going to try out shortly, but I'm interested to hear what other people are doing/using, if anything.
I'm using a mix of crawlers (Basic, Cheerio, Playwright), so looking for something flexible.
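For reference, the kind of setup I'm expecting to try with Polly looks roughly like this (untested, pieced together from its docs; the recording name and directory are placeholders, and as far as I can tell the node-http adapter only hooks Node's http/https modules, so it wouldn't capture a Playwright browser's own traffic):
```ts
import { Polly } from '@pollyjs/core';
import NodeHttpAdapter from '@pollyjs/adapter-node-http';
import FSPersister from '@pollyjs/persister-fs';

// Register the adapter and persister once at startup.
Polly.register(NodeHttpAdapter);
Polly.register(FSPersister);

// Record on the first run, then replay recordings from disk on later runs.
const polly = new Polly('scraper-dev', {
    adapters: ['node-http'],
    persister: 'fs',
    recordIfMissing: true,
    persisterOptions: {
        fs: { recordingsDir: './recordings' },
    },
});

// ... run the crawler / make HTTP requests here ...

await polly.stop();
```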
Cheers!
4 Replies
Definitely a good point.
I typically use a named key-value store: before running the request or adding it to the queue, I check whether its response is already stored. If not, I let it run and then store the result in the key-value store, with a hash of the URL as the key and the response as the value.
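Roughly something like this (an untested sketch; `cachedGet`, the store name, and the sha256 keying are just my choices here, and it assumes Node 18+ for the global `fetch`):
```ts
import { createHash } from 'node:crypto';
import { KeyValueStore } from 'crawlee';

// Fetch a URL once, then serve it from a named key-value store on later runs.
async function cachedGet(url: string): Promise<string> {
    const store = await KeyValueStore.open('request-cache');
    const key = createHash('sha256').update(url).digest('hex');

    // getValue() resolves to null when the key isn't stored yet.
    const cached = await store.getValue<string>(key);
    if (cached !== null) return cached;

    const response = await fetch(url);
    const body = await response.text();
    await store.setValue(key, body, { contentType: 'text/html' });
    return body;
}
```
With a browser crawler like PlaywrightCrawler you'd intercept the page's requests instead of calling fetch directly, but the idea is the same.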
wise-whiteOP•4w ago
@azzouzana Interesting. I'm looking to do this anyway, as I want all requests stored so they can be audited if a scrape fails.