Caching requests for development and testing
Hi,
I'm wondering what people are doing (if anything) to record and replay requests while building scrapers. A lot of scraper building is trial and error, making sure you have the right selectors, JSON paths, etc., so I end up running my code a fair few times. Ideally I'd cache the initial request to each endpoint and replay it when it's requested again, just for development, so I'm not continually hitting the website (both for politeness and to reduce the chances of triggering any anti-bot provisions).
Thinking back to my Ruby days, there was a package called VCR which would do this if you instantiated it before your HTTP requests, with ways to invalidate the cache. In JS there's Netflix's Polly, which I'm going to try out shortly, but I'm interested to hear what other people are doing/using, if anything.
I'm using a mix of crawlers (Basic, Cheerio, Playwright), so looking for something flexible.
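For reference, the kind of setup I'm expecting to try with Polly looks roughly like this (untested, pieced together from its docs; the recording name and directory are placeholders, and as far as I can tell the node-http adapter only hooks Node's http/https modules, so it wouldn't capture a Playwright browser's own traffic):
```ts
import { Polly } from '@pollyjs/core';
import NodeHttpAdapter from '@pollyjs/adapter-node-http';
import FSPersister from '@pollyjs/persister-fs';

// Register the adapter and persister once at startup.
Polly.register(NodeHttpAdapter);
Polly.register(FSPersister);

// Record on the first run, then replay recordings from disk on later runs.
const polly = new Polly('scraper-dev', {
    adapters: ['node-http'],
    persister: 'fs',
    recordIfMissing: true,
    persisterOptions: {
        fs: { recordingsDir: './recordings' },
    },
});

// ... run the crawler / make HTTP requests here ...

await polly.stop();
```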
Cheers!
4 Replies
Definitely a good point.
I typically use a named key-value store: before running the request or adding it to the queue, I check whether its response is already stored. If not, I let it run and then store the result in the key-value store, with a hash of the URL as the key and the response as the value.
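Roughly something like this (an untested sketch; `cachedGet`, the store name, and the sha256 keying are just my choices here, and it assumes Node 18+ for the global `fetch`):
```ts
import { createHash } from 'node:crypto';
import { KeyValueStore } from 'crawlee';

// Fetch a URL once, then serve it from a named key-value store on later runs.
async function cachedGet(url: string): Promise<string> {
    const store = await KeyValueStore.open('request-cache');
    const key = createHash('sha256').update(url).digest('hex');

    // getValue() resolves to null when the key isn't stored yet.
    const cached = await store.getValue<string>(key);
    if (cached !== null) return cached;

    const response = await fetch(url);
    const body = await response.text();
    await store.setValue(key, body, { contentType: 'text/html' });
    return body;
}
```
With a browser crawler like PlaywrightCrawler you'd intercept the page's requests instead of calling fetch directly, but the idea is the same.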
wise-whiteOP•4w ago
@azzouzana Interesting. I'm looking to do this anyway, as I want all requests stored so they can be audited if a scrape fails.