Incremental Web scraping using Crawlee

Hey everyone. Currently, I am working on scraping a website where new content (pages) is added frequently (for example, a blog). When I run my scraper, it scrapes all pages successfully, but when I run it again tomorrow (after new pages have been added to the website), it starts scraping everything from scratch. I would be thankful for any advice, ideas, solutions, or examples of efficiently re-scraping without crawling the entire site again. Thank you in advance. 🙏🏻
memo23 — 5mo ago
@titavilanova2 dm me
azzouzana — 5mo ago
You can save your previously scraped URLs in some file (a simple file, or a named key-value store if you're using Crawlee). Then, on the next executions, you'd collect all URLs, filter out the ones you've already seen, and scrape only the delta. Or maybe check if the site has a sitemap file, which would let you spot new pages quickly.
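The delta approach above could be sketched roughly like this. This is a minimal stdlib-only illustration, not Crawlee's API: the state file name (`scraped_urls.json`) and all function names are hypothetical, and in a real Crawlee project you'd likely persist the set in a named key-value store instead of a local file.

```python
import json
from pathlib import Path

# Hypothetical state file holding URLs scraped on previous runs.
STATE_FILE = Path("scraped_urls.json")

def load_seen_urls() -> set:
    """Load the set of previously scraped URLs (empty set on first run)."""
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()))
    return set()

def save_seen_urls(urls: set) -> None:
    """Persist the full set of scraped URLs for the next run."""
    STATE_FILE.write_text(json.dumps(sorted(urls)))

def filter_new_urls(collected_urls: list, seen_urls: set) -> list:
    """Return only the URLs not scraped before (the delta)."""
    return [u for u in collected_urls if u not in seen_urls]

# Example flow: collect all URLs (e.g. from listing pages or a sitemap),
# keep only the new ones, scrape those, then record them as seen.
seen = load_seen_urls()
collected = ["https://example.com/post-1", "https://example.com/post-2"]
new = filter_new_urls(collected, seen)
# ...enqueue and scrape only `new` here...
save_seen_urls(seen | set(new))
```

On subsequent runs, `load_seen_urls` restores the previous state, so only URLs added since the last run survive the filter and get scraped.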