Incremental Web scraping using Crawlee

Hey everyone. Currently, I am working on scraping a website where new content (pages) is added frequently (for example, a blog). When I run my scraper, it scrapes all pages successfully, but when I run it again tomorrow (after new pages have been added to the website), it starts scraping everything from scratch. I would be thankful for any advice, ideas, solutions, or examples of efficiently re-scraping without crawling the entire site again. Thank you in advance. 🙏🏻
memo23 — 5mo ago
@titavilanova2 dm me
azzouzana — 5mo ago
You can save your previously scraped URLs in some file (a simple file, or a named key-value store if you're using Crawlee). Then, on the next executions, you'd collect all URLs, filter out the ones you've already seen, and scrape only the delta. Or maybe check if the site has a sitemap file, which would let you spot new pages quickly.
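The delta approach above could be sketched roughly like this. This is a minimal stdlib-only illustration, not Crawlee's API: the state file name (`scraped_urls.json`) and all function names are hypothetical, and in a real Crawlee project you'd likely persist the set in a named key-value store instead of a local file.

```python
import json
from pathlib import Path

# Hypothetical state file holding URLs scraped on previous runs.
STATE_FILE = Path("scraped_urls.json")

def load_seen_urls() -> set:
    """Load the set of previously scraped URLs (empty set on first run)."""
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()))
    return set()

def save_seen_urls(urls: set) -> None:
    """Persist the full set of scraped URLs for the next run."""
    STATE_FILE.write_text(json.dumps(sorted(urls)))

def filter_new_urls(collected_urls: list, seen_urls: set) -> list:
    """Return only the URLs not scraped before (the delta)."""
    return [u for u in collected_urls if u not in seen_urls]

# Example flow: collect all URLs (e.g. from listing pages or a sitemap),
# keep only the new ones, scrape those, then record them as seen.
seen = load_seen_urls()
collected = ["https://example.com/post-1", "https://example.com/post-2"]
new = filter_new_urls(collected, seen)
# ...enqueue and scrape only `new` here...
save_seen_urls(seen | set(new))
```

On subsequent runs, `load_seen_urls` restores the previous state, so only URLs added since the last run survive the filter and get scraped.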