How to avoid scraping the same pages when the crawler is restarted?
Hello everyone, is there a way to avoid scraping the same pages even if the crawler is restarted? I'm currently working on a news website crawler, but with each run of the scraper up to 80% of the news items are duplicates from previous runs. Any suggestions on how to address this issue effectively?
3 Replies
correct-apricot•12mo ago
If running locally, you can set the environment variable CRAWLEE_PURGE_ON_START to false; the crawler will then reuse the same request queue across runs and skip requests it has already handled.
https://crawlee.dev/api/3.8/core/interface/ConfigurationOptions#purgeOnStart
https://crawlee.dev/api/3.8/core/interface/ConfigurationOptions#purgeOnStart
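A minimal sketch of that approach in a Crawlee (Node.js) project. The same option can be set in code via a Configuration object instead of the environment variable; the handler body and start URL are just placeholders:
```ts
import { CheerioCrawler, Configuration } from 'crawlee';

// Equivalent to setting CRAWLEE_PURGE_ON_START=false:
// keep the default request queue between runs instead of purging it.
const config = new Configuration({ purgeOnStart: false });

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        console.log(`Processing ${request.url}`);
        await enqueueLinks(); // URLs already handled in earlier runs are skipped by the queue
    },
}, config);

await crawler.run(['https://example.com']); // placeholder start URL
```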
flat-fuchsia•12mo ago
Thanks
correct-apricot•12mo ago
If running on Apify, try naming your request queue; named storages persist between runs and are not purged on start.
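For example (a sketch; the queue name news-queue is just an illustration), you can open a named queue explicitly and pass it to the crawler. Because named storages on the Apify platform survive between runs, requests marked as handled in a previous run stay handled:
```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named storages are not purged on start, so requests handled
// in previous runs remain marked as done.
const requestQueue = await RequestQueue.open('news-queue'); // example name

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request }) {
        console.log(`Processing ${request.url}`);
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```
If you're using the Apify SDK directly, Actor.openRequestQueue('news-queue') opens the same named queue.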