Ignore previously crawled URLs
Is there a simple way to ignore previously crawled URLs, or should I implement my own logic to detect whether a URL has already been crawled and skip it? My current approach is to store scraped items in a separate database and then use a transform-requests function (https://crawlee.dev/docs/introduction/adding-urls#transform-requests) to decide whether or not to crawl each link.
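The approach described above can be sketched as follows. This is a minimal, self-contained illustration: the in-memory `crawledUrls` set stands in for the separate database, and `skipCrawled` follows the `transformRequestFunction` contract used by Crawlee's `enqueueLinks` (return the request to keep it, or `false` to drop it). The names `crawledUrls` and `skipCrawled` are hypothetical.

```typescript
// Stand-in for the external database of already-crawled URLs.
const crawledUrls = new Set<string>(['https://example.com/old-page']);

// Subset of Crawlee's RequestOptions shape that this sketch needs.
interface RequestOptions {
  url: string;
  [key: string]: unknown;
}

// transformRequestFunction-style filter: return the request unchanged
// if the URL is new, or `false` to skip enqueueing it.
function skipCrawled(request: RequestOptions): RequestOptions | false {
  return crawledUrls.has(request.url) ? false : request;
}

// Inside a crawler's request handler it would be wired up roughly as:
// await enqueueLinks({ transformRequestFunction: skipCrawled });
```

In a real crawler the `Set` lookup would be replaced by a query against the database, so the check survives restarts.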
2 Replies
Hi @Eitus, it depends on your use case. You can use a single named queue shared by all of your runs - URLs that are already in the queue will not be enqueued again. 🙂
https://docs.apify.com/platform/storage/request-queue
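The deduplication behaviour the reply relies on can be modelled in a few lines. This is a toy sketch, not Crawlee's implementation: a request queue keys each request by a unique key (defaulting to the URL), and adding an already-known key is a no-op. The class name `NamedQueueModel` is hypothetical; the real shared queue would be opened with `await RequestQueue.open('my-queue-name')` and passed to the crawler via its `requestQueue` option.

```typescript
// Toy model of a named request queue's dedup behaviour: requests are keyed
// by uniqueKey (defaulting to the URL), and re-adding a known key is a no-op.
class NamedQueueModel {
  private seen = new Set<string>();
  private pending: string[] = [];

  // Returns true if the request was actually enqueued, false if deduplicated.
  addRequest(url: string, uniqueKey: string = url): boolean {
    if (this.seen.has(uniqueKey)) return false;
    this.seen.add(uniqueKey);
    this.pending.push(url);
    return true;
  }
}
```

Because a named queue persists across runs, a URL enqueued in one run is still recorded when the next run starts, which is what makes this approach a drop-in way to skip previously crawled URLs.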
flat-fuchsiaOP•2y ago
Awesome, thank you!