Ignore previously crawled URLs
Is there a simple way to ignore previously crawled URLs, or should I implement my own logic to detect whether a URL has already been crawled and skip it? My current approach is to store scraped items in a separate database and then use a transform-requests function (https://crawlee.dev/docs/introduction/adding-urls#transform-requests) to decide whether or not to crawl each link.
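The approach described above can be sketched as follows. This is a minimal, self-contained illustration: the in-memory `crawledUrls` set stands in for the separate database, and `skipCrawled` follows the `transformRequestFunction` contract used by Crawlee's `enqueueLinks` (return the request to keep it, or `false` to drop it). The names `crawledUrls` and `skipCrawled` are hypothetical.

```typescript
// Stand-in for the external database of already-crawled URLs.
const crawledUrls = new Set<string>(['https://example.com/old-page']);

// Subset of Crawlee's RequestOptions shape that this sketch needs.
interface RequestOptions {
  url: string;
  [key: string]: unknown;
}

// transformRequestFunction-style filter: return the request unchanged
// if the URL is new, or `false` to skip enqueueing it.
function skipCrawled(request: RequestOptions): RequestOptions | false {
  return crawledUrls.has(request.url) ? false : request;
}

// Inside a crawler's request handler it would be wired up roughly as:
// await enqueueLinks({ transformRequestFunction: skipCrawled });
```

In a real crawler the `Set` lookup would be replaced by a query against the database, so the check survives restarts.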
2 Replies
Hi @Eitus, it depends on your use case. You can use a single named queue shared by all of your runs - URLs that are already in the queue will not be enqueued again. 🙂
https://docs.apify.com/platform/storage/request-queue
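The deduplication behaviour the reply relies on can be modelled in a few lines. This is a toy sketch, not Crawlee's implementation: a request queue keys each request by a unique key (defaulting to the URL), and adding an already-known key is a no-op. The class name `NamedQueueModel` is hypothetical; the real shared queue would be opened with `await RequestQueue.open('my-queue-name')` and passed to the crawler via its `requestQueue` option.

```typescript
// Toy model of a named request queue's dedup behaviour: requests are keyed
// by uniqueKey (defaulting to the URL), and re-adding a known key is a no-op.
class NamedQueueModel {
  private seen = new Set<string>();
  private pending: string[] = [];

  // Returns true if the request was actually enqueued, false if deduplicated.
  addRequest(url: string, uniqueKey: string = url): boolean {
    if (this.seen.has(uniqueKey)) return false;
    this.seen.add(uniqueKey);
    this.pending.push(url);
    return true;
  }
}
```

Because a named queue persists across runs, a URL enqueued in one run is still recorded when the next run starts, which is what makes this approach a drop-in way to skip previously crawled URLs.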
flat-fuchsiaOP•2y ago
Awesome, thank you!