Best practices to avoid re-crawling links that were already crawled when an Actor runs as a cron job

Hi, I'm building an Actor that goes through a list and then visits each individual item's page to extract information. The items themselves don't really change: new items can appear in the list and old ones can get removed, but once an item's details have been extracted, there's no need to extract them again on subsequent Actor runs (e.g. the Actor runs twice a day). I'm planning to use PostgreSQL and Prisma to store the extracted item details. Is it a fine decision to access the target database while crawling within the Actor (e.g. to check whether a URL was already scraped previously)? Or is there some better solution, possibly with Apify's built-in tools? Thanks
2 Replies
dependent-tan (OP) · 2y ago
For now I've resolved this by storing each URL slug in a named key-value store and then checking whether it's already in there. Not sure if that's the correct approach. Any thoughts?
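A minimal sketch of that pattern. The store name `scraped-slugs` and the slug rule are assumptions; the `Map` here stands in for a named key-value store, which in a real Actor you would open with `await Actor.openKeyValueStore('scraped-slugs')` from the `apify` SDK (same `getValue`/`setValue`-style shape):

```javascript
// Derive a store-safe key from an item URL (hypothetical slug rule).
function slugFromUrl(url) {
  return new URL(url).pathname.replace(/\W+/g, '-').replace(/^-|-$/g, '');
}

// Stand-in for a named key-value store; a real Actor would use
// `await Actor.openKeyValueStore('scraped-slugs')` instead (assumed name).
const store = new Map();

// Returns true exactly once per slug: first call records the slug,
// later calls see it and report "already scraped".
async function shouldScrape(url) {
  const slug = slugFromUrl(url);
  if (store.has(slug)) return false;          // scraped on a previous run, skip
  store.set(slug, { scrapedAt: Date.now() }); // mark as done before scraping
  return true;
}
```

This works, but note that a named key-value store only gives you the lookup; you still have to remember to write the slug yourself after each successful detail-page scrape.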
like-gold · 2y ago
I guess you could use a named request queue for the detail pages. It automatically ignores URLs that were already handled. Checking the DB during the run is also fine, just slower.
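To sketch why this works: a named request queue persists between runs, and `addRequest()` dedupes on each request's `uniqueKey` (derived from the normalized URL by default), so detail URLs handled in a previous run are simply not enqueued again. The class below is a tiny stand-in for the real queue, which an Actor would open with `await Actor.openRequestQueue('detail-pages')` (queue name assumed); the return shape mirrors the `wasAlreadyPresent` flag that the real `addRequest()` reports:

```javascript
// In-memory stand-in for a persistent named request queue.
class FakeRequestQueue {
  constructor() {
    this.seen = new Set(); // uniqueKeys of enqueued/handled requests
  }

  // Mirrors RequestQueue#addRequest: a request whose uniqueKey was
  // already seen is ignored rather than enqueued a second time.
  async addRequest({ url }) {
    const uniqueKey = url; // real queues derive this from a normalized URL
    const wasAlreadyPresent = this.seen.has(uniqueKey);
    if (!wasAlreadyPresent) this.seen.add(uniqueKey);
    return { wasAlreadyPresent };
  }
}
```

The upside over the key-value-store approach is that dedup happens at enqueue time for free, with no manual bookkeeping after each scrape.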
