Best practices to avoid re-crawling links that were already crawled when an Actor runs as a cron job

Hi, I'm building an Actor that goes through a list and then visits each individual item's page to extract information. The items themselves don't really change: new items can appear in the list and old ones can get removed, but once an item's details have been extracted, there's no need to extract them again on subsequent Actor runs (e.g. the Actor runs twice a day). I'm planning to use PostgreSQL and Prisma to store the extracted item details. Is it a fine decision to access the target database while crawling within the Actor (e.g. to check whether a URL was already scraped previously)? Or is there some better solution, possibly with Apify's built-in tools? Thanks
2 Replies
dependent-tan (OP) · 2y ago
For now I've resolved this by storing each URL slug in a named key-value store and then checking whether it's already in there. Not sure if that's the correct approach. Any thoughts?
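A minimal sketch of that pattern. The store name `scraped-slugs` and the slug rule are assumptions; the `Map` here stands in for a named key-value store, which in a real Actor you would open with `await Actor.openKeyValueStore('scraped-slugs')` from the `apify` SDK (same `getValue`/`setValue`-style shape):

```javascript
// Derive a store-safe key from an item URL (hypothetical slug rule).
function slugFromUrl(url) {
  return new URL(url).pathname.replace(/\W+/g, '-').replace(/^-|-$/g, '');
}

// Stand-in for a named key-value store; a real Actor would use
// `await Actor.openKeyValueStore('scraped-slugs')` instead (assumed name).
const store = new Map();

// Returns true exactly once per slug: first call records the slug,
// later calls see it and report "already scraped".
async function shouldScrape(url) {
  const slug = slugFromUrl(url);
  if (store.has(slug)) return false;          // scraped on a previous run, skip
  store.set(slug, { scrapedAt: Date.now() }); // mark as done before scraping
  return true;
}
```

This works, but note that a named key-value store only gives you the lookup; you still have to remember to write the slug yourself after each successful detail-page scrape.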
like-gold · 2y ago
I guess you could use a named request queue for the detail pages. It automatically ignores URLs that were already handled. Checking the DB during the run is also fine, just slower.
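To sketch why this works: a named request queue persists between runs, and `addRequest()` dedupes on each request's `uniqueKey` (derived from the normalized URL by default), so detail URLs handled in a previous run are simply not enqueued again. The class below is a tiny stand-in for the real queue, which an Actor would open with `await Actor.openRequestQueue('detail-pages')` (queue name assumed); the return shape mirrors the `wasAlreadyPresent` flag that the real `addRequest()` reports:

```javascript
// In-memory stand-in for a persistent named request queue.
class FakeRequestQueue {
  constructor() {
    this.seen = new Set(); // uniqueKeys of enqueued/handled requests
  }

  // Mirrors RequestQueue#addRequest: a request whose uniqueKey was
  // already seen is ignored rather than enqueued a second time.
  async addRequest({ url }) {
    const uniqueKey = url; // real queues derive this from a normalized URL
    const wasAlreadyPresent = this.seen.has(uniqueKey);
    if (!wasAlreadyPresent) this.seen.add(uniqueKey);
    return { wasAlreadyPresent };
  }
}
```

The upside over the key-value-store approach is that dedup happens at enqueue time for free, with no manual bookkeeping after each scrape.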
