Crawlee scraper invoking the same handler multiple times

Hey all! I've built a Crawlee scraper, but for some reason it invokes the same handler multiple times, creating a lot of duplicate requests and entries in my dataset. Also:
- I've already tried manually setting uniqueKeys for all my requests.
- I've also tried setting maxConcurrency: 1 for the crawler.
- As you can see from the logs below, the issue is not that I'm adding the same requests multiple times. It's Crawlee that's invoking handlers multiple times with the same request.

Has anyone experienced the same issue? Any clue about what could be happening here? I've posted the question and all the details (code and logs) on Stack Overflow: https://stackoverflow.com/questions/77358550/crawlee-scrapper-visiting-the-same-url-multiple-times
7 Replies
like-gold•2y ago
You can try logging the uniqueKey of each request as it is being processed. That way we can tell whether it is a bug in Crawlee or in your code.
narrow-beigeOP•2y ago
I already did. In main.ts I have:
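    // Keep a reference to the original method so the wrapper can delegate to it.
    const originalAddRequestsFn = crawler.addRequests.bind(crawler);

    // Wrap crawler.addRequests to log every request as it is enqueued.
    crawler.addRequests = function(requests: Source[], options: CrawlerAddRequestsOptions) {
        if (requests.length > 1) {
            // Batch of initial requests: log only the count.
            log.info(`INITIAL REQUESTS = ${ requests.length }`);
        } else {
            // Single request: log its label, uniqueKey (if set) and URL.
            log.info(`${ requests[0].label } | ${ requests[0].uniqueKey || '-' } = ${ requests[0].url }`);
        }

        return originalAddRequestsFn(requests, options);
    }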
Is this what you mean? Or is there a better way to log them?
like-gold•2y ago
This is logging in addRequests, no? I meant in the handler function.
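For reference, handler-side logging could look roughly like the sketch below. It is only an illustration: withUniqueKeyLog, RequestLike and HandlerContext are hypothetical names, not Crawlee APIs, and the types cover only the fields the thread already logs (label, uniqueKey, url).

```typescript
// Minimal structural types: only the fields this sketch reads.
// They are assumptions for illustration, not Crawlee's full types.
interface RequestLike {
    url: string;
    uniqueKey: string;
    label?: string;
}

interface HandlerContext {
    request: RequestLike;
}

type Handler<C extends HandlerContext> = (ctx: C) => Promise<void>;

// How many times any wrapped handler has been invoked per uniqueKey.
// Shared across handlers, since a uniqueKey should be handled once overall.
const invocations = new Map<string, number>();

// Hypothetical helper: wraps a handler so every invocation logs the
// request's uniqueKey and flags keys that reach a handler more than once.
function withUniqueKeyLog<C extends HandlerContext>(
    handler: Handler<C>,
    logLine: (line: string) => void = console.log,
): Handler<C> {
    return async (ctx) => {
        const { uniqueKey, url, label } = ctx.request;
        const count = (invocations.get(uniqueKey) || 0) + 1;
        invocations.set(uniqueKey, count);

        logLine(`HANDLER ${label || '-'} | ${uniqueKey} = ${url} (invocation #${count})`);
        if (count > 1) {
            logLine(`DUPLICATE HANDLER CALL for ${uniqueKey}`);
        }

        await handler(ctx);
    };
}
```

Assuming the handlers receive a context with a request object, each one would be passed through withUniqueKeyLog(...) before being registered. Comparing these lines with the addRequests log then shows whether a key was enqueued twice or only handled twice.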
narrow-beigeOP•2y ago
Ah, sorry. These are the updated logs, with the uniqueKeys logged from both addRequests and the handlers. I've simplified the keys a bit, so now they are just the target URL (but they are still set manually). You can see it starts with 2 requests with keys https://site.com/page-a/user-0 and https://site.com/page-a/user-1. Those two are processed first and second, but for some reason the same handler is invoked again later with the same key https://site.com/page-a/user-1 (even though no additional request was added for it).
like-gold•2y ago
OK, so it looks like it is this issue: https://github.com/apify/crawlee/issues/2078. Try removing sameDomainDelaySecs.
narrow-beigeOP•2y ago
Ok, thanks. Good to know what it is, then. I'll try adding an await sleep() in the handlers instead and see if it works the same 😛
Ok, JFYI, I had the same issue with version 3.5.2 and even 3.5.0. Removing sameDomainDelaySecs and adding a sleep at the end of the handlers works well though, so I'll stick with that.
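In case it helps anyone landing here, the workaround can be sketched roughly as follows. This is a minimal illustration, not the code from my project: withTrailingDelay and AsyncHandler are hypothetical names, the local sleep is a plain setTimeout wrapper, and the delay value is up to you.

```typescript
type AsyncHandler<C> = (ctx: C) => Promise<void>;

// Plain promise-based sleep, defined locally so the sketch is self-contained.
const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs the handler, then waits delayMs before resolving.
// sleepFn is injectable only so the behaviour can be checked without real waiting.
function withTrailingDelay<C>(
    handler: AsyncHandler<C>,
    delayMs: number,
    sleepFn: (ms: number) => Promise<void> = sleep,
): AsyncHandler<C> {
    return async (ctx) => {
        try {
            await handler(ctx);
        } finally {
            // Assumption: the pause should also apply after a failed handler,
            // so that any retry is spaced out as well.
            await sleepFn(delayMs);
        }
    };
}
```

Each handler is wrapped before being registered, and sameDomainDelaySecs is dropped from the crawler options. Note that this only spaces requests out the way a per-domain delay would when the crawler runs with maxConcurrency: 1; with higher concurrency each handler slot sleeps independently.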
