sunny-green
To eliminate duplicate results from request retries, do I need to set a timeout between them?
The issue is that when a job fails, it gets retried up to the number of times specified in maxRequestRetries. However, when the retried jobs succeed, I end up with multiple identical results in the output, whereas I only need one.
For example: the first job fails and gets retried (which is intended), but because it is retried, say, two times and then succeeds, I receive two identical results when I actually need only one.
sunny-green (OP) • 2y ago
For example: input length = 200 links, output = 215 objects (should be 200).
rival-black • 2y ago
Hey there! The request queue deduplicates URLs, but I see you're explicitly setting the uniqueKey on the requests - why? From what I can tell, the problem is that duplicate URLs are being fed to the crawler as different requests, so when they succeed, they expectedly produce duplicates in the dataset.
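A minimal sketch of the difference being described here, assuming Crawlee's CheerioCrawler (the URLs and field names are placeholders): requests added by URL only are deduplicated by the default uniqueKey (the normalized URL), while explicit, distinct uniqueKeys make the queue treat the same URL as separate requests, and each successful request pushes its own item to the dataset.

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Failed requests are retried up to this many times; a retried request
    // that eventually succeeds still produces only one dataset item.
    maxRequestRetries: 3,
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.url, title: $('title').text() });
    },
});

// Deduplicated: uniqueKey defaults to the normalized URL, so the repeated
// URL collapses into a single request in the queue -> one dataset item.
await crawler.run([
    'https://example.com/page',
    'https://example.com/page',
]);

// NOT deduplicated: distinct explicit uniqueKeys make the queue treat the
// same URL as two separate requests -> two identical dataset items.
// await crawler.run([
//     { url: 'https://example.com/page', uniqueKey: 'a' },
//     { url: 'https://example.com/page', uniqueKey: 'b' },
// ]);
```

If that matches your setup, dropping the explicit uniqueKey (or deriving it from the URL itself) should let the queue deduplicate the repeated links, bringing the output count back down to the input count.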