initial-rose

Duplicated Requests due to Migration of Host

Sometimes I see this in my actor's log:
2022-11-16T20:17:23.429Z ACTOR: Notifying actor process about imminent migration to another host.
2022-11-16T20:17:43.435Z ERROR PuppeteerCrawler: The crawler was paused due to migration to another host, but some requests did not finish in time. Those requests' results may be duplicated.
2022-11-16T20:18:03.481Z ACTOR: Sending Docker container SIGTERM signal.
2022-11-16T20:18:06.919Z ACTOR: Pulling Docker image from repository.
2022-11-16T20:18:15.914Z ACTOR: Creating Docker container.
2022-11-16T20:18:17.083Z ACTOR: Starting Docker container.
2022-11-16T20:18:19.077Z Executing main command
2022-11-16T20:18:22.694Z INFO System info {"apifyVersion":"3.1.0","apifyClientVersion":"2.6.1","osType":"Linux","nodeVersion":"v16.15.0"}
2022-11-16T20:18:23.445Z INFO PuppeteerCrawler: Starting the crawl
2022-11-16T20:18:25.152Z INFO PuppeteerCrawler: starting request: https://example.com
2022-11-16T20:18:25.291Z INFO PuppeteerCrawler: starting request: https://example.com
Is there a way to prevent the migration? If that's not possible, is there a way to fail the request instead of creating duplicates?
2 Replies
Alexey Udovydchenko
There is no way to prevent migration. To handle duplicated requests correctly, save your data just before the handler function finishes. That way, when the request is restarted, the crawler parses the page again but the data is saved only once.
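A minimal sketch of the "save at the very end" idea, in plain JavaScript so it runs without the Apify SDK. The names `dataset`, `pushData`, `handlePage`, `flakyExtract`, and `demo` are hypothetical stand-ins for illustration; in a real actor the single write would be a call such as `Actor.pushData(items)` as the last step of your own handler.

```javascript
// Hypothetical in-memory stand-in for a dataset (assumption, not the Apify API).
const dataset = [];
async function pushData(items) {
    dataset.push(...items);
}

// Hypothetical handler: collect everything in memory, then save once at the end.
async function handlePage(url, extractItems) {
    const items = [];
    for await (const item of extractItems(url)) {
        items.push({ url, ...item });
    }
    // Single write as the last step: if the run is interrupted before this line,
    // nothing has been saved yet, so the retried request cannot add a duplicate.
    await pushData(items);
}

// Simulated extractor that is interrupted partway through its first attempt.
let attempt = 0;
async function* flakyExtract() {
    attempt += 1;
    yield { title: 'A' };
    if (attempt === 1) throw new Error('simulated interruption');
    yield { title: 'B' };
}

async function demo() {
    try {
        await handlePage('https://example.com', flakyExtract);
    } catch (err) {
        // First attempt was interrupted before anything was saved.
    }
    await handlePage('https://example.com', flakyExtract); // the retry
    return dataset;
}
```

This narrows the window for duplicates rather than closing it: an interruption that lands after the write but before the request is marked as handled would still cause the item to be saved twice.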
continuing-cyan, 3y ago
There is a 'migrating' event, Actor.on('migrating', ...), that you can respond to. It is very rare for a migration to produce duplicates, though. Usually you push data at the very end of the request, so the request is marked as done immediately afterwards. If the migration happens before the request is fully done, the request is retried. There is still a small chance of a duplicate, but I would probably just deduplicate these rare cases afterwards.
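Deduplicating afterwards could look roughly like the sketch below. It assumes each saved item carries a field that identifies it (here `url`, which is an assumption about your data), and `dedupeByKey` is a hypothetical helper, not part of any SDK.

```javascript
// Keeps the first item seen for each key and drops later repeats.
function dedupeByKey(items, keyOf) {
    const seen = new Set();
    const unique = [];
    for (const item of items) {
        const key = keyOf(item);
        if (seen.has(key)) continue; // already kept an item with this key
        seen.add(key);
        unique.push(item);
    }
    return unique;
}

// Example data with one duplicate, as a retried request might produce.
const items = [
    { url: 'https://example.com', title: 'Example' },
    { url: 'https://example.com', title: 'Example' },
    { url: 'https://example.com/about', title: 'About' },
];
const unique = dedupeByKey(items, (item) => item.url);
```

If one page legitimately yields several items, the URL alone is not a safe key; a combination of fields that is unique per item would be needed instead.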
