CheerioCrawler Timeout after 320 Seconds Error/Exception

In some of our CheerioCrawler actors, we continue to get some random timeout errors after 320 seconds that cause them to crash. This is an example of the error:
2023-06-08T07:28:54.464Z ERROR CheerioCrawler: An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated. This may have happened due to an internal error of Apify's API or due to a misconfigured crawler.
This is an example of another error message that occurs:
2023-06-08T07:28:54.467Z Error: Handling request failure of http://www.yellowpages.com/search?search_terms=9724388344 (VrMLvREYl5fRBJQ) timed out after 320 seconds.
2023-06-08T07:28:54.469Z at Timeout._onTimeout (/usr/src/app/node_modules/@apify/timeout/index.js:62:68)
2023-06-08T07:28:54.471Z at listOnTimeout (node:internal/timers:559:17)
2023-06-08T07:28:54.473Z at processTimers (node:internal/timers:502:7)
2023-06-08T07:28:54.475Z ERROR CheerioCrawler:AutoscaledPool: runTaskFunction failed.
2023-06-08T07:28:54.478Z Error: Handling request failure of http://www.yellowpages.com/search?search_terms=9724388344 (VrMLvREYl5fRBJQ) timed out after 320 seconds.
2023-06-08T07:28:54.480Z at Timeout._onTimeout (/usr/src/app/node_modules/@apify/timeout/index.js:62:68)
2023-06-08T07:28:54.482Z at listOnTimeout (node:internal/timers:559:17)
2023-06-08T07:28:54.484Z at processTimers (node:internal/timers:502:7)
I have tried wrapping the code in the failedRequestHandler in a try/catch block, but that doesn't provide any additional information.
I manually resurrected two recent jobs that failed, and their ids are
3NgQTkqcodOxjuDGZ
and
p185ASvE8SiStkvZX
. Any insight would be greatly appreciated, as this is impacting production. Thank you!
other-emerald · 2y ago
Firstly, I see you are sending each request twice. CheerioCrawler already uses got-scraping under the hood, and you have access to the response body directly in the requestHandler. You then send the request again with response = await gotScraping({ url: request.url, proxyUrl: proxyUrl }); that line is not needed. Secondly, I would change the Node version in the Dockerfile to 18. Generally speaking, I would start by changing the two things above. I would also recommend removing unused dependencies from package.json.
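For reference, here is a minimal sketch of what the corrected handler could look like. It assumes the Crawlee v3 CheerioCrawler context shape ({ request, $, body, log }); the selector and log message are illustrative, not taken from the original actor:

```javascript
// Sketch of a CheerioCrawler requestHandler that relies on the response
// Crawlee has already fetched and parsed. There is no second gotScraping()
// call: `$` is the Cheerio-parsed body, `body` is the raw HTML string.
const requestHandler = async ({ request, $, body, log }) => {
    const title = $('title').text();            // illustrative selector
    log.info(`${request.url} -> ${title}`);
    // ...extract and push data here, using `$` or `body` directly...
};
```

The handler is then passed to new CheerioCrawler({ requestHandler, ... }). Since the crawler fetches each page itself, dropping the duplicate request also removes one extra place where a slow proxy could stall the run.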
dependent-tan (OP) · 2y ago
Thank you so much for your assistance. I made the changes, and they appear to have resolved the issue. I will try making the changes to our other actors, too. Thanks again!
other-emerald · 2y ago
Happy to help 👍
dependent-tan (OP) · 2y ago
Hi, @Andrey Bykov! I thought I had resolved this issue by making the changes you recommended. However, another of our actors is experiencing similar behavior. This is the run ID:
xvUNRj7JExnWzfl4V
. Any other suggestions of what I could try?
other-emerald · 2y ago
At the beginning of the run I only see proxy errors (which probably means a bad proxy or some network issue). At the end it's a 30-second timeout, which in theory could also be a network issue (if the server response takes too long).
dependent-tan (OP) · 2y ago
Thank you for this suggestion. I have tried specifying residential proxies, and that seems to be working.
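For anyone hitting the same symptoms, the proxy and timeout knobs live in the crawler's options. This is a hedged sketch, assuming the Apify platform's RESIDENTIAL proxy group and Crawlee's CheerioCrawler options; the 60-second and retry values are illustrative, not tuned for this actor:

```javascript
// Illustrative option objects only; wiring them up requires the `apify`
// and `crawlee` packages running on the Apify platform.
const proxyConfigurationOptions = {
    groups: ['RESIDENTIAL'],      // residential proxy group (assumed available on the account)
};
const crawlerOptions = {
    // How long one requestHandler invocation may run before timing out.
    requestHandlerTimeoutSecs: 60,
    // Retry transient proxy/network failures a few extra times.
    maxRequestRetries: 5,
};
```

These would typically be combined as new CheerioCrawler({ proxyConfiguration: await Actor.createProxyConfiguration(proxyConfigurationOptions), ...crawlerOptions }).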
other-emerald · 2y ago
Perfect, happy to help!