isTaskReadyFunction failing randomly

I've built a CheerioCrawler that doesn't do anything super fancy: it takes a start URL, has two enqueueLinks handlers, and another handler that saves the URL and the body of the page to the dataset. I've exposed the GC and I run it after both request handlers; in the handler that saves the body, I also assign the body to null after saving it. But I get the error below randomly: sometimes at the beginning of the script, sometimes after 20k items scraped, sometimes after 50k. I can never get past 50-55k items.
macOS Ventura 13.1
Node v19.6.0 || npm 9.4.1
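Roughly, the described setup might look like the sketch below. This is not the original code (that is linked later in the thread); the selectors, labels, and start URL are placeholders.

```js
// Rough reconstruction of the described setup, NOT the original code:
// two enqueueLinks handlers plus a handler that saves the URL and body,
// with manual GC calls (requires running Node with --expose-gc).
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks }) => {
    // First enqueueLinks pass (placeholder selector/label).
    await enqueueLinks({ selector: 'a.listing', label: 'LIST' });
    if (global.gc) global.gc();
});

router.addHandler('LIST', async ({ enqueueLinks }) => {
    // Second enqueueLinks pass (placeholder selector/label).
    await enqueueLinks({ selector: 'a.item', label: 'DETAIL' });
    if (global.gc) global.gc();
});

router.addHandler('DETAIL', async ({ request, body, pushData }) => {
    // Save the URL and page body to the default dataset.
    let pageBody = body.toString();
    await pushData({ url: request.url, body: pageBody });
    pageBody = null; // drop the reference, as described above
    if (global.gc) global.gc();
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com/start']); // placeholder start URL
```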
ERROR CheerioCrawler:AutoscaledPool: isTaskReadyFunction failed
SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at RequestQueueFileSystemEntry.get (/Users/user/project/node_modules/@crawlee/memory-storage/fs/request-queue/fs.js:19:21)
at async RequestQueueClient.listHead (/Users/user/project/node_modules/@crawlee/memory-storage/resource-clients/request-queue.js:147:29)
at async RequestQueue._ensureHeadIsNonEmpty (/Users/user/project/node_modules/@crawlee/core/storages/request_queue.js:610:101)
at async RequestQueue.isEmpty (/Users/user/project/node_modules/@crawlee/core/storages/request_queue.js:526:9)
at async CheerioCrawler._isTaskReadyFunction (/Users/user/project/node_modules/@crawlee/basic/internals/basic-crawler.js:710:38)
at async AutoscaledPool._maybeRunTask (/Users/user/project/node_modules/@crawlee/core/autoscaling/autoscaled_pool.js:481:27)
10 Replies
Pepa J · 3y ago
Seems to me like one of those bugs caused by an empty file being created in the filesystem storage when pushing a request to the RequestQueue. It has happened in cases where you run locally and interrupt the process at some point while it is still running. The next run (using the same request queue) then contains these empty request files, which are not parseable. There may be another reason these empty files are created, but it is hard to tell without being able to reproduce it.
NeoNomade (OP) · 3y ago
I'm not interrupting it and restarting it.
NeoNomade (OP) · 3y ago
The issue happens on fresh starts, with local storage purged at startup. https://github.com/apify/crawlee/issues/1792 I've also posted my code there; I've just removed the domain that I'm scraping.
NeoNomade (OP) · 3y ago
I'm retrying the script with the enqueueLinks functions wrapped in try/catch blocks to see if they throw any errors, since those are the only places where I add requests to the RequestQueue. I also managed a barbarian, easy fix... it's ugly but it works: I searched the request_queues storage for an empty JSON file, deleted it, and restarted the process, and it works like a charm. I might end up with one duplicate request, but when we're talking about hundreds of thousands of requests, that's not so bad.
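For reference, a minimal sketch of that cleanup, assuming Crawlee's default local storage layout (./storage/request_queues/default); the directory path is an assumption and may differ on other setups:

```js
// Sketch: delete empty/unparseable request files from the local request
// queue before restarting. The queue directory path is an assumption
// based on Crawlee's default local storage layout.
import { readdir, readFile, unlink } from 'node:fs/promises';
import { join } from 'node:path';

const queueDir = './storage/request_queues/default';

for (const name of await readdir(queueDir)) {
    if (!name.endsWith('.json')) continue;
    const filePath = join(queueDir, name);
    try {
        JSON.parse(await readFile(filePath, 'utf8'));
    } catch {
        // This is the kind of file that triggers "Unexpected end of JSON input".
        await unlink(filePath);
        console.log(`Removed corrupted request file: ${filePath}`);
    }
}
```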
Pepa J · 3y ago
@NeoNomade hopefully it will be possible to reproduce the issue with the code you provided and it isn't something website-dependent, so we can fix it at the Crawlee level. 🙂
NeoNomade (OP) · 3y ago
For me it happens with any CheerioCrawler run that takes more than a few hours.
Pepa J · 3y ago
Even for multiple websites?
NeoNomade (OP) · 3y ago
I've tried every possible configuration.
robust-apricot · 3y ago
I have also encountered this problem when running a PuppeteerCrawler. So essentially, when your crawler fails, you delete the empty JSON files and restart?
NeoNomade (OP) · 3y ago
useSessionPool: true (helps a lot to reduce this)
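As a sketch, that option just goes into the crawler constructor; the request handler below is a stub, not the original one:

```js
import { CheerioCrawler } from 'crawlee';

// Session pool enabled as suggested above; the handler is a placeholder.
const crawler = new CheerioCrawler({
    useSessionPool: true,
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url });
    },
});

await crawler.run(['https://example.com']);
```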
