isTaskReadyFunction failing randomly
I've built a CheerioCrawler that doesn't do anything fancy: it takes a start URL, then it has two enqueueLinks handlers, and another handler that saves the URL and the body of the page to the dataset.
I've exposed the GC and I'm running it after both of the request handlers, as well as in the handler where I'm saving the body; there, I assign the body to null after saving it.
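A minimal sketch of the setup described above, assuming Crawlee's router API; the selectors, labels, and start URL are placeholders rather than the actual code from the linked issue:

```js
// Run with `node --expose-gc main.js` so that global.gc is available.
import { CheerioCrawler, Dataset, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// First enqueueLinks handler (selector and label are hypothetical).
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.category', label: 'LIST' });
    global.gc?.(); // run the exposed GC after the handler
});

// Second enqueueLinks handler (also hypothetical).
router.addHandler('LIST', async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
    global.gc?.();
});

// Handler that saves the URL and the page body to the dataset.
router.addHandler('DETAIL', async ({ request, body }) => {
    let html = body.toString();
    await Dataset.pushData({ url: request.url, body: html });
    html = null; // drop the reference after saving, as described above
    global.gc?.();
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com']); // placeholder start URL
```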
But I get this error randomly: sometimes at the beginning of the script, sometimes after 20k items scraped, sometimes after 50k, but I could never get past 50-55k items.
MacOS Ventura 13.1
Node v19.6.0 || npm 9.4.1
Seems to me like one of those bugs that are caused by an empty file being created in the filesystem storage when pushing a request to the RequestQueue. It happened in cases where you run locally and interrupt the run at some point while it is still in progress. The next run (using the same request queue) then contains these empty request files, which are not parseable. There may be another reason why these empty files are created, but it is hard to tell without being able to reproduce it.

I'm not interrupting it and restarting. The issue happens on fresh starts, with the local storage being purged at startup.
https://github.com/apify/crawlee/issues/1792
Here I've also posted my code; I've just removed the domain that I'm scraping.
I'm retrying the script with the enqueueLinks calls wrapped in try/catch blocks, to see if they surface any errors, because they are the only places where I'm adding requests to the RequestQueue.
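A sketch of that attempt; the handler name, selector, and label are hypothetical:

```js
import { createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Wrap the enqueueLinks call so any error it throws is logged explicitly
// instead of failing the whole request handler.
router.addHandler('LIST', async ({ request, enqueueLinks, log }) => {
    try {
        await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
    } catch (err) {
        log.error(`enqueueLinks failed on ${request.url}: ${err.message}`);
    }
});
```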
I managed to do a barbaric, easy fix... it's ugly, but it works.
I searched the request_queues folder for an empty JSON file, deleted it, and restarted the process; it works like a charm. I might end up with one duplicate request, but when we're talking about hundreds of thousands of requests, that's not so bad.
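For reference, a small cleanup script along those lines; it assumes Crawlee's default local storage layout (`./storage/request_queues/default`), which may differ in your setup:

```js
// Delete zero-byte request files from the local request queue before re-running.
import { readdirSync, statSync, unlinkSync } from 'node:fs';
import { join } from 'node:path';

const queueDir = './storage/request_queues/default'; // assumed default path

for (const name of readdirSync(queueDir)) {
    if (!name.endsWith('.json')) continue;
    const filePath = join(queueDir, name);
    if (statSync(filePath).size === 0) {
        console.log(`Removing empty request file: ${filePath}`);
        unlinkSync(filePath);
    }
}
```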
@NeoNomade hopefully it will be possible to reproduce the issue with the code you provided and it is not something website-dependent, so we can fix it at the Crawlee level. 🙂
For me it happens on any CheerioCrawler run that takes more than a few hours.
Even for multiple websites?
I've tried every possible configuration.
robust-apricot•3y ago
I have also encountered this problem when running a PuppeteerCrawler. So essentially, when your crawler fails, you delete the empty JSON files and restart?
useSessionPool: true (helps a lot to reduce this)
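A sketch of that suggestion, with illustrative values only (persistCookiesPerSession is an option often enabled alongside the session pool, not something from this thread):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,           // the suggestion above
    persistCookiesPerSession: true, // often enabled together with the session pool
    async requestHandler({ request, body, enqueueLinks }) {
        // ...same handlers as before
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```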