isTaskReadyFunction failing randomly
I've built a CheerioCrawler that doesn't do anything fancy: it takes a start URL, then it has two enqueueLinks handlers, and another handler that saves the URL and the body of the page to the dataset.
I've exposed the GC and I'm running it after both of the request handlers, as well as in the handler where I'm saving the body; there, I assign the body to null after saving it.
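A minimal sketch of the setup described above, assuming Crawlee's router API; the selectors, labels, and start URL are placeholders rather than the actual code from the linked issue:

```js
// Run with `node --expose-gc main.js` so that global.gc is available.
import { CheerioCrawler, Dataset, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// First enqueueLinks handler (selector and label are hypothetical).
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.category', label: 'LIST' });
    global.gc?.(); // run the exposed GC after the handler
});

// Second enqueueLinks handler (also hypothetical).
router.addHandler('LIST', async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
    global.gc?.();
});

// Handler that saves the URL and the page body to the dataset.
router.addHandler('DETAIL', async ({ request, body }) => {
    let html = body.toString();
    await Dataset.pushData({ url: request.url, body: html });
    html = null; // drop the reference after saving, as described above
    global.gc?.();
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com']); // placeholder start URL
```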
But I get this error randomly: sometimes at the beginning of the script, sometimes after 20k items scraped, sometimes after 50k, but I could never get past 50-55k items.
MacOS Ventura 13.1
Node v19.6.0 || npm 9.4.1
Seems to me like one of those bugs that are caused by an empty file being created in the filesystem storage when pushing a request to the RequestQueue. It happened in cases where you run locally and interrupt the run at some point while it is still in progress. The next run (using the same request queue) then contains these empty request files, which are not parseable. There may be another reason why these empty files are created, but it is hard to tell without being able to reproduce it.

I'm not interrupting it and restarting. The issue happens on fresh starts, with the local storage being purged at startup.
https://github.com/apify/crawlee/issues/1792
Here I've also posted my code; I've just removed the domain that I'm scraping.
I'm retrying the script with the enqueueLinks calls wrapped in try/catch blocks, to see if they surface any errors, because they are the only places where I'm adding requests to the RequestQueue.
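A sketch of that attempt; the handler name, selector, and label are hypothetical:

```js
import { createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Wrap the enqueueLinks call so any error it throws is logged explicitly
// instead of failing the whole request handler.
router.addHandler('LIST', async ({ request, enqueueLinks, log }) => {
    try {
        await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
    } catch (err) {
        log.error(`enqueueLinks failed on ${request.url}: ${err.message}`);
    }
});
```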
I managed to do a barbaric, easy fix... it's ugly, but it works.
I searched the request_queues folder for an empty JSON file, deleted it, and restarted the process; it works like a charm. I might end up with one duplicate request, but when we're talking about hundreds of thousands of requests, that's not so bad.
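For reference, a small cleanup script along those lines; it assumes Crawlee's default local storage layout (`./storage/request_queues/default`), which may differ in your setup:

```js
// Delete zero-byte request files from the local request queue before re-running.
import { readdirSync, statSync, unlinkSync } from 'node:fs';
import { join } from 'node:path';

const queueDir = './storage/request_queues/default'; // assumed default path

for (const name of readdirSync(queueDir)) {
    if (!name.endsWith('.json')) continue;
    const filePath = join(queueDir, name);
    if (statSync(filePath).size === 0) {
        console.log(`Removing empty request file: ${filePath}`);
        unlinkSync(filePath);
    }
}
```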
@NeoNomade hopefully it will be possible to reproduce the issue with the code you provided and it is not something website-dependent, so we can fix it at the Crawlee level. 🙂
For me it happens on any CheerioCrawler run that takes more than a few hours.
Even for multiple websites?
I've tried every possible configuration.
robust-apricot•3y ago
I have also encountered this problem when running a PuppeteerCrawler. So essentially, when your crawler fails, you delete the empty JSON files and restart?
useSessionPool: true (helps a lot to reduce this)
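A sketch of that suggestion, with illustrative values only (persistCookiesPerSession is an option often enabled alongside the session pool, not something from this thread):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,           // the suggestion above
    persistCookiesPerSession: true, // often enabled together with the session pool
    async requestHandler({ request, body, enqueueLinks }) {
        // ...same handlers as before
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```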