Node-cron with CheerioCrawler
I'm currently using CheerioCrawler together with node-cron to run my web scraping tasks. However, I've been having an issue with the crawler not stopping once it's done with its tasks.
I set up a node-cron job to fire every 30 seconds, but the crawler stays open after finishing its work. When the cron job fires again, it seems to create another instance. The first iteration works fine, but subsequent iterations do not start scraping all the pages as expected.
In the terminal it says the crawler has finished its tasks, but it never starts scraping again. I know I will eventually need to run it at a longer interval, such as every 30 minutes, so I need to figure out how to make it stop cleanly once it has finished its tasks.
Could anyone provide some guidance on how to make the crawler stop running after it has completed its tasks, so that the node-cron job can create a new instance and start the process again?
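(For context, a minimal sketch of the setup described above; the URL and handler body are placeholders, not the original code.)

```js
import cron from 'node-cron';
import { CheerioCrawler } from 'crawlee';

// One crawler instance shared across cron ticks -- the shape described above.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`Scraping ${request.url}`);
        // ... extract data here ...
    },
});

// Six-field node-cron expression: fires every 30 seconds.
cron.schedule('*/30 * * * * *', async () => {
    // The first run scrapes fine; on later runs the same URLs are already
    // marked as handled in the persisted default request queue, which
    // matches the "finished but never restarts" symptom.
    await crawler.run(['https://example.com']);
});
```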
Thank you so much for your help


7 Replies
extended-salmonOP•3y ago
Here's my crawler file.
I looked for a way to reset the crawler once it's done, like running crawler.sessionPool.abort() and similar functions. I also tried process.exit(0).
extended-salmonOP•3y ago
Instead of using node-cron to schedule the crawler, do you have any alternatives?
exotic-emerald•3y ago
Usually, it should stop automatically: crawler.run() will resolve once the request queue is empty.
Docs: https://crawlee.dev/api/next/cheerio-crawler/class/CheerioCrawler#run
Also, you can try to use crawler.teardown() to stop it.
Try adding a different console.log() to each handler. Maybe the actor gets stuck somewhere.
What is articleQueue.addArticle(article)? Is it a promise? Maybe it should have an await?
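(A rough sketch of those suggestions combined; articleQueue and startUrls stand in for the original code.)

```js
import { CheerioCrawler } from 'crawlee';

// articleQueue and startUrls are placeholders for the original code.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const article = { url: request.url, title: $('h1').text() };
        // If addArticle() returns a promise, awaiting it keeps the handler
        // from finishing before the insert is actually queued.
        await articleQueue.addArticle(article);
    },
});

await crawler.run(startUrls); // resolves once the request queue is empty
await crawler.teardown();     // then explicitly release the crawler's resources
```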
extended-salmonOP•3y ago
Hi Oleg, thank you for your answer.
I tried crawler.teardown(), and I also added multiple console.log() calls; the crawler doesn't seem to get stuck anywhere.
I think the problem isn't that the crawler doesn't stop, but that it fails to restart because it retains the memory of its completed state.
Do you know if there's a way to start it completely fresh, without keeping the finished state?
Concerning articleQueue.addArticle(article): it's a class I made to insert the articles into the database, but throttled. Articles get added to the queue, and every 5000 ms it inserts all of them into my MongoDB database in one go. I felt that was the optimal way to insert multiple articles; a DB operation for every scraped article wouldn't be optimal, in my opinion. I'm still a beginner, so maybe it's not even necessary. (A sketch of this pattern follows below.)
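(A sketch of the throttled queue pattern described, assuming a connected MongoDB collection is passed in; the class name and details are illustrative, not the original code.)

```js
class ArticleQueue {
    constructor(collection, intervalMs = 5000) {
        this.buffer = [];
        // Flush everything buffered so far in a single insertMany()
        // every intervalMs, instead of one DB write per article.
        this.timer = setInterval(async () => {
            if (this.buffer.length === 0) return;
            const batch = this.buffer.splice(0, this.buffer.length);
            await collection.insertMany(batch);
        }, intervalMs);
    }

    addArticle(article) {
        this.buffer.push(article);
    }

    stop() {
        clearInterval(this.timer);
    }
}
```

One caveat with this kind of batching: any articles still sitting in the buffer are lost if the process exits before the next flush.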
Instead of calling myCrawler() every 30 seconds, would it be better to make my crawler a class? Then every 30 seconds I would create a new instance: const newCrawler = new myCrawler();
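(For illustration, that idea as a factory function rather than a class; the handler body and startUrls are placeholders.)

```js
import cron from 'node-cron';
import { CheerioCrawler } from 'crawlee';

// Build a brand-new crawler for every scheduled run instead of reusing one.
function createCrawler() {
    return new CheerioCrawler({
        async requestHandler({ request, $ }) {
            // ... scrape articles here ...
        },
    });
}

cron.schedule('*/30 * * * * *', async () => {
    const crawler = createCrawler();
    await crawler.run(startUrls); // startUrls: placeholder for the real URL list
    await crawler.teardown();
});
```

Note that even a fresh instance still opens the same default request queue in storage, so by itself this may not clear the finished state.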
Thank you for your suggestions!
extended-salmonOP•3y ago
I've made this codesandbox if you'd like to have a look: https://codesandbox.io/p/sandbox/crazy-stitch-p759zo?file=%2Findex.js Cheers
exotic-emerald•3y ago
Maybe try the purgeDefaultStorages() function to clear the state:
https://crawlee.dev/api/next/core/function/purgeDefaultStorages
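(A sketch of how that might be wired into the scheduled run; createCrawler and startUrls are the placeholders from the earlier examples.)

```js
import cron from 'node-cron';
import { purgeDefaultStorages } from 'crawlee';

cron.schedule('*/30 * * * * *', async () => {
    // Drop the default request queue/dataset state left by previous runs,
    // so already-handled requests don't block the new run from scraping.
    await purgeDefaultStorages();

    const crawler = createCrawler();
    await crawler.run(startUrls);
    await crawler.teardown();
});
```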
Also, is it necessary to use "node-cron"?
Try the Apify platform. It should be easier, as it was created exactly for crawling purposes.
Apify has Schedules. Check them out:
https://docs.apify.com/platform/schedules