Crawler skipping Jobs after processing 5,000-6,000 Requests

For the past few days, I have been running the crawler with a high number of jobs, and I have run into a problem: not all jobs are processed by the CheerioCrawler, despite being added to the queue through addRequest([job]). I can't reliably reproduce it; it happens after roughly 5,000-6,000 jobs. My code doesn't crash, it just moves on to the next jobs (BullMQ job queue) without scraping the link. Up to that point the behavior is normal, since requests still reach the requestHandler (visible in the CheerioCrawler INFO log).
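For context, here is a minimal sketch of the kind of setup described: BullMQ jobs feeding a long-running CheerioCrawler through addRequests(). The queue name, Redis connection, and job payload shape are assumptions for illustration, not the poster's actual code.
```ts
import { CheerioCrawler } from 'crawlee';
import { Worker } from 'bullmq';

// Long-running crawler: keepAlive keeps it waiting for new requests
// instead of shutting down once the request queue drains.
const crawler = new CheerioCrawler({
    keepAlive: true,
    requestHandler: async ({ request, log }) => {
        log.info(`Scraping ${request.url}`);
        // ...extraction logic goes here...
    },
});

// BullMQ worker: each job is assumed to carry a single URL that gets
// pushed into the crawler's request queue. Queue name and connection
// details are placeholders.
new Worker(
    'scrape-jobs',
    async (job) => {
        await crawler.addRequests([{ url: job.data.url }]);
    },
    { connection: { host: 'localhost', port: 6379 } },
);

// With keepAlive: true this promise stays pending until the crawler is stopped.
await crawler.run();
```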
6 Replies
conscious-sapphireOP•2y ago
Here is where it starts misbehaving, and I have no idea why, because the jobs/URLs are valid. It seems the requests no longer reach the crawler.
conscious-sapphireOP•2y ago
And at this point, my Kafka consumer doesn't receive new data (products) from the scraper. This issue is still there; does anyone know how to solve it?
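To illustrate the data flow being described, here is a sketch of a requestHandler that forwards each scraped product to Kafka using kafkajs. The broker address, topic name, and selector are hypothetical; the real extraction logic is not shown in the thread.
```ts
import { CheerioCrawler } from 'crawlee';
import { Kafka } from 'kafkajs';

// Hypothetical broker and client names.
const kafka = new Kafka({ clientId: 'scraper', brokers: ['localhost:9092'] });
const producer = kafka.producer();
await producer.connect();

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Placeholder extraction; real selectors depend on the target site.
        const product = {
            url: request.url,
            title: $('h1').first().text().trim(),
        };
        // Forward each scraped product to Kafka so the downstream consumer receives it.
        await producer.send({
            topic: 'products',
            messages: [{ value: JSON.stringify(product) }],
        });
    },
});
```
If the crawler stops reaching the requestHandler, nothing is ever sent to the topic, which matches the consumer going silent.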
other-emerald•2y ago
Did you use a loop to run your crawler continuously, as long as there are URLs? How do you do that?
MEE6•2y ago
@Banul; just advanced to level 2! Thanks for your contributions! šŸŽ‰
conscious-sapphireOP•2y ago
I’m using BullMQ, and my CheerioCrawler has keepAlive set to true. I use cron to dispatch jobs to the crawler. In this case, the worker stops working after the next batch of jobs.
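The poster says they use cron for dispatch; one in-process way to get the same effect is a BullMQ repeatable job that re-enqueues work on a schedule, as sketched below. The queue name, connection, URL, and schedule are placeholders, and the worker from the earlier sketch would pick these jobs up and call crawler.addRequests().
```ts
import { Queue } from 'bullmq';

// Placeholder queue name and Redis connection.
const queue = new Queue('scrape-jobs', {
    connection: { host: 'localhost', port: 6379 },
});

// Repeatable job: re-enqueue the same dispatch every five minutes (cron-style pattern).
await queue.add(
    'dispatch-batch',
    { url: 'https://example.com/products' },
    { repeat: { pattern: '*/5 * * * *' } },
);
```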
MEE6•2y ago
@LARGO just advanced to level 2! Thanks for your contributions! šŸŽ‰
