Crawler skipping Jobs after processing 5,000-6,000 Requests

For the past few days, I have been running the crawler with a high number of jobs, and I have run into a problem: not all jobs are processed by the CheerioCrawler, despite being added to the queue through addRequest([job]). I can't reliably reproduce it; it happens after roughly 5,000-6,000 jobs. My code doesn't crash, it just moves on to the next jobs (BullMQ job queue) without scraping the link. Up to that point the behavior is normal, since requests still reach the requestHandler (visible in the CheerioCrawler INFO log).
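For context, here is a minimal sketch of the kind of setup described: BullMQ jobs feeding a long-running CheerioCrawler through addRequests(). The queue name, Redis connection, and job payload shape are assumptions for illustration, not the poster's actual code.
```ts
import { CheerioCrawler } from 'crawlee';
import { Worker } from 'bullmq';

// Long-running crawler: keepAlive keeps it waiting for new requests
// instead of shutting down once the request queue drains.
const crawler = new CheerioCrawler({
    keepAlive: true,
    requestHandler: async ({ request, log }) => {
        log.info(`Scraping ${request.url}`);
        // ...extraction logic goes here...
    },
});

// BullMQ worker: each job is assumed to carry a single URL that gets
// pushed into the crawler's request queue. Queue name and connection
// details are placeholders.
new Worker(
    'scrape-jobs',
    async (job) => {
        await crawler.addRequests([{ url: job.data.url }]);
    },
    { connection: { host: 'localhost', port: 6379 } },
);

// With keepAlive: true this promise stays pending until the crawler is stopped.
await crawler.run();
```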
6 Replies
conscious-sapphireOP•2y ago
Here is where it starts misbehaving, and I have no idea why, because the jobs/URLs are valid. It seems the requests no longer reach the crawler.
conscious-sapphireOP•2y ago
And at this point, my Kafka consumer doesn't receive new data (products) from the scraper. This issue is still there; does anyone know how to solve it?
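To illustrate the data flow being described, here is a sketch of a requestHandler that forwards each scraped product to Kafka using kafkajs. The broker address, topic name, and selector are hypothetical; the real extraction logic is not shown in the thread.
```ts
import { CheerioCrawler } from 'crawlee';
import { Kafka } from 'kafkajs';

// Hypothetical broker and client names.
const kafka = new Kafka({ clientId: 'scraper', brokers: ['localhost:9092'] });
const producer = kafka.producer();
await producer.connect();

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Placeholder extraction; real selectors depend on the target site.
        const product = {
            url: request.url,
            title: $('h1').first().text().trim(),
        };
        // Forward each scraped product to Kafka so the downstream consumer receives it.
        await producer.send({
            topic: 'products',
            messages: [{ value: JSON.stringify(product) }],
        });
    },
});
```
If the crawler stops reaching the requestHandler, nothing is ever sent to the topic, which matches the consumer going silent.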
other-emerald•2y ago
Did you use a loop to run your crawler continuously, as long as there are URLs? How do you do that?
MEE6•2y ago
@Banul; just advanced to level 2! Thanks for your contributions! šŸŽ‰
conscious-sapphireOP•2y ago
I’m using BullMQ, and my CheerioCrawler has keepAlive set to true. I use cron to dispatch jobs to the crawler. In this case, the worker stops working after the next batch of jobs.
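The poster says they use cron for dispatch; one in-process way to get the same effect is a BullMQ repeatable job that re-enqueues work on a schedule, as sketched below. The queue name, connection, URL, and schedule are placeholders, and the worker from the earlier sketch would pick these jobs up and call crawler.addRequests().
```ts
import { Queue } from 'bullmq';

// Placeholder queue name and Redis connection.
const queue = new Queue('scrape-jobs', {
    connection: { host: 'localhost', port: 6379 },
});

// Repeatable job: re-enqueue the same dispatch every five minutes (cron-style pattern).
await queue.add(
    'dispatch-batch',
    { url: 'https://example.com/products' },
    { repeat: { pattern: '*/5 * * * *' } },
);
```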
MEE6•2y ago
@LARGO just advanced to level 2! Thanks for your contributions! šŸŽ‰
