One or multiple instances of CheerioCrawler?

Hi community! I'm new to Crawlee, and I'm building a script that scrapes a number of specific, different domains. Each domain has a different number of pages to scrape: some have two to three thousand pages, while others might have just a few hundred (or even fewer). My doubt is this: if I put all the starting URLs into the same crawler instance, it might finish scraping one domain long before another. I thought about separating the domains and creating a crawler instance for each one, so that I can run each crawler separately and let it run its own course. Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy? TIA
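For context, the per-domain setup described in the question could look roughly like this. It's a minimal sketch, not a recommendation: the domain list, queue naming, and handler logic are all placeholders.

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical domain list -- replace with your real start URLs.
const domains = ['https://example-a.com', 'https://example-b.com'];

for (const domain of domains) {
    // A named queue per domain keeps each run isolated.
    // Queue names may only contain letters, digits, and dashes.
    const queueName = new URL(domain).hostname.replace(/\./g, '-');
    const requestQueue = await RequestQueue.open(queueName);

    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ request, $, enqueueLinks }) {
            // Placeholder extraction logic.
            console.log(`${request.url}: ${$('title').text()}`);
            // Only follow links within the current domain.
            await enqueueLinks({ strategy: 'same-domain' });
        },
    });

    // Runs sequentially here; each crawler finishes its own domain
    // before the next one starts.
    await crawler.run([domain]);
}
```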
broad-brown · 4mo ago
For your use case, creating a separate crawler instance for each domain could work, but it has potential downsides. Here's a breakdown to help you decide.

Downsides of multiple crawler instances:
1. Increased resource usage: each crawler instance maintains its own RequestQueue and autoscaled pool, and consumes its own memory. With many domains, this can significantly increase resource consumption.
2. Coordination complexity: managing multiple crawlers becomes complicated, especially when you need to monitor or restart them individually.
3. Potential limits on concurrency: depending on your system, running many instances in parallel can lead to CPU, memory, or network bottlenecks.

Instead, you can use one crawler instance with a shared RequestQueue and apply domain-specific logic inside the request handler. Crawlee's flexibility makes this approach efficient; see the sketch after this list:
1. Efficiency: a single instance uses resources more effectively.
2. Simpler monitoring: you have only one crawler to monitor, restart, or debug.
3. Better concurrency management: Crawlee lets you adjust maxConcurrency and maxRequestsPerCrawl, so you can balance the load across domains.
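A minimal sketch of that single-instance approach, assuming placeholder start URLs, limits, and extraction logic:

```ts
import { CheerioCrawler } from 'crawlee';

// All domains share one crawler and the default RequestQueue.
const startUrls = [
    'https://example-a.com',
    'https://example-b.com',
    // ...the rest of your domains
];

const crawler = new CheerioCrawler({
    maxConcurrency: 20,          // global concurrency cap across all domains
    maxRequestsPerCrawl: 50_000, // safety limit for a single run
    async requestHandler({ request, $, enqueueLinks }) {
        // Branch on the hostname for domain-specific extraction.
        const { hostname } = new URL(request.loadedUrl ?? request.url);
        console.log(`${hostname}: ${$('title').text()}`);
        // Keep discovered links inside the page's own domain.
        await enqueueLinks({ strategy: 'same-domain' });
    },
});

await crawler.run(startUrls);
```

Since all domains feed the same queue, a small domain finishing early simply frees capacity for the larger ones, rather than leaving an idle crawler behind.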
genetic-orange (OP) · 4mo ago
How do you recommend handling domains with lots of pages? I want to run the crawler every hour, but those domains sometimes take more than two hours to finish.
metropolitan-bronze · 2mo ago
Did you find a solution, @Vice?