One or multiple instances of CheerioCrawler?
Hi community! I'm new to Crawlee, and I'm building a script that scrapes many specific, different domains. Each domain has a different number of pages to scrape; some have two to three thousand pages, while others might have just a few hundred (or fewer).
My concern is this: if I put all the starting URLs into the same crawler instance, it might finish scraping one domain long before another. I thought about separating the domains and creating a crawler instance for each one, so that I can run each crawler independently and let it run its own course.
Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy?
TIA
4 Replies
broad-brown•4mo ago
For your use case, creating a separate crawler instance for each domain could work, but it has potential downsides. Here's a breakdown to help you decide:
Downsides of Multiple Crawler Instances:
1. Increased Resource Usage: Each crawler instance runs its own event loop, maintains its own RequestQueue, and consumes memory. If you have many domains, this approach might significantly increase resource consumption.
2. Coordination Complexity: Managing multiple crawlers can become complicated, especially when you need to monitor or restart them individually.
3. Potential Limits on Concurrency: Depending on your system, running many instances in parallel might lead to bottlenecks (CPU, memory, network).
Instead, you can use a single crawler instance with a shared RequestQueue and domain-specific handling logic. Crawlee's flexibility makes this approach efficient:
1. Efficiency: A single instance uses resources more effectively.
2. Simpler Monitoring: You have only one crawler to monitor, restart, or debug.
3. Better Concurrency Management: Crawlee lets you adjust maxConcurrency and maxRequestsPerCrawl, so you can balance the load across domains.
genetic-orangeOP•4mo ago
How do you recommend handling domains with lots of pages? I want to run the crawler every hour, but those domains sometimes take more than two hours to finish.
metropolitan-bronze•2mo ago
Did you find a solution @Vice ?