Why does Crawlee's CPU utilization drop and the crawler seem to stop processing any requests?

This always happens after about one hour of running. Here is a link to most of the code: https://medium.com/p/how-to-efficiently-scrape-millions-of-google-businesses-on-a-large-scale-using-a-distributed-35b9140030eb. Most of the machines are c5d.2xlarge Spot Instances. To work around the issue, I had to build a cron task that finds the instances with low CPU utilization and terminates them. The relevant Crawlee environment variables are:
CRAWLEE_MIN_CONCURRENCY: "3"
CRAWLEE_MAX_CONCURRENCY: "15"
CRAWLEE_MEMORY_MBYTES: "4096"
3 Replies
Pepa J · 2y ago
Hello @Tony Wang, I briefly checked the code and noticed a few spots:
minConcurrency: CRAWLEE_MIN_CONCURRENCY | 1,
maxConcurrency: CRAWLEE_MAX_CONCURRENCY | 3,
You probably don't want to use the bitwise OR operator there, since "16" | 1 = 17. Rather use minConcurrency: parseInt(CRAWLEE_MIN_CONCURRENCY) || 1 (see the sketch below). I cannot see the implementation of functions like googleMapConsentCheck(...), so it is hard to determine what is happening without the run or a log. There are also several places from which you call crawler.run(). Running the same crawler twice is not a good idea; are you sure you are not running the crawler several times at once? What is the motivation for this?
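A quick illustration of the difference (a minimal sketch, assuming the values come from process.env as in the linked article; the variable names are the ones from the question):

// "16" | 1 coerces the string to a number and does a bitwise OR, giving 17,
// and an unset variable becomes undefined | 1 = 1, regardless of the intended default.
console.log("16" | 1);       // 17
console.log(undefined | 1);  // 1

// parseInt + || falls back to the default only when the value is missing or not a number.
const minConcurrency = parseInt(process.env.CRAWLEE_MIN_CONCURRENCY, 10) || 1;  // "3"  -> 3
const maxConcurrency = parseInt(process.env.CRAWLEE_MAX_CONCURRENCY, 10) || 3;  // "15" -> 15
console.log({ minConcurrency, maxConcurrency });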
frequent-plum (OP) · 2y ago
@Pepa J Hi Pepa, thanks for your helpful reply and for pointing out the bugs. Do you think the bitwise OR operator or the crawler.run() calls are the key cause of the performance degradation? I only call crawler.run() when crawler.running is false; is this the right way to do it? There is nothing special inside googleMapConsentCheck; the issue is still there even when that part is removed.
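(For context, the guard described above amounts to roughly the following. This is a simplified sketch of the pattern, not the actual code; crawler.addRequests is shown only as the usual way to feed more work to a crawler that is already running.)

// Simplified sketch: start the crawler only if it is not already running,
// otherwise just enqueue the new requests into its request queue.
if (!crawler.running) {
    await crawler.run(requests);
} else {
    await crawler.addRequests(requests);
}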
Pepa J · 2y ago
I would suggest improving the logging so you know what is happening in the run and where it gets stuck. To keep the crawler running, there is the following option on the crawler constructor:

autoscaledPoolOptions: {
    isFinishedFunction: () => {
        return Promise.resolve(false);
    },
},
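In context, that could look roughly like the following (a sketch only; PlaywrightCrawler and the handler body are assumptions, since the original crawler setup isn't shown in this thread):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    minConcurrency: parseInt(process.env.CRAWLEE_MIN_CONCURRENCY, 10) || 1,
    maxConcurrency: parseInt(process.env.CRAWLEE_MAX_CONCURRENCY, 10) || 3,
    autoscaledPoolOptions: {
        // Never report the pool as finished, so the crawler keeps polling the
        // request queue instead of shutting down when the queue is momentarily empty.
        isFinishedFunction: () => Promise.resolve(false),
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run();

Note that with isFinishedFunction pinned to false the crawler never exits on its own, so you need your own termination logic once the work is really done.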
