CheerioCrawler hangs with 12 million urls

const requestList = await RequestList.open('My-ReqList', allUrls, { persistStateKey: 'My-ReqList' });
console.log(requestList.length());

const crawler = new CheerioCrawler({
    requestList,
    proxyConfiguration,
    requestHandler: router,
    minConcurrency: 32,
    maxConcurrency: 256,
    maxRequestRetries: 20,
    navigationTimeoutSecs: 6,
    loggingInterval: 30,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 20 times.`);
    },
});

await crawler.run();
allUrls is a list that contains 12 million URLs. I'm trying to load them into the CheerioCrawler, but the process hangs at around 14 GB of RAM and never even logs requestList.length(). Can anybody help, please?
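For scale, a rough way to see what just holding these as Request objects costs (a sketch only, assuming allUrls is already loaded; the numbers depend on URL length):

import { Request } from 'crawlee';

// Build 1M Request objects from a slice of allUrls and compare heap usage
// before and after; extrapolating x12 gives a rough feel for whether 12M
// requests can live in memory at once.
const before = process.memoryUsage().heapUsed;
const sample = allUrls.slice(0, 1_000_000).map((url) => new Request({ url }));
const after = process.memoryUsage().heapUsed;
console.log(`${sample.length} requests take roughly ${((after - before) / 1024 ** 2).toFixed(0)} MB on the heap`);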
7 Replies
NeoNomade (OP) · 3y ago
Changed the code to:
console.log(`Starting to add ${allUrls.length} urls to the RequestList`);

const ReqList = new RequestList({
    sources: allUrls,
    persistRequestsKey: 'My-ReqList',
    keepDuplicateUrls: false,
});
await ReqList.initialize();
console.log(ReqList.length()); // length() is a method on RequestList

const crawler = new CheerioCrawler({
    requestList: ReqList,
    proxyConfiguration,
    requestHandler: router,
    minConcurrency: 32,
    maxConcurrency: 256,
    maxRequestRetries: 20,
    navigationTimeoutSecs: 6,
    loggingInterval: 30,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 20 times.`);
    },
});

await crawler.run();
Tested on a smaller batch of 100k urls and it works perfectly. With 12M urls it has been running for 64 minutes now, stuck at 14.9 GB of memory usage (I increased max Node memory to 32 GB, I have 128 available). I will let it run longer because I still see CPU activity, but it looks like it hung. It takes 10 minutes to get all the urls... but enqueueing them is a pain that, for the moment, doesn't work at all.
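To narrow down where it hangs, a minimal sketch (assuming the same allUrls array) that times only the RequestList build, with persistence left out, since persisting ~12M requests to the key-value store is itself an expensive step:

import { RequestList } from 'crawlee';

// Time the RequestList construction separately from the crawl to see whether
// the hang happens while building/deduplicating the list or only once the
// crawler starts. persistRequestsKey is omitted here on purpose.
console.time('build RequestList');
const reqList = new RequestList({
    sources: allUrls,
    keepDuplicateUrls: false,
});
await reqList.initialize();
console.timeEnd('build RequestList');
console.log(`RequestList holds ${reqList.length()} unique requests`);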
foreign-sapphire · 3y ago
Are you running the crawler on the Apify platform? Can you share a link to your run, please? Also, can you please share the code where you assign the allUrls variable? Maybe there is some memory leak... Are you getting it from the input?
NeoNomade (OP) · 3y ago
Running locally, not on Apify. Just a sec, I will share the code in a Pastebin.
NeoNomade (OP) · 3y ago
Pastebin (link): import { CheerioCrawler, ProxyConfiguration, purgeDefaultStorages, ...
foreign-sapphire · 3y ago
Try to use RequestList instead of await crawler.addRequests(chunk); it represents a big static list of URLs to crawl: https://crawlee.dev/api/next/core/class/RequestList

I guess the issue is that your chunkSize is way too big and the scraper runs out of memory because of it. Or you can try to pass your array to crawler.run(), like here: https://crawlee.dev/docs/next/examples/crawl-multiple-urls

Example:
// Run the crawler with the initial requests
await crawler.run([ // put your allUrls var here
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
    'http://www.example.com/page-3',
]);
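With the variable from this thread, that option is a one-liner; as far as I understand, run() then feeds the array into the default request queue for you:

// Sketch: hand the whole array straight to run(). With 12M urls the array
// still has to fit in memory, so this mainly saves the manual chunking code.
await crawler.run(allUrls);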
NeoNomade (OP) · 3y ago
I have 128 GB of RAM and I allow 64 GB to Node. I've tried RequestList and it fails in a similar way. I will try to put the array into crawler.run(), that's the one thing I haven't tried yet.
foreign-sapphire · 3y ago
Then try to decrease chunkSize to 50-100k or something like that. Please let's continue the discussion in one ticket (it's the same issue, right?): https://discord.com/channels/801163717915574323/1092208304660414597
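A sketch of what the smaller chunks could look like, reusing allUrls and the crawler from the earlier snippets; the 50k figure is just the value suggested above, and the waitForAllRequestsToBeAdded option is my assumption about the crawlee 3.x addRequests API:

// Enqueue the urls in 50k chunks instead of one huge batch.
const CHUNK_SIZE = 50_000;
for (let i = 0; i < allUrls.length; i += CHUNK_SIZE) {
    const chunk = allUrls.slice(i, i + CHUNK_SIZE);
    // Wait for each chunk to be fully added before starting the next one,
    // so the pending additions stay roughly one chunk in size.
    await crawler.addRequests(chunk, { waitForAllRequestsToBeAdded: true });
    console.log(`Enqueued ${Math.min(i + CHUNK_SIZE, allUrls.length)} of ${allUrls.length} urls`);
}
await crawler.run();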
