CheerioCrawler hangs with 12 million URLs
allUrls is a list containing 12 million URLs. I'm trying to load them into CheerioCrawler, but the process hangs at 14 GB of RAM and never even logs requestList.length().
Can anybody help, please?
7 Replies
Changed the code to:
Tested on a smaller batch of 100k URLs, it works perfectly.
With 12M URLs it has been running for 64 minutes now, stuck at 14.9 GB of memory usage (I increased the max Node memory to 32 GB; I have 128 GB available).
I will let it run longer because I still see CPU activity, but it looks like it has hung.
It takes 10 minutes to gather all the URLs... but enqueueing them is a pain that doesn't work at all for the moment.
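For reference, the chunked-enqueue pattern being described could be sketched roughly like this (chunkArray is a hypothetical helper; allUrls and the chunk size are assumptions based on the thread, not the actual pastebin code):

```javascript
// Hypothetical helper: split a huge array into fixed-size chunks
// so they can be enqueued one batch at a time.
function chunkArray(items, chunkSize) {
    const chunks = [];
    for (let i = 0; i < items.length; i += chunkSize) {
        chunks.push(items.slice(i, i + chunkSize));
    }
    return chunks;
}

// Assumed usage with a CheerioCrawler instance (not shown here):
// for (const chunk of chunkArray(allUrls, 100_000)) {
//     await crawler.addRequests(chunk);
// }
```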
foreign-sapphire•3y ago
Are you running the crawler on the Apify platform?
Can you share a link to your run, please?
Also, can you please share the code where you assign the allUrls variable? Maybe there is a memory leak...
Are you getting it from the input?
Running locally, not on Apify.
Just a sec, I'll share the code in a pastebin.
@Oleg V. https://pastebin.com/mvHX1yMa
foreign-sapphire•3y ago
Try to use RequestList instead of await crawler.addRequests(chunk); it represents a big static list of URLs to crawl.
https://crawlee.dev/api/next/core/class/RequestList
I guess the issue is that your chunkSize is way too big and the scraper runs out of memory because of it.
Or you can try to pass your array to crawler.run(), like here:
https://crawlee.dev/docs/next/examples/crawl-multiple-urls
Example:
I have 128 GB of RAM and I allow 64 GB to Node.
I've tried RequestList; it also fails in a similar way.
I will try to put the array in crawler.run()
That one I haven't tried until now.
foreign-sapphire•3y ago
Then try decreasing chunkSize to 50-100k or something like that.
Let's please continue the discussion in one ticket (it's the same issue, right?):
https://discord.com/channels/801163717915574323/1092208304660414597