Node running out of memory
I'm scraping some e-commerce stores in a single project, and after about 30k products Node crashes because it runs out of memory. Raising the amount of memory allocated to Node is not a good solution, as I plan to increase the incoming data by at least 10x. The most obvious solution seems to be to scale horizontally and run a Node instance for each e-commerce store I want to scrape.
However, is there any way to decrease the memory load of Crawlee? I would be happy to use streaming for exporting the datasets, and the dataset items are already persisted to local files.
Playwright, Puppeteer, or Cheerio?
constant-blueOP•3y ago
Jsdom
Not familiar with it, but maybe somehow you need to close/delete the used windows
flat-fuchsia•3y ago
you can lower the maxConcurrency
@ᗜˬᗜ There has to be something in your implementation that is causing these OOM issues; in most cases it's recursion used wrongly, or processing big files through Buffers instead of Streams.
But it is hard to investigate this without source code or log from the run.
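For example, streaming a big file instead of reading it whole into a Buffer keeps memory flat (a generic Node sketch, not from your project; file names are placeholders):
import { createReadStream, createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';

// Streaming copy: memory use stays flat regardless of file size, unlike
// fs.readFile(), which loads the whole file into a single Buffer.
await pipeline(
    createReadStream('big-export.json'),
    createWriteStream('big-export-copy.json'),
);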
constant-blueOP•3y ago
Here's what one of the crawlers looks like.
constant-blueOP•3y ago
And the wrapper class
@ᗜˬᗜ What is the size of the dataset in MB, how many items do you have there, and how much memory did you set for the run? I could check it if you provide me the RunId (feel free to do it in PM).
constant-blueOP•3y ago
Thanks. I'll replicate the error again when I get home and send the data.
are you running it locally or on the platform?
constant-blueOP•3y ago
locally, since I want to have integration with azure
you may log the amount of memory consumed and see where or what is increasing it: https://www.geeksforgeeks.org/node-js-process-memoryusage-method/
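Something like this, for example (a minimal sketch; the interval and formatting are arbitrary):
// Log RSS and heap usage every 30 s to spot what keeps growing during the crawl.
const toMb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
setInterval(() => {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    console.log(`rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB heapTotal=${toMb(heapTotal)}MB`);
}, 30_000);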
constant-blueOP•3y ago
I did some more testing on my existing crawlers, and individually they (most often) do not crash, and if they crash it's not because of heap allocation. For 40k products the node process is consuming less than 1 gig of ram on a particular crawler.
I'll run the crawlers individually in Azure Functions, I suppose, since I would want parallelism and scheduling either way.
But my guess is that it runs out of memory because of huge queues; the scrapers are crawl-intensive and traverse 1000+ links each
1000+ links should be fine. Some of our actors handle 500,000+ pages without issues.
Can you post the error you are getting?
constant-blueOP•3y ago
It reached 5.5 gigs of RAM (out of 32 total) and almost 40% of CPU on an i7 11850H. The system I first got the error on had 16 gigs of ram and crashed faster.
28.5k products collected from 6 websites. The queue has 6,500 files left.
I can send you the storage folder if that gives some insight.
And here are the latest logs from crawlee
@ᗜˬᗜ From my point of view, there are probably some memory leaks in your implementation; the most common causes are keeping Buffers in memory and using recursion.
Are you increasing the memory limit (e.g. with node's --max-old-space-size flag)?
constant-blueOP•3y ago
I am not increasing the max memory (that should work, but only until a certain point), and I'm not working with Buffers either. Anyway, I will move to running each crawler separately.
I'm having a similar issue, I'm scraping a site that has 12 million product pages (extracted from sitemaps) :)).
I increased the max_old_space_size to 64gb; it's still loading them, and it's at 16gb at the moment and rising.
Still failing. I tried dividing the main list of 12 million URLs into multiple lists and adding them sequentially, but still no luck
now I'm stuck with this : terminate called after throwing an instance of 'std::bad_alloc'
fascinating-indigo•3y ago
@NeoNomade So you can't even add the requests? Or where exactly does it fail?
crawler.addRequests() should add requests in batches, but given the number of requests I would recommend using a RequestList - have you tried that? https://crawlee.dev/api/core/class/RequestList
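A minimal sketch of that (the urls array and the handler body are placeholders):
import { CheerioCrawler, RequestList } from 'crawlee';

// urls: string[] collected from the sitemaps (assumed to already exist).
const requestList = await RequestList.open('product-urls', urls);

const crawler = new CheerioCrawler({
    requestList,
    async requestHandler({ request, $ }) {
        // extract data here
    },
});

await crawler.run();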
I can't even add all the requests - it crashes before starting; basically crawler.addRequests is crashing.
I will try with RequestList
I've restarted the process using the RequestList; it downloaded all the URLs and now it should start. It's using 14gb of RAM for the moment, still waiting to see if it manages to start or crashes
it takes around 10 minutes to get all the urls, and it has been running for 20 minutes
I'm watching the process with btop, and it's still doing single threaded activities, so I think it's still processing the huge queue of urls
@Andrey Bykov it's not working, it keeps getting stuck.
After I add the urls to the request list, the process remains stuck.
I've put that console log and I don't see it
fascinating-indigo•3y ago
How long were you waiting? Given you have millions of URLs - it would still take some time to create and save the urls to the disk
Hmm, I see your response in a different thread. RequestList (when you initialize it) saves a 'dump' of all URLs to the disk, as a buffer if I remember correctly. But I am not sure whether it's ok or not that it has GBs of data...
4 hours
now I'm using a for loop that iterates over the big list and calls addRequests for each URL
now, watching the logs, at least I see that it's working
I've added logs for how many were added and how many are left.
in 5 minutes it only added 170k
now I've changed the approach to creating chunks from the big list, because adding them 1 by 1 looks like it will take a few hours
now it crashed at around 16gb of RAM, even though I used max_old_space_size=35000
fascinating-indigo•3y ago
I passed it to the team. On one hand it's not the most trivial use-case, on the other hand, it should be able to handle millions of URLs.
last time I started the script in pm2, changed the pm2 restart threshold to 32gb of RAM, and also used max_old_space_size=350000, but I still get bad_alloc at around 16gb of RAM used.
fascinating-indigo•3y ago
I got some news. So - this is expected behavior with memory-storage, which is used by default with crawlee now (just due to the number of URLs). For your use-case you need to use the local storage. To set it: https://crawlee.dev/api/core/interface/ConfigurationOptions#storageClient
Should be somewhat like:
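Something along these lines (a rough sketch; ApifyStorageLocal is the client exported by @apify/storage-local):
import { Configuration } from 'crawlee';
import { ApifyStorageLocal } from '@apify/storage-local';

// Swap the default in-memory storage for the SQLite-backed local client,
// so the request queue doesn't have to fit in RAM.
Configuration.getGlobalConfig().set('storageClient', new ApifyStorageLocal());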
Could you give it a try and let me know whether it helped or not?
yes, I'll add this to my main and let you know
@Andrey Bykov Cannot find package '@apify/storage-local' imported from /my_project/main.js
fascinating-indigo•3y ago
I mean - you have to npm install @apify/storage-local
https://www.npmjs.com/package/@apify/storage-local
https://github.com/apify/apify-storage-local-js
this is very interesting, somehow I think in the future Redis could be involved as a storage client
for large&fast queues
started the process
it adds requests to the queue in batches of 1 million.
it takes around 10-15 minutes to gather the urls
@Andrey Bykov
I've added a 5 second delay between the batches, it helped in the past; I was left with only 3 million to add.
with 5 second delay between the batches
I will try to raise the delay more, to see if it helps
Raised the delay but stuck in the same place as above
fascinating-indigo•3y ago
wait,
Starting to add 12011440 urls to the RequestList
- you're not actually adding those to the list, you're adding them to the queue, right? RequestList can only be initialized once...
If I am right and you're adding these to the queue - how exactly are you doing it? crawler.addRequests() enqueues the first batch of 1000, and then continues in the background... Maybe try adding in batches, but instead of waiting for a few seconds, use the following option - https://crawlee.dev/api/basic-crawler/interface/CrawlerAddRequestsOptions#waitForAllRequestsToBeAdded - this will increase the runtime without doubt, but it won't have those 10 calls running in the background...
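i.e. something like (sketch; chunk is whatever batch of URLs you're adding):
await crawler.addRequests(chunk, { waitForAllRequestsToBeAdded: true });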
@Andrey Bykov yes, to the queue - sorry, forgot to change the log; I tried the RequestList too.
https://pastebin.com/mvHX1yMa here is the current status of the code.
tried this option, it's suuuper slow. CPU usage dropped to between 10-20 percent; watching the RAM usage in pm2, it seems to be stuck at 4gb...
fascinating-indigo•3y ago
Hmm, another idea - basically with crawler.addRequests(), as mentioned, it starts with 1000 requests, starts the crawl, and then adds the rest. What you can do is just await crawler.run(chunk). Basically keep it in a way that the first chunk goes in with 1 million URLs and the crawler runs. Then the run promise resolves, you call crawler.run() with the second batch, etc. It will eventually use the same queue, same crawler, etc., but instead of adding all 12 million URLs in chunks and then starting the crawler, you add a chunk, process it, add a chunk, process it, etc.
Yes, but I'm writing to a csv file in my routes.js.
If I do it like that it will probably overwrite it each time
@Oleg V. so should I try the same thing as in my code but with smaller chunks, or add the entire chunk to crawler.run()?
conscious-sapphire•3y ago
Try both, I guess)
I would try to add the entire chunk to crawler.run() first.
If no success > try smaller chunks.
ok
putting the entire chunk to crawler.run() and will update with the results here.
should I also keep the local storage ?
it takes around 10-15 minutes to collect all the urls
there are 481 sitemaps to decompress and add to the list :))
std::bad_alloc
Trying the old code with chunks of 100k URLs
2 million left to add
arghhhhh std::bad_alloc with only 1.8 million left
should I lower the batch even more @Oleg V. ?
conscious-sapphire•3y ago
Yes, Let's try. As we can see, it's much better with smaller batches.
60k urls in a batch
crashed in the same place
hmm seems like we hit a limit here
don't know exactly how to tackle this limit...
fascinating-indigo•3y ago
Hmm, what exactly is written to CSV?
two columns:
one column is the URL, the other column is the raw HTML encoded in base64.
Working with super huge domains, I moved the parsing to offline scripts.
Because any mistake in parsing can cost me days of scraping.
Which I can't afford in this project.
fascinating-indigo•3y ago
I still don't get it. Basically each request is a separate call (which means a separate write to CSV). So if you add them the way I proposed yesterday, it should not really break things; it basically just feeds another batch of URLs to the crawler (unless I am missing something)
with crawler.addRequests() it fails at the same place because you are overloading the queue, trying to push the new requests in parallel from 12 calls, which continue to work in the background, and eventually the memory gets overloaded
here is my routes.js
fascinating-indigo•3y ago
I don't know how csvWriter works exactly, but as I mentioned, every URL is basically a separate csvWriter.writeRecords call. createObjectCsvWriter is still called once - even if you process the first batch, then have another crawler.run(), etc., you are still using the same instance of csvWriter. And btw, about CSV - you could use the default Actor.pushData() - it writes separate JSONs, and once all requests are finished you could do exportToCSV: https://crawlee.dev/api/core/class/Dataset#exportToCSV
writing separate jsons and doing the csv export afterwards will require too much space.
the last time I scraped this domain, with the method above, the csv produced for 3 million products was 1tb.
the issue with your idea is that if I put crawler.run() in a for loop, and the crawler restarts on each iteration, the file will be overwritten.
the createObjectCsvWriter also initiates the file
fascinating-indigo•3y ago
oh wow, ok, not an option apparently 😄 but still - if you call the script once and just have several crawler.run() calls, it should still use the same csvWriter instance - so it would not initiate the file
if it did that, it would do it for each request
You're still running the same app/script. csvWriter is basically created in the global context once. Then the crawler instance is created once. And then you just feed in the URLs
so your idea is something like this ?
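Something like this, maybe (a rough sketch; the chunks array, header fields and file name are placeholders):
import { CheerioCrawler } from 'crawlee';
import { createObjectCsvWriter } from 'csv-writer';

// Created once in the global context, so the output file is only initialized once.
const csvWriter = createObjectCsvWriter({
    path: 'products.csv',
    header: [{ id: 'url', title: 'url' }, { id: 'html', title: 'html' }],
});

const crawler = new CheerioCrawler({
    async requestHandler({ request, body }) {
        await csvWriter.writeRecords([
            { url: request.url, html: Buffer.from(body).toString('base64') },
        ]);
    },
});

// chunks: string[][] - the full URL list split into batches (assumed to exist).
for (const chunk of chunks) {
    await crawler.run(chunk); // same crawler and queue, fed one chunk at a time
}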
fascinating-indigo•3y ago
yep
you could try (just to test it) with like a chunk of 100 to confirm the file will still be in place
ok, changing the chunksize to 100 and starting to see if it works
fascinating-indigo•3y ago
So basically this way each crawler.run call will add 1000 requests to the queue and start processing them, while adding the rest in the background. Once the chunk is processed, the promise resolves and it goes for another cycle...
I've also reduced the total number of URLs because it takes too long to get all of them
now this looks a bit strange :)) it already scraped 400 items
this somehow means that they get added to the queue in the background and the main process keeps running
fascinating-indigo•3y ago
Well - that's how it's supposed to work: add 1000, start scraping, and the rest is added in the background. And btw I guess with this scenario you could try to ditch the local storage again
let's see
I will put the entire queue of urls also
and let it run if it works
fascinating-indigo•3y ago
fingers crossed, hopefully it will finally work as expected 🙂
fingers crossed ! thanks a lot for all your help !
really appreciate it
started the monster
constant-blueOP•3y ago
@NeoNomade by the way, what node version do you use?
18.15.0
@Andrey Bykov
I think I have to keep the local storage :))
fascinating-indigo•3y ago
Frankly - I don't even come close to understanding why this error is here. It's about malformed JSON in the queue, but I don't know where exactly it comes from. So yeah - if local storage works for you, just keep it 👍
sometimes it happens randomly
I've been using crawlee quite intensely for quite some time, and I just bump into it occasionally; I restart the scraper and it works.
I've restarted it now it works
it started to scrape
it started super fast, with around 500 products per minute, but now it goes like 100-150 products per minute.
I think at this point the scraping is faster than the queue :))
wise-white•3y ago
I'm having similar memory problems as @NeoNomade but using a Puppeteer crawler with ~2 million URLs. I'm getting different errors, but they all seem plausibly caused by overloaded memory, and I'm also getting memory usage warnings from Crawlee, so I suspect that's the issue. I'm new to Crawlee - how can I divide up my requestQueue and load it in batches the way @NeoNomade has? The very basic existing code I have is below.
const requestQueue = await RequestQueue.open('my-saved-request-que-file');
crawler.requestQueue = requestQueue;
log.info("starting crawler");
await crawler.run();
In my case, there's userData in each json file in the folder where the request_queue data is saved. An example of one of the json files being:
{
"id": "CUXKCDUF0o4ot1O",
"json": {"id":"CUXKCDUF0o4ot1O","url":"http://pendletonartcenter.com","uniqueKey":"http://pendletonartcenter.com","method":"GET","noRetry":false,"retryCount":0,"errorMessages":[],"headers":{},"userData":{"id":"14958955938","this_db_key":"master - PQR","page_type":"home","domain":"pendletonartcenter.com"}},
"method": "GET",
"orderNo": 1683159099449,
"retryCount": 0,
"uniqueKey": "http://pendletonartcenter.com",
"url": "http://pendletonartcenter.com",
}
wise-white•3y ago
Maybe if I rearranged the contents of the requests_queue folder into a bunch of folders that each hold a certain-sized batch of requests (from rearranging the one massive folder where all the json files currently sit), I could then loop over those folders, run await RequestQueue.open('smaller-folder-of-request-queue-json-files'); wait for the crawler to finish with those using some kind of await statement that waits until the crawler is done (or just brute-force wait a certain number of seconds that I estimate it will take the crawler to process those requests), then keep going in that loop until it's done - roughly like the sketch below? This run would take ~30+ days 😅 if it ran at the speed it's currently running at (takes ~1.3 seconds per URL even when running 17-25 browser tabs in parallel).
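Roughly this kind of loop, I mean (just a sketch of the idea; the batch queue names are made up):
import { RequestQueue } from 'crawlee';

// Open one named RequestQueue per pre-built batch and drain it before moving on.
for (const batchName of ['batch-001', 'batch-002' /* ... */]) {
    crawler.requestQueue = await RequestQueue.open(batchName);
    await crawler.run(); // resolves once this batch has been processed
}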
Ok, rewrote it to put 200 urls into a requestQueue, then set crawler.requestQueue to the saved requestQueue value, then call crawler.run(). It's surprising to me that within the first 31 pages visited it gives this warning:
2023-05-04 18:42:34.910 INFO master - PQR:PuppeteerCrawler: Status message: Crawled 31 pages, 0 errors.
2023-05-04 18:42:35.057 INFO master - PQR:PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":12,"desiredConcurrency":13,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.16},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-04 18:42:35.061 WARN master - PQR:PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 7553 MB of 8025 MB (94%). Consider increasing available memory.
Ok, after some testing, my current theory is that once I add process.env.CRAWLEE_MEMORY_MBYTES = "30720"; to my main.ts file, my 8 vCPU machine's CPU becomes the bottleneck: it scales to ~36 browser tabs, at which point the CPU is overloaded, so the memory issues inherent in the current RequestQueue implementation may not be an issue for me anymore. It will take 24 hours or so of my scripts running to determine if I'm right, but I'll try to remember to report back.
36 browser tabs is a lot for 8 vCPUs.
try to use incognito windows.
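e.g. (sketch; assuming Crawlee's useIncognitoPages launch option):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        useIncognitoPages: true, // one incognito browser context per page, released when the page closes
    },
    async requestHandler({ page }) {
        // ...
    },
});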
wise-white•3y ago
Interesting, didn't think incognito windows would make things any better. Any reason why that would help?
I tried many things today, changing the args (at least the ones I can safely use; in my case it's not safe to use no-sandbox), and got a version working that uses a fork of chrome-aws-lambda. I'm entirely new to Node.js though, I'm a Python guy, so I'm not sure how to optimize anything around the event loop, and it seems to be the limiter at the moment:
{"currentConcurrency":10,"desiredConcurrency":15,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.6,"actualRatio":0.724},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
Hmm, despite changing:
process.env.CRAWLEE_AVAILABLE_MEMORY_RATIO = "0.7";
I'm still seeing memInfo... limitRatio 0.2... should I be specifying this somewhere else?
Hmmm, still haven't gotten through this issue it seems. Running 20k urls at a time and it's crashing with the error:
And the previous state reported was:
PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":16,"desiredConcurrency":17,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.556},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
So maybe it's not a memory issue after all? I'll try dropping things down from 20k to 5k and report back, but I need to get this solved.
Could be a memory issue - your tabs may be closed by the browser not having enough resources - or it could be an issue with the code: you may be using an old page reference after navigating to a different URL, where it no longer exists.