Node running out of memory

I'm scraping several e-commerce stores in a single project, and after about 30k products Node crashes because it runs out of memory. Raising the amount of memory allocated to Node is not a good solution, as I plan to increase the incoming data by at least 10x. The most obvious solution seems to be scaling horizontally and running a Node instance for each e-commerce store I want to scrape. However, is there any way to decrease the memory load that Crawlee uses? I would be happy to use streaming for exporting the datasets, and the dataset items are already persisted to local files.
69 Replies
NeoNomade
NeoNomade3y ago
Playwright, Puppeteer, or Cheerio?
constant-blue
constant-blueOP3y ago
JSDOM
NeoNomade
NeoNomade3y ago
Not familiar with it, but you may need to close/dispose of the used windows somehow.
flat-fuchsia
flat-fuchsia3y ago
You can lower the maxConcurrency.
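For reference, a minimal sketch of capping concurrency on a Crawlee crawler (JSDOMCrawler assumed, since that is what the OP is using; the option values are illustrative):
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Fewer parallel requests means fewer response bodies and DOM instances held in memory at once.
    maxConcurrency: 10,
    // Optional extra throttle on request rate.
    maxRequestsPerMinute: 120,
    async requestHandler({ request, window }) {
        // ...extract data from window.document here...
    },
});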
Pepa J
Pepa J3y ago
@ᗜˬᗜ There has to be something in your implementation causing these OOM issues; in most cases it is incorrect use of recursion, or processing big files through Buffers instead of Streams. But it is hard to investigate without the source code or a log from the run.
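For context, the Buffer-vs-Stream difference looks roughly like this in plain Node.js (the file name and record handling are illustrative, not from the OP's code):
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Buffer approach: fs.readFile loads the whole file into memory at once.
// Streaming approach: process one line at a time, keeping memory usage flat.
const lines = createInterface({ input: createReadStream('huge-export.ndjson') });
for await (const line of lines) {
    // handle one record here
}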
constant-blue
constant-blueOP3y ago
Here's what one of the crawlers looks like.
constant-blue
constant-blueOP3y ago
And the wrapper class
Pepa J
Pepa J3y ago
@ᗜˬᗜ What is the size of the dataset in MB, how many items do you have there, and how much memory did you set for the run? I could check it if you provide the RunId (feel free to do it in a PM).
constant-blue
constant-blueOP3y ago
Thanks. I'll replicate the error again when I get home and send the data.
Pepa J
Pepa J3y ago
Are you running it locally or on the platform?
constant-blue
constant-blueOP3y ago
Locally, since I want to integrate with Azure.
Pepa J
Pepa J3y ago
You may log the amount of memory consumed and see where or what is increasing it: https://www.geeksforgeeks.org/node-js-process-memoryusage-method/
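A minimal sketch of such periodic logging with process.memoryUsage() (the interval and formatting are illustrative):
const toMB = (bytes) => Math.round(bytes / 1024 / 1024);
setInterval(() => {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    console.log(`rss=${toMB(rss)} MB, heapUsed=${toMB(heapUsed)} MB, heapTotal=${toMB(heapTotal)} MB`);
}, 30_000); // log every 30 seconds while the crawler runs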
constant-blue
constant-blueOP3y ago
I did some more testing on my existing crawlers, and individually they (most often) do not crash, and when they do crash it's not because of heap allocation. For 40k products the Node process consumes less than 1 GB of RAM on a particular crawler. I'll run the crawlers individually in Azure Functions, I suppose, since I want parallelism and scheduling either way. My guess is that it runs out of memory because of the huge queues; the scrapers are crawl-intensive and traverse 1000+ links each.
Pepa J
Pepa J3y ago
1000+ links should be fine. Some of our Actors handle 500,000+ pages without issues. Can you post the error you are getting?
constant-blue
constant-blueOP3y ago
<--- Last few GCs --->

[4020:000001EAE77572B0] 978439 ms: Mark-Compact 4034.1 (4137.2) -> 4020.2 (4139.7) MB, 1516.3 / 0.6 ms (average mu = 0.177, current mu = 0.087) allocation failure; scavenge might not succeed
[4020:000001EAE77572B0] 981284 ms: Mark-Compact 4036.4 (4139.9) -> 4024.9 (4144.2) MB, 2790.5 / 0.0 ms (average mu = 0.080, current mu = 0.019) allocation failure; scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 00007FF7CE58234F node_api_throw_syntax_error+179983
2: 00007FF7CE506986 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+61942
3: 00007FF7CE508693 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+69379
4: 00007FF7CF046411 v8::Isolate::ReportExternalAllocationLimitReached+65
5: 00007FF7CF031066 v8::internal::V8::FatalProcessOutOfMemory+662
6: 00007FF7CEE97770 v8::internal::EmbedderStackStateScope::ExplicitScopeForTesting+144
7: 00007FF7CEE9478D v8::internal::Heap::CollectGarbage+4749
8: 00007FF7CEEAA3B6 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath+2150
9: 00007FF7CEEAACEF v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath+95
10: 00007FF7CEEB9F10 v8::internal::Factory::NewFillerObject+448
11: 00007FF7CEB6F835 v8::internal::Runtime::SetObjectProperty+20997
12: 00007FF7CF0EFA61 v8::internal::SetupIsolateDelegate::SetupHeap+606705
13: 00007FF7CF13FE93 v8::internal::SetupIsolateDelegate::SetupHeap+935459
14: 00007FF74F65079A
 ELIFECYCLE  Command failed with exit code 134.
 ELIFECYCLE  Command failed with exit code 1.
It reached 5.5 GB of RAM (out of 32 total) and almost 40% CPU on an i7-11850H. The system I first got the error on had 16 GB of RAM and crashed faster. 28.5k products collected from 6 websites, and the queue has 6,500 files left. I can send you the storage folder if that gives some insight. Here are the latest logs from Crawlee:
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":16758,"requestsFinishedPerMinute":20,"requestsFailedPerMinute":0,"requestTotalDurationMillis":5161454,"requestsTotal":308,"crawlerRuntimeMillis":906972,"retryHistogram":[308]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":22207,"requestsFinishedPerMinute":37,"requestsFailedPerMinute":0,"requestTotalDurationMillis":12258059,"requestsTotal":552,"crawlerRuntimeMillis":906968,"retryHistogram":[547,5]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":21397,"requestsFinishedPerMinute":34,"requestsFailedPerMinute":0,"requestTotalDurationMillis":11105217,"requestsTotal":519,"crawlerRuntimeMillis":906973,"retryHistogram":[519]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":7455,"requestsFinishedPerMinute":44,"requestsFailedPerMinute":0,"requestTotalDurationMillis":4964858,"requestsTotal":666,"crawlerRuntimeMillis":903320,"retryHistogram":[662,4]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":13541,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2613361,"requestsTotal":193,"crawlerRuntimeMillis":903295,"retryHistogram":[193]}
Pepa J
Pepa J3y ago
@ᗜˬᗜ From my point of view, there are probably some memory leaks in your implementation; the most common causes are keeping buffers in memory and using recursion. Are you increasing the memory limit with:
export NODE_OPTIONS=--max_old_space_size=16384
?
constant-blue
constant-blueOP3y ago
I am not increasing the max memory (that would work, but only up to a point), and I am not working with buffers either. Anyway, I will move to running each crawler separately.
NeoNomade
NeoNomade3y ago
I'm having a similar issue: I'm scraping a site that has 12 million product pages (extracted from sitemaps) :)). I increased max_old_space_size to 64 GB; it's still loading the URLs and is at 16 GB at the moment and rising. Still failing. I tried dividing the main list of 12 million URLs into multiple lists and adding them sequentially, but still no luck. Now I'm stuck with this: terminate called after throwing an instance of 'std::bad_alloc'
fascinating-indigo
fascinating-indigo3y ago
@NeoNomade So you can't even add the requests? Or where exactly does it fail? crawler.addRequests() should add requests in batches, but given the number of requests I would recommend using a RequestList - have you tried that? https://crawlee.dev/api/core/class/RequestList
NeoNomade
NeoNomade3y ago
I can't even add all the requests; it crashes before starting - basically crawler.addRequests() is crashing. I will try with RequestList. I've restarted the process using the RequestList; it downloaded all the URLs and should start now. It's using 14 GB of RAM for the moment, and I'm still waiting to see whether it manages to start or crashes. It takes around 10 minutes to get all the URLs, and it has been running for 20 minutes. I'm watching the process with btop, and it's still doing single-threaded work, so I think it's still processing the huge queue of URLs. @Andrey Bykov it's not working, it keeps getting stuck. After I add the URLs to the request list, the process remains stuck.
NeoNomade
NeoNomade3y ago
const requestList = await RequestList.open(null, allUrls, {
    // Persist the state to avoid re-crawling which can lead to data duplications.
    // Keep in mind that the sources have to be immutable or this will throw an error.
    persistStateKey: 'My-ReqList',
});
console.log(requestList.length());

const crawler = new CheerioCrawler({
    requestList,
    proxyConfiguration,
    requestHandler: router,
    minConcurrency: 32,
    maxConcurrency: 256,
    maxRequestRetries: 20,
    navigationTimeoutSecs: 6,
    loggingInterval: 30,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 20 times.`);
    },
});
await crawler.run();
I've added that console.log and I don't see its output.
fascinating-indigo
fascinating-indigo3y ago
How long were you waiting? Given you have millions of URLs, it would still take some time to create and save them to disk. Hmm, I see your response in a different thread. RequestList (when you initialize it) saves a 'dump' of all URLs to disk, as a buffer if I remember correctly. But I am not sure whether it's OK for it to hold GBs of data...
NeoNomade
NeoNomade3y ago
4 hours. Now I'm using a for loop that iterates over the big list and calls addRequests for each URL. Watching the logs, at least I see that it is working - I've added logs for how many were added and how many are left. In 5 minutes it only added 170k, so I've changed the approach to creating chunks from the big list, because adding them one by one looks like it will take a few hours. Now it crashed at around 16 GB of RAM, even though I used max_old_space_size=35000.
fascinating-indigo
fascinating-indigo3y ago
I passed it to the team. On one hand it's not the most trivial use case; on the other hand, it should be able to handle millions of URLs.
NeoNomade
NeoNomade3y ago
Last time I started the script in pm2, changed the pm2 restart threshold to 32 GB of RAM, and also used max_old_space_size=350000, but I still get bad_alloc at around 16 GB of RAM used.
fascinating-indigo
fascinating-indigo3y ago
I got some news. So - this is expected behavior with memory-storage, which is used by default with Crawlee now (just due to the number of URLs). For your use case you need to use the local storage. To set it: https://crawlee.dev/api/core/interface/ConfigurationOptions#storageClient It should be something like:
import { Configuration } from 'crawlee';
import { ApifyStorageLocal } from '@apify/storage-local';

const storageLocal = new ApifyStorageLocal();
Configuration.getGlobalConfig().set('storageClient', storageLocal);
Could you give it a try and let me know whether it helped or not?
NeoNomade
NeoNomade3y ago
Yes, I'll add this to my main and let you know. @Andrey Bykov Cannot find package '@apify/storage-local' imported from /my_project/main.js
fascinating-indigo
fascinating-indigo3y ago
NeoNomade
NeoNomade3y ago
This is very interesting; somehow I think in the future Redis could be involved as a storage client for large & fast queues. I started the process; it adds requests to the queue in batches of 1 million. It takes around 10-15 minutes to gather the URLs.
Starting to add 12011440 urls to the queue
Added 1000000 to the queue. 11011440 left.

<--- Last few GCs --->

[866838:0x55d350008610] 702393 ms: Mark-sweep 3999.3 (4138.1) -> 3986.6 (4141.3) MB, 2922.0 / 0.0 ms (average mu = 0.224, current mu = 0.029) allocation failure; scavenge might not succeed
[866838:0x55d350008610] 706547 ms: Mark-sweep 4002.7 (4141.3) -> 3989.8 (4144.6) MB, 4061.1 / 0.0 ms (average mu = 0.116, current mu = 0.022) allocation failure; scavenge might not succeed
@Andrey Bykov I've added a 5-second delay between the batches; it helped in the past - I was left with only 3 million to add. With the 5-second delay between batches:
main > Starting to add 12011440 urls to the RequestList
Added 1000000 to the queue. 11011440 left.
Added 1000000 to the queue. 10011440 left.
Added 1000000 to the queue. 9011440 left.
Added 1000000 to the queue. 8011440 left.
Added 1000000 to the queue. 7011440 left.
Added 1000000 to the queue. 6011440 left.
Added 1000000 to the queue. 5011440 left.
Added 1000000 to the queue. 4011440 left.
Added 1000000 to the queue. 3011440 left.
Added 1000000 to the queue. 2011440 left.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
I will try to raise the delay more to see if it helps. Raised the delay, but it's stuck in the same place as above.
fascinating-indigo
fascinating-indigo3y ago
Wait - "Starting to add 12011440 urls to the RequestList" - you're not actually adding those to the list, you add them to the queue, right? A RequestList can only be initialized once... If I am right and you're adding these to the queue - how exactly are you doing it? crawler.addRequests() enqueues the first batch of 1000 and then continues in the background... Maybe try adding in batches, but instead of waiting for a few seconds, use the following option: https://crawlee.dev/api/basic-crawler/interface/CrawlerAddRequestsOptions#waitForAllRequestsToBeAdded - this will no doubt increase the runtime, but it won't have all those calls working in the background...
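A sketch of that suggestion - adding in chunks and waiting for each batch to be fully enqueued before the next one (chunkSize is illustrative; allUrls and crawler are assumed from the earlier snippets):
const chunkSize = 100_000;
for (let i = 0; i < allUrls.length; i += chunkSize) {
    const chunk = allUrls.slice(i, i + chunkSize);
    // Resolves only after the whole chunk is in the queue, so batches don't pile up in the background.
    await crawler.addRequests(chunk, { waitForAllRequestsToBeAdded: true });
}
await crawler.run();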
NeoNomade
NeoNomade3y ago
@Andrey Bykov Yes, to the queue - sorry, I forgot to change the log; I tried the RequestList too. https://pastebin.com/mvHX1yMa here is the current state of the code.
NeoNomade
NeoNomade3y ago
Tried this option; it's suuuuper slow. CPU usage dropped to between 10-20 percent, and watching the RAM usage in pm2, it seems to be stuck at 4 GB...
fascinating-indigo
fascinating-indigo3y ago
Hmm, another idea - basically with crawler.addRequests(), as mentioned, it starts with 1000 requests, starts the crawl, and then adds the rest. What you can do is just await crawler.run(chunk). Basically, the first chunk goes in with 1 million URLs and the crawler runs. Then the run promise resolves, you call crawler.run() with the second batch, etc. It will eventually use the same queue, same crawler, etc., but instead of adding all 12 million URLs in chunks and then starting the crawler, you add a chunk and process it, add a chunk and process it, etc.
NeoNomade
NeoNomade3y ago
Yes, but I'm writing to a CSV file in my routes.js. If I do it like that it will probably overwrite the file each time. @Oleg V. So should I try the same thing as in my code but with smaller chunks, or add the entire chunk to crawler.run()?
conscious-sapphire
conscious-sapphire3y ago
Try both, I guess :) I would try adding the entire chunk to crawler.run() first. If no success, try smaller chunks.
NeoNomade
NeoNomade3y ago
ok
NeoNomade
NeoNomade3y ago
Putting the entire chunk into crawler.run() and will update with the results here. Should I also keep the local storage? It takes around 10-15 minutes to collect all the URLs - there are 481 sitemaps to decompress and add to the list :)) std::bad_alloc. Trying the old code with chunks of 100k URLs... 2 million left to add... arghhhh, std::bad_alloc with only 1.8 million left. Should I lower the batch size even more @Oleg V.?
conscious-sapphire
conscious-sapphire3y ago
Yes, let's try. As we can see, it's much better with smaller batches.
NeoNomade
NeoNomade3y ago
60k URLs in a batch crashed in the same place. Hmm, seems like we hit a limit here; I don't know exactly how to tackle it...
fascinating-indigo
fascinating-indigo3y ago
Hmm, what exactly is written to CSV?
NeoNomade
NeoNomade3y ago
Two columns: one column is the URL, the other is the raw HTML encoded in base64. Working with super-huge domains, I moved the parsing to offline scripts, because any mistake in parsing can cost me days of scraping, which I can't afford in this project.
fascinating-indigo
fascinating-indigo3y ago
I still don't get it. Basically, each request is a separate call (which means a separate write to the CSV). So if you add requests the way I proposed yesterday, it should not really break things; it basically just feeds another batch of URLs to the crawler (unless I am missing something). With crawler.addRequests() it fails at the same place because you are overloading the queue - you are trying to push the new requests in parallel from 12 calls, which continue to work in the background, and eventually the memory gets overloaded.
NeoNomade
NeoNomade3y ago
import { Dataset, createCheerioRouter, EnqueueStrategy } from 'crawlee';
import { createObjectCsvWriter } from 'csv-writer';

global.productsCount = 0;
global.visitedUrls = new Set();

const csvWriter = createObjectCsvWriter({
    path: 'file.csv',
    header: [
        { id: 'url', title: 'url' },
        { id: 'html', title: 'html' },
    ],
});

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, request, body, $ }) => {
    log.debug(`${$('title').toString()} || scraped`);
    const exportDict = {
        url: request.url,
        html: Buffer.from(body, 'utf-8').toString('base64'),
    };
    await csvWriter.writeRecords([exportDict]);
    // await Dataset.pushData();
    global.productsCount++;
    log.info(`${global.productsCount} products scraped`);
    $ = null;
    body = null;
});
Here is my routes.js.
fascinating-indigo
fascinating-indigo3y ago
I don't know exactly how csv-writer works, but as I mentioned, every URL is basically a separate csvWriter.writeRecords() call. createObjectCsvWriter is still called once - even if you process the first batch and then have another crawler.run(), etc., you are still using the same instance of csvWriter. Also, btw, about CSV - you could use the default Actor.pushData() - it writes separate JSONs, and once all requests are finished you could call exportToCSV: https://crawlee.dev/api/core/class/Dataset#exportToCSV
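A rough sketch of that flow using Crawlee's Dataset helpers (the exact call placement is up to you; request and encodedHtml are illustrative names, not from the code above):
import { Dataset } from 'crawlee';

// Inside the request handler: one JSON item per scraped page.
await Dataset.pushData({ url: request.url, html: encodedHtml });

// After crawler.run() resolves: export the default dataset as a single CSV
// into the default key-value store (see the exportToCSV docs linked above).
await Dataset.exportToCSV('OUTPUT');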
NeoNomade
NeoNomade3y ago
Writing separate JSONs and doing the CSV export afterwards will require too much space - last time I scraped this domain with the method above, the produced CSV was 1 TB for 3 million products. The issue with your idea is that if I put crawler.run() in a for loop, and the crawler restarts on each iteration, the file will be overwritten - createObjectCsvWriter also initializes the file.
fascinating-indigo
fascinating-indigo3y ago
Oh wow, OK, not an option apparently 😄 But still - if you call the script once and just have several crawler.run() calls, it should still use the same csvWriter instance, so it would not re-initialize the file (if it did, it would do it for each request). You're still running the same app/script: csvWriter is created in the global context once, then the crawler instance is created once, and then you just feed in the URLs.
NeoNomade
NeoNomade3y ago
So your idea is something like this?
const chunkSize = 60000;
for (let i = 0; i < allUrls.length; i += chunkSize) {
    const chunk = allUrls.slice(i, i + chunkSize);
    await crawler.run(chunk);
    console.log(`Added ${chunk.length} to the queue. ${totalCount -= chunk.length} left.`);
}
fascinating-indigo
fascinating-indigo3y ago
Yep. You could try it (just to test) with a chunk of, say, 100 to confirm the file stays in place.
NeoNomade
NeoNomade3y ago
OK, changing the chunkSize to 100 and starting it to see if it works.
fascinating-indigo
fascinating-indigo3y ago
So basically this way each crawler.run() call will add 1000 requests to the queue and start processing them, while adding the rest in the background. Once the chunk is processed, the promise resolves and it moves on to the next cycle...
NeoNomade
NeoNomade3y ago
I've also reduced the total number of URLs because it takes too long to get all of them. Now this looks a bit strange :)) It has already scraped 400 items, which somehow means they get added to the queue in the background while the main process keeps running.
fascinating-indigo
fascinating-indigo3y ago
Well, that's how it's supposed to work - add 1000, start scraping, and the rest is added in the background. And btw, I guess with this scenario you could try to ditch the local storage again.
NeoNomade
NeoNomade3y ago
Let's see. I will also put in the entire queue of URLs and let it run if it works.
fascinating-indigo
fascinating-indigo3y ago
fingers crossed, hopefully it will finally work as expected 🙂
NeoNomade
NeoNomade3y ago
Fingers crossed! Thanks a lot for all your help, really appreciate it. Started the monster.
constant-blue
constant-blueOP3y ago
@NeoNomade by the way, what node version do you use?
NeoNomade
NeoNomade3y ago
18.15.0 @Andrey Bykov
0|main | SyntaxError: Unexpected end of JSON input
0|main | at JSON.parse (<anonymous>)
0|main | at RequestQueueFileSystemEntry.get (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/memory-storage/fs/request-queue/fs.js:28:25)
0|main | at async RequestQueueClient.listHead (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/memory-storage/resource-clients/request-queue.js:147:29)
0|main | at async RequestQueue._ensureHeadIsNonEmpty (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/core/storages/request_queue.js:610:101)
0|main | at async RequestQueue.isEmpty (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/core/storages/request_queue.js:526:9)
0|main | at async CheerioCrawler._isTaskReadyFunction (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/basic/internals/basic-crawler.js:762:38)
0|main | at async AutoscaledPool._maybeRunTask (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/core/autoscaling/autoscaled_pool.js:481:27)
I think I have to keep the local storage :))
fascinating-indigo
fascinating-indigo3y ago
Frankly, I don't even come close to understanding why this error is here. It's about malformed JSON in the queue, but I don't know exactly where it comes from. So yeah - if local storage works for you, just keep it 👍
NeoNomade
NeoNomade3y ago
Sometimes it happens randomly. I've been using Crawlee quite intensely for quite some time, and I just bump into it occasionally; restart the scraper and it works. I've restarted it and now it works - it started to scrape. It started super fast, around 500 products per minute, but now it does about 100-150 products per minute. I think at this point the scraping is faster than the queue :))
wise-white
wise-white3y ago
I'm having similar memory problems to @NeoNomade, but using a Puppeteer crawler with ~2 million URLs. I'm getting different errors, but they all seem plausibly caused by overloaded memory, and I'm also getting memory usage warnings from Crawlee, so I suspect that's the issue. I'm new to Crawlee - how can I divide up my requestQueue and load it in batches the way @NeoNomade has? The very basic existing code I have is below.
const requestQueue = await RequestQueue.open('my-saved-request-que-file');
crawler.requestQueue = requestQueue;
log.info("starting crawler");
await crawler.run();
In my case, there's userData in each JSON file in the folder where the request_queue data is saved. An example of one of the JSON files:
{
  "id": "CUXKCDUF0o4ot1O",
  "json": {"id":"CUXKCDUF0o4ot1O","url":"http://pendletonartcenter.com","uniqueKey":"http://pendletonartcenter.com","method":"GET","noRetry":false,"retryCount":0,"errorMessages":[],"headers":{},"userData":{"id":"14958955938","this_db_key":"master - PQR","page_type":"home","domain":"pendletonartcenter.com"}},
  "method": "GET",
  "orderNo": 1683159099449,
  "retryCount": 0,
  "uniqueKey": "http://pendletonartcenter.com",
  "url": "http://pendletonartcenter.com"
}
wise-white
wise-white3y ago
Maybe if I rearranged the contents of the requests_queue folder into a bunch of folders that each hold a certain-sized batch of requests (splitting up the one massive folder where all the JSON files currently sit), I could then loop over those folders, run await RequestQueue.open('smaller-folder-of-request-queue-json-files'), wait for the crawler to finish with those using some kind of await (or just brute-force wait the number of seconds I estimate the crawler needs to process them), and keep going in that loop until it's done? This run would take ~30+ days 😅 at the speed it's currently running (~1.3 seconds per URL, even when running 17-25 browser tabs in parallel).
OK, I rewrote it to put 200 URLs into a requestQueue, set crawler.requestQueue to that saved queue, then call crawler.run(). It's surprising to me that within the first 31 pages it gives this warning:
2023-05-04 18:42:34.910 INFO master - PQR:PuppeteerCrawler: Status message: Crawled 31 pages, 0 errors.
2023-05-04 18:42:35.057 INFO master - PQR:PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":12,"desiredConcurrency":13,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.16},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-04 18:42:35.061 WARN master - PQR:PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 7553 MB of 8025 MB (94%). Consider increasing available memory.
OK, after some testing, my current theory is that once I add process.env.CRAWLEE_MEMORY_MBYTES = "30720"; to my main.ts file, my 8-vCPU machine's CPU becomes the bottleneck, so it scales to ~36 browser tabs, at which point the CPU is overloaded - so the memory issues inherent in the current RequestQueue implementation may not be an issue for me anymore. It will take about 24 hours of my scripts running to determine if I'm right, but I'll try to remember to report back.
NeoNomade
NeoNomade3y ago
36 browser tabs is a lot for 8 vCPUs. Try using incognito windows.
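If this refers to Crawlee's per-page incognito contexts, a minimal sketch would be something like the following (PuppeteerCrawler and a router are assumed from the thread; the concurrency value is illustrative):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Each page gets its own throwaway incognito browser context,
        // which is cheaper than opening separate browser windows.
        useIncognitoPages: true,
    },
    maxConcurrency: 8,
    requestHandler: router,
});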
wise-white
wise-white3y ago
Interesting, I didn't think incognito windows would make things any better - any reason why that would help? I tried many things today, changing the args (at least the ones I can safely use; in my case it's not safe to use --no-sandbox), and got a version working that uses a fork of chrome-aws-lambda. I'm entirely new to Node.js though - I'm a Python guy - so I'm not sure how to optimize anything around the event loop, and it seems to be the limiter at the moment:
{"currentConcurrency":10,"desiredConcurrency":15,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.6,"actualRatio":0.724},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
Hmm, despite setting process.env.CRAWLEE_AVAILABLE_MEMORY_RATIO = "0.7"; I'm still seeing memInfo... limitRatio 0.2... should I be specifying this somewhere else? Hmm, still haven't gotten through this issue it seems. Running 20k URLs at a time and it's crashing with this error:
ProtocolError: Protocol error (Page.addScriptToEvaluateOnNewDocument): Target closed
at new Callback (/node_modules/puppeteer-core/src/common/Connection.ts:65:12)
at CallbackRegistry.create (/node_modules/puppeteer-core/src/common/Connection.ts:126:22)
at Connection._rawSend (/node_modules/puppeteer-core/src/common/Connection.ts:266:22)
at CDPSessionImpl.send (/node_modules/puppeteer-core/src/common/Connection.ts:525:29)
at CDPPage.evaluateOnNewDocument (/node_modules/puppeteer-core/src/common/Page.ts:1268:24)
at Plugin.onPageCreated (/node_modules/puppeteer-extra-plugin-stealth/evasions/navigator.webdriver/index.js:19:16)
at Plugin._onTargetCreated (/node_modules/puppeteer-extra-plugin/src/index.ts:544:22)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
And the previous state reported was:
PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":16,"desiredConcurrency":17,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.556},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
So maybe it's not a memory issue after all? I'll try dropping things down from 20k to 5k and report back, but I need to get this solved.
Pepa J
Pepa J3y ago
It could be a memory issue - your tabs may be closed by the browser when it doesn't have enough resources - or it could be an issue in the code, e.g. using an old page reference after you navigated to a different URL, so the reference is no longer valid.
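As an illustration of the stale-reference case described above (not taken from the questioner's code; the router is hypothetical):
let savedPage; // anti-pattern: keeping the page reference outside the handler
router.addDefaultHandler(async ({ page }) => {
    savedPage = page; // the crawler owns `page` and may close or reuse it after the handler returns
    // do all page work here, while the handler is still running
});
// ...calling savedPage.evaluate(...) later can fail with errors like "Target closed".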
