Node running out of memory

I'm scraping several e-commerce stores in a single project, and after about 30k products Node crashes because it runs out of memory. Raising the amount of memory allocated to Node is not a good solution, as I plan to increase the incoming data by at least 10x. The most obvious solution seems to be scaling horizontally and running a Node instance for each e-commerce store I want to scrape. However, is there any way to decrease the memory load that Crawlee uses? I would be happy to use streaming for exporting the datasets, and the dataset items are already persisted to local files.
69 Replies
NeoNomade
NeoNomade3y ago
Playwright, Puppeteer, or Cheerio?
constant-blue
constant-blueOP3y ago
JSDOM
NeoNomade
NeoNomade3y ago
Not familiar with it, but you may need to close/dispose of the used windows somehow.
flat-fuchsia
flat-fuchsia3y ago
You can lower the maxConcurrency.
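For reference, a minimal sketch of capping concurrency on a Crawlee crawler (JSDOMCrawler assumed, since that is what the OP is using; the option values are illustrative):
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Fewer parallel requests means fewer response bodies and DOM instances held in memory at once.
    maxConcurrency: 10,
    // Optional extra throttle on request rate.
    maxRequestsPerMinute: 120,
    async requestHandler({ request, window }) {
        // ...extract data from window.document here...
    },
});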
Pepa J
Pepa J3y ago
@ᗜˬᗜ There has to be something in your implementation causing these OOM issues; in most cases it is incorrect use of recursion, or processing big files through Buffers instead of Streams. But it is hard to investigate without the source code or a log from the run.
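For context, the Buffer-vs-Stream difference looks roughly like this in plain Node.js (the file name and record handling are illustrative, not from the OP's code):
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Buffer approach: fs.readFile loads the whole file into memory at once.
// Streaming approach: process one line at a time, keeping memory usage flat.
const lines = createInterface({ input: createReadStream('huge-export.ndjson') });
for await (const line of lines) {
    // handle one record here
}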
constant-blue
constant-blueOP3y ago
Here's what one of the crawlers looks like.
constant-blue
constant-blueOP3y ago
And the wrapper class
Pepa J
Pepa J3y ago
@ᗜˬᗜ What is the size of the dataset in MB, how many items do you have there, and how much memory did you set for the run? I could check it if you provide the RunId (feel free to do it in a PM).
constant-blue
constant-blueOP3y ago
Thanks. I'll replicate the error again when I get home and send the data.
Pepa J
Pepa J3y ago
Are you running it locally or on the platform?
constant-blue
constant-blueOP3y ago
Locally, since I want to integrate with Azure.
Pepa J
Pepa J3y ago
You may log the amount of memory consumed and see where or what is increasing it: https://www.geeksforgeeks.org/node-js-process-memoryusage-method/
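A minimal sketch of such periodic logging with process.memoryUsage() (the interval and formatting are illustrative):
const toMB = (bytes) => Math.round(bytes / 1024 / 1024);
setInterval(() => {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    console.log(`rss=${toMB(rss)} MB, heapUsed=${toMB(heapUsed)} MB, heapTotal=${toMB(heapTotal)} MB`);
}, 30_000); // log every 30 seconds while the crawler runs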
constant-blue
constant-blueOP3y ago
I did some more testing on my existing crawlers, and individually they (most often) do not crash, and when they do crash it's not because of heap allocation. For 40k products the Node process consumes less than 1 GB of RAM on a particular crawler. I'll run the crawlers individually in Azure Functions, I suppose, since I want parallelism and scheduling either way. My guess is that it runs out of memory because of the huge queues; the scrapers are crawl-intensive and traverse 1000+ links each.
Pepa J
Pepa J3y ago
1000+ links should be fine. Some of our Actors handle 500,000+ pages without issues. Can you post the error you are getting?
constant-blue
constant-blueOP3y ago
<--- Last few GCs --->

[4020:000001EAE77572B0] 978439 ms: Mark-Compact 4034.1 (4137.2) -> 4020.2 (4139.7) MB, 1516.3 / 0.6 ms (average mu = 0.177, current mu = 0.087) allocation failure; scavenge might not succeed
[4020:000001EAE77572B0] 981284 ms: Mark-Compact 4036.4 (4139.9) -> 4024.9 (4144.2) MB, 2790.5 / 0.0 ms (average mu = 0.080, current mu = 0.019) allocation failure; scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 00007FF7CE58234F node_api_throw_syntax_error+179983
2: 00007FF7CE506986 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+61942
3: 00007FF7CE508693 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+69379
4: 00007FF7CF046411 v8::Isolate::ReportExternalAllocationLimitReached+65
5: 00007FF7CF031066 v8::internal::V8::FatalProcessOutOfMemory+662
6: 00007FF7CEE97770 v8::internal::EmbedderStackStateScope::ExplicitScopeForTesting+144
7: 00007FF7CEE9478D v8::internal::Heap::CollectGarbage+4749
8: 00007FF7CEEAA3B6 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath+2150
9: 00007FF7CEEAACEF v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath+95
10: 00007FF7CEEB9F10 v8::internal::Factory::NewFillerObject+448
11: 00007FF7CEB6F835 v8::internal::Runtime::SetObjectProperty+20997
12: 00007FF7CF0EFA61 v8::internal::SetupIsolateDelegate::SetupHeap+606705
13: 00007FF7CF13FE93 v8::internal::SetupIsolateDelegate::SetupHeap+935459
14: 00007FF74F65079A
 ELIFECYCLE  Command failed with exit code 134.
 ELIFECYCLE  Command failed with exit code 1.
It reached 5.5 GB of RAM (out of 32 total) and almost 40% CPU on an i7-11850H. The system I first got the error on had 16 GB of RAM and crashed faster. 28.5k products collected from 6 websites, and the queue has 6,500 files left. I can send you the storage folder if that gives some insight. Here are the latest logs from Crawlee:
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":16758,"requestsFinishedPerMinute":20,"requestsFailedPerMinute":0,"requestTotalDurationMillis":5161454,"requestsTotal":308,"crawlerRuntimeMillis":906972,"retryHistogram":[308]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":22207,"requestsFinishedPerMinute":37,"requestsFailedPerMinute":0,"requestTotalDurationMillis":12258059,"requestsTotal":552,"crawlerRuntimeMillis":906968,"retryHistogram":[547,5]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":21397,"requestsFinishedPerMinute":34,"requestsFailedPerMinute":0,"requestTotalDurationMillis":11105217,"requestsTotal":519,"crawlerRuntimeMillis":906973,"retryHistogram":[519]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":7455,"requestsFinishedPerMinute":44,"requestsFailedPerMinute":0,"requestTotalDurationMillis":4964858,"requestsTotal":666,"crawlerRuntimeMillis":903320,"retryHistogram":[662,4]}
INFO Statistics: JSDOMCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":13541,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2613361,"requestsTotal":193,"crawlerRuntimeMillis":903295,"retryHistogram":[193]}
Pepa J
Pepa J3y ago
@ᗜˬᗜ From my point of view, there are probably some memory leaks in your implementation; the most common causes are keeping buffers in memory and using recursion. Are you increasing the memory limit with:
export NODE_OPTIONS=--max_old_space_size=16384
?
constant-blue
constant-blueOP3y ago
I am not increasing the max memory (that would work, but only up to a point), and I am not working with buffers either. Anyway, I will move to running each crawler separately.
NeoNomade
NeoNomade3y ago
I'm having a similar issue: I'm scraping a site that has 12 million product pages (extracted from sitemaps) :)). I increased max_old_space_size to 64 GB; it's still loading the URLs and is at 16 GB at the moment and rising. Still failing. I tried dividing the main list of 12 million URLs into multiple lists and adding them sequentially, but still no luck. Now I'm stuck with this: terminate called after throwing an instance of 'std::bad_alloc'
fascinating-indigo
fascinating-indigo3y ago
@NeoNomade So you can't even add the requests? Or where exactly does it fail? crawler.addRequests() should add requests in batches, but given the number of requests I would recommend using a RequestList - have you tried that? https://crawlee.dev/api/core/class/RequestList
NeoNomade
NeoNomade3y ago
I can't even add all the requests; it crashes before starting - basically crawler.addRequests() is crashing. I will try with RequestList. I've restarted the process using the RequestList; it downloaded all the URLs and should start now. It's using 14 GB of RAM for the moment, and I'm still waiting to see whether it manages to start or crashes. It takes around 10 minutes to get all the URLs, and it has been running for 20 minutes. I'm watching the process with btop, and it's still doing single-threaded work, so I think it's still processing the huge queue of URLs. @Andrey Bykov it's not working, it keeps getting stuck. After I add the URLs to the request list, the process remains stuck.
NeoNomade
NeoNomade3y ago
const requestList = await RequestList.open(null, allUrls, {
    // Persist the state to avoid re-crawling which can lead to data duplications.
    // Keep in mind that the sources have to be immutable or this will throw an error.
    persistStateKey: 'My-ReqList',
});
console.log(requestList.length());

const crawler = new CheerioCrawler({
    requestList,
    proxyConfiguration,
    requestHandler: router,
    minConcurrency: 32,
    maxConcurrency: 256,
    maxRequestRetries: 20,
    navigationTimeoutSecs: 6,
    loggingInterval: 30,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 20 times.`);
    },
});
await crawler.run();
I've added that console.log and I don't see its output.
fascinating-indigo
fascinating-indigo3y ago
How long were you waiting? Given you have millions of URLs, it would still take some time to create and save them to disk. Hmm, I see your response in a different thread. RequestList (when you initialize it) saves a 'dump' of all URLs to disk, as a buffer if I remember correctly. But I am not sure whether it's OK for it to hold GBs of data...
NeoNomade
NeoNomade3y ago
4 hours. Now I'm using a for loop that iterates over the big list and calls addRequests for each URL. Watching the logs, at least I see that it is working - I've added logs for how many were added and how many are left. In 5 minutes it only added 170k, so I've changed the approach to creating chunks from the big list, because adding them one by one looks like it will take a few hours. Now it crashed at around 16 GB of RAM, even though I used max_old_space_size=35000.
fascinating-indigo
fascinating-indigo3y ago
I passed it to the team. On one hand it's not the most trivial use case; on the other hand, it should be able to handle millions of URLs.
NeoNomade
NeoNomade3y ago
Last time I started the script in pm2, changed the pm2 restart threshold to 32 GB of RAM, and also used max_old_space_size=350000, but I still get bad_alloc at around 16 GB of RAM used.
fascinating-indigo
fascinating-indigo3y ago
I got some news. So - this is expected behavior with memory-storage, which is used by default with Crawlee now (just due to the number of URLs). For your use case you need to use the local storage. To set it: https://crawlee.dev/api/core/interface/ConfigurationOptions#storageClient It should be something like:
import { Configuration } from 'crawlee';
import { ApifyStorageLocal } from '@apify/storage-local';

const storageLocal = new ApifyStorageLocal();
Configuration.getGlobalConfig().set('storageClient', storageLocal);
Could you give it a try and let me know whether it helped or not?
NeoNomade
NeoNomade3y ago
Yes, I'll add this to my main and let you know. @Andrey Bykov Cannot find package '@apify/storage-local' imported from /my_project/main.js
fascinating-indigo
fascinating-indigo3y ago
NeoNomade
NeoNomade3y ago
This is very interesting; somehow I think in the future Redis could be involved as a storage client for large & fast queues. I started the process; it adds requests to the queue in batches of 1 million. It takes around 10-15 minutes to gather the URLs.
Starting to add 12011440 urls to the queue
Added 1000000 to the queue. 11011440 left.

<--- Last few GCs --->

[866838:0x55d350008610] 702393 ms: Mark-sweep 3999.3 (4138.1) -> 3986.6 (4141.3) MB, 2922.0 / 0.0 ms (average mu = 0.224, current mu = 0.029) allocation failure; scavenge might not succeed
[866838:0x55d350008610] 706547 ms: Mark-sweep 4002.7 (4141.3) -> 3989.8 (4144.6) MB, 4061.1 / 0.0 ms (average mu = 0.116, current mu = 0.022) allocation failure; scavenge might not succeed
@Andrey Bykov I've added a 5-second delay between the batches; it helped in the past - I was left with only 3 million to add. With the 5-second delay between batches:
main > Starting to add 12011440 urls to the RequestList
Added 1000000 to the queue. 11011440 left.
Added 1000000 to the queue. 10011440 left.
Added 1000000 to the queue. 9011440 left.
Added 1000000 to the queue. 8011440 left.
Added 1000000 to the queue. 7011440 left.
Added 1000000 to the queue. 6011440 left.
Added 1000000 to the queue. 5011440 left.
Added 1000000 to the queue. 4011440 left.
Added 1000000 to the queue. 3011440 left.
Added 1000000 to the queue. 2011440 left.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
I will try to raise the delay more to see if it helps. Raised the delay, but it's stuck in the same place as above.
fascinating-indigo
fascinating-indigo3y ago
Wait - "Starting to add 12011440 urls to the RequestList" - you're not actually adding those to the list, you add them to the queue, right? A RequestList can only be initialized once... If I am right and you're adding these to the queue - how exactly are you doing it? crawler.addRequests() enqueues the first batch of 1000 and then continues in the background... Maybe try adding in batches, but instead of waiting for a few seconds, use the following option: https://crawlee.dev/api/basic-crawler/interface/CrawlerAddRequestsOptions#waitForAllRequestsToBeAdded - this will no doubt increase the runtime, but it won't have all those calls working in the background...
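A sketch of that suggestion - adding in chunks and waiting for each batch to be fully enqueued before the next one (chunkSize is illustrative; allUrls and crawler are assumed from the earlier snippets):
const chunkSize = 100_000;
for (let i = 0; i < allUrls.length; i += chunkSize) {
    const chunk = allUrls.slice(i, i + chunkSize);
    // Resolves only after the whole chunk is in the queue, so batches don't pile up in the background.
    await crawler.addRequests(chunk, { waitForAllRequestsToBeAdded: true });
}
await crawler.run();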
NeoNomade
NeoNomade3y ago
@Andrey Bykov Yes, to the queue - sorry, I forgot to change the log; I tried the RequestList too. https://pastebin.com/mvHX1yMa here is the current state of the code.
NeoNomade
NeoNomade3y ago
Tried this option; it's suuuuper slow. CPU usage dropped to between 10-20 percent, and watching the RAM usage in pm2, it seems to be stuck at 4 GB...
fascinating-indigo
fascinating-indigo3y ago
Hmm, another idea - basically with crawler.addRequests(), as mentioned, it starts with 1000 requests, starts the crawl, and then adds the rest. What you can do is just await crawler.run(chunk). Basically, the first chunk goes in with 1 million URLs and the crawler runs. Then the run promise resolves, you call crawler.run() with the second batch, etc. It will eventually use the same queue, same crawler, etc., but instead of adding all 12 million URLs in chunks and then starting the crawler, you add a chunk and process it, add a chunk and process it, etc.
NeoNomade
NeoNomade3y ago
Yes, but I'm writing to a CSV file in my routes.js. If I do it like that it will probably overwrite the file each time. @Oleg V. So should I try the same thing as in my code but with smaller chunks, or add the entire chunk to crawler.run()?
conscious-sapphire
conscious-sapphire3y ago
Try both, I guess :) I would try adding the entire chunk to crawler.run() first. If no success, try smaller chunks.
NeoNomade
NeoNomade3y ago
ok
NeoNomade
NeoNomade3y ago
Putting the entire chunk into crawler.run() and will update with the results here. Should I also keep the local storage? It takes around 10-15 minutes to collect all the URLs - there are 481 sitemaps to decompress and add to the list :)) std::bad_alloc. Trying the old code with chunks of 100k URLs... 2 million left to add... arghhhh, std::bad_alloc with only 1.8 million left. Should I lower the batch size even more @Oleg V.?
conscious-sapphire
conscious-sapphire3y ago
Yes, let's try. As we can see, it's much better with smaller batches.
NeoNomade
NeoNomade3y ago
60k URLs in a batch crashed in the same place. Hmm, seems like we hit a limit here; I don't know exactly how to tackle it...
fascinating-indigo
fascinating-indigo3y ago
Hmm, what exactly is written to CSV?
NeoNomade
NeoNomade3y ago
Two columns: one column is the URL, the other is the raw HTML encoded in base64. Working with super-huge domains, I moved the parsing to offline scripts, because any mistake in parsing can cost me days of scraping, which I can't afford in this project.
fascinating-indigo
fascinating-indigo3y ago
I still don't get it. Basically, each request is a separate call (which means a separate write to the CSV). So if you add requests the way I proposed yesterday, it should not really break things; it basically just feeds another batch of URLs to the crawler (unless I am missing something). With crawler.addRequests() it fails at the same place because you are overloading the queue - you are trying to push the new requests in parallel from 12 calls, which continue to work in the background, and eventually the memory gets overloaded.
NeoNomade
NeoNomade3y ago
import { Dataset, createCheerioRouter, EnqueueStrategy } from 'crawlee';
import { createObjectCsvWriter } from 'csv-writer';

global.productsCount = 0;
global.visitedUrls = new Set();

const csvWriter = createObjectCsvWriter({
    path: 'file.csv',
    header: [
        { id: 'url', title: 'url' },
        { id: 'html', title: 'html' },
    ],
});

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, request, body, $ }) => {
    log.debug(`${$('title').toString()} || scraped`);
    const exportDict = {
        url: request.url,
        html: Buffer.from(body, 'utf-8').toString('base64'),
    };
    await csvWriter.writeRecords([exportDict]);
    // await Dataset.pushData();
    global.productsCount++;
    log.info(`${global.productsCount} products scraped`);
    $ = null;
    body = null;
});
Here is my routes.js.
fascinating-indigo
fascinating-indigo3y ago
I don't know exactly how csv-writer works, but as I mentioned, every URL is basically a separate csvWriter.writeRecords() call. createObjectCsvWriter is still called once - even if you process the first batch and then have another crawler.run(), etc., you are still using the same instance of csvWriter. Also, btw, about CSV - you could use the default Actor.pushData() - it writes separate JSONs, and once all requests are finished you could call exportToCSV: https://crawlee.dev/api/core/class/Dataset#exportToCSV
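A rough sketch of that flow using Crawlee's Dataset helpers (the exact call placement is up to you; request and encodedHtml are illustrative names, not from the code above):
import { Dataset } from 'crawlee';

// Inside the request handler: one JSON item per scraped page.
await Dataset.pushData({ url: request.url, html: encodedHtml });

// After crawler.run() resolves: export the default dataset as a single CSV
// into the default key-value store (see the exportToCSV docs linked above).
await Dataset.exportToCSV('OUTPUT');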
NeoNomade
NeoNomade3y ago
Writing separate JSONs and doing the CSV export afterwards will require too much space - last time I scraped this domain with the method above, the produced CSV was 1 TB for 3 million products. The issue with your idea is that if I put crawler.run() in a for loop, and the crawler restarts on each iteration, the file will be overwritten - createObjectCsvWriter also initializes the file.
fascinating-indigo
fascinating-indigo3y ago
Oh wow, OK, not an option apparently 😄 But still - if you call the script once and just have several crawler.run() calls, it should still use the same csvWriter instance, so it would not re-initialize the file (if it did, it would do it for each request). You're still running the same app/script: csvWriter is created in the global context once, then the crawler instance is created once, and then you just feed in the URLs.
NeoNomade
NeoNomade3y ago
So your idea is something like this?
const chunkSize = 60000;
for (let i = 0; i < allUrls.length; i += chunkSize) {
    const chunk = allUrls.slice(i, i + chunkSize);
    await crawler.run(chunk);
    console.log(`Added ${chunk.length} to the queue. ${totalCount -= chunk.length} left.`);
}
fascinating-indigo
fascinating-indigo3y ago
Yep. You could try it (just to test) with a chunk of, say, 100 to confirm the file stays in place.
NeoNomade
NeoNomade3y ago
OK, changing the chunkSize to 100 and starting it to see if it works.
fascinating-indigo
fascinating-indigo3y ago
So basically this way each crawler.run() call will add 1000 requests to the queue and start processing them, while adding the rest in the background. Once the chunk is processed, the promise resolves and it moves on to the next cycle...
NeoNomade
NeoNomade3y ago
I've also reduced the total number of URLs because it takes too long to get all of them. Now this looks a bit strange :)) It has already scraped 400 items, which somehow means they get added to the queue in the background while the main process keeps running.
fascinating-indigo
fascinating-indigo3y ago
Well, that's how it's supposed to work - add 1000, start scraping, and the rest is added in the background. And btw, I guess with this scenario you could try to ditch the local storage again.
NeoNomade
NeoNomade3y ago
Let's see. I will also put in the entire queue of URLs and let it run if it works.
fascinating-indigo
fascinating-indigo3y ago
fingers crossed, hopefully it will finally work as expected 🙂
NeoNomade
NeoNomade3y ago
Fingers crossed! Thanks a lot for all your help, really appreciate it. Started the monster.
constant-blue
constant-blueOP3y ago
@NeoNomade by the way, what node version do you use?
NeoNomade
NeoNomade3y ago
18.15.0 @Andrey Bykov
0|main | SyntaxError: Unexpected end of JSON input
0|main | at JSON.parse (<anonymous>)
0|main | at RequestQueueFileSystemEntry.get (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/memory-storage/fs/request-queue/fs.js:28:25)
0|main | at async RequestQueueClient.listHead (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/memory-storage/resource-clients/request-queue.js:147:29)
0|main | at async RequestQueue._ensureHeadIsNonEmpty (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/core/storages/request_queue.js:610:101)
0|main | at async RequestQueue.isEmpty (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/core/storages/request_queue.js:526:9)
0|main | at async CheerioCrawler._isTaskReadyFunction (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/basic/internals/basic-crawler.js:762:38)
0|main | at async AutoscaledPool._maybeRunTask (/run/media/neonomade/work/technitool_scrapers/Zoro_Cheerio/node_modules/@crawlee/core/autoscaling/autoscaled_pool.js:481:27)
I think I have to keep the local storage :))
fascinating-indigo
fascinating-indigo3y ago
Frankly, I don't even come close to understanding why this error is here. It's about malformed JSON in the queue, but I don't know exactly where it comes from. So yeah - if local storage works for you, just keep it 👍
NeoNomade
NeoNomade3y ago
Sometimes it happens randomly. I've been using Crawlee quite intensely for quite some time, and I just bump into it occasionally; restart the scraper and it works. I've restarted it and now it works - it started to scrape. It started super fast, around 500 products per minute, but now it does about 100-150 products per minute. I think at this point the scraping is faster than the queue :))
wise-white
wise-white3y ago
I'm having similar memory problems to @NeoNomade, but using a Puppeteer crawler with ~2 million URLs. I'm getting different errors, but they all seem plausibly caused by overloaded memory, and I'm also getting memory usage warnings from Crawlee, so I suspect that's the issue. I'm new to Crawlee - how can I divide up my requestQueue and load it in batches the way @NeoNomade has? The very basic existing code I have is below.
const requestQueue = await RequestQueue.open('my-saved-request-que-file');
crawler.requestQueue = requestQueue;
log.info("starting crawler");
await crawler.run();
In my case, there's userData in each JSON file in the folder where the request_queue data is saved. An example of one of the JSON files:
{
  "id": "CUXKCDUF0o4ot1O",
  "json": {"id":"CUXKCDUF0o4ot1O","url":"http://pendletonartcenter.com","uniqueKey":"http://pendletonartcenter.com","method":"GET","noRetry":false,"retryCount":0,"errorMessages":[],"headers":{},"userData":{"id":"14958955938","this_db_key":"master - PQR","page_type":"home","domain":"pendletonartcenter.com"}},
  "method": "GET",
  "orderNo": 1683159099449,
  "retryCount": 0,
  "uniqueKey": "http://pendletonartcenter.com",
  "url": "http://pendletonartcenter.com"
}
wise-white
wise-white3y ago
Maybe if I rearranged the contents of the requests_queue folder into a bunch of folders that each hold a certain-sized batch of requests (splitting up the one massive folder where all the JSON files currently sit), I could then loop over those folders, run await RequestQueue.open('smaller-folder-of-request-queue-json-files'), wait for the crawler to finish with those using some kind of await (or just brute-force wait the number of seconds I estimate the crawler needs to process them), and keep going in that loop until it's done? This run would take ~30+ days 😅 at the speed it's currently running (~1.3 seconds per URL, even when running 17-25 browser tabs in parallel).
OK, I rewrote it to put 200 URLs into a requestQueue, set crawler.requestQueue to that saved queue, then call crawler.run(). It's surprising to me that within the first 31 pages it gives this warning:
2023-05-04 18:42:34.910 INFO master - PQR:PuppeteerCrawler: Status message: Crawled 31 pages, 0 errors.
2023-05-04 18:42:35.057 INFO master - PQR:PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":12,"desiredConcurrency":13,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.16},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-04 18:42:35.061 WARN master - PQR:PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 7553 MB of 8025 MB (94%). Consider increasing available memory.
OK, after some testing, my current theory is that once I add process.env.CRAWLEE_MEMORY_MBYTES = "30720"; to my main.ts file, my 8-vCPU machine's CPU becomes the bottleneck, so it scales to ~36 browser tabs, at which point the CPU is overloaded - so the memory issues inherent in the current RequestQueue implementation may not be an issue for me anymore. It will take about 24 hours of my scripts running to determine if I'm right, but I'll try to remember to report back.
NeoNomade
NeoNomade3y ago
36 browser tabs is a lot for 8 vCPUs. Try using incognito windows.
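If this refers to Crawlee's per-page incognito contexts, a minimal sketch would be something like the following (PuppeteerCrawler and a router are assumed from the thread; the concurrency value is illustrative):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Each page gets its own throwaway incognito browser context,
        // which is cheaper than opening separate browser windows.
        useIncognitoPages: true,
    },
    maxConcurrency: 8,
    requestHandler: router,
});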
wise-white
wise-white3y ago
Interesting, I didn't think incognito windows would make things any better - any reason why that would help? I tried many things today, changing the args (at least the ones I can safely use; in my case it's not safe to use --no-sandbox), and got a version working that uses a fork of chrome-aws-lambda. I'm entirely new to Node.js though - I'm a Python guy - so I'm not sure how to optimize anything around the event loop, and it seems to be the limiter at the moment:
{"currentConcurrency":10,"desiredConcurrency":15,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.6,"actualRatio":0.724},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
Hmm, despite setting process.env.CRAWLEE_AVAILABLE_MEMORY_RATIO = "0.7"; I'm still seeing memInfo... limitRatio 0.2... should I be specifying this somewhere else? Hmm, still haven't gotten through this issue it seems. Running 20k URLs at a time and it's crashing with this error:
ProtocolError: Protocol error (Page.addScriptToEvaluateOnNewDocument): Target closed
at new Callback (/node_modules/puppeteer-core/src/common/Connection.ts:65:12)
at CallbackRegistry.create (/node_modules/puppeteer-core/src/common/Connection.ts:126:22)
at Connection._rawSend (/node_modules/puppeteer-core/src/common/Connection.ts:266:22)
at CDPSessionImpl.send (/node_modules/puppeteer-core/src/common/Connection.ts:525:29)
at CDPPage.evaluateOnNewDocument (/node_modules/puppeteer-core/src/common/Page.ts:1268:24)
at Plugin.onPageCreated (/node_modules/puppeteer-extra-plugin-stealth/evasions/navigator.webdriver/index.js:19:16)
at Plugin._onTargetCreated (/node_modules/puppeteer-extra-plugin/src/index.ts:544:22)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
And the previous state reported was:
PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":16,"desiredConcurrency":17,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.556},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
So maybe it's not a memory issue after all? I'll try dropping things down from 20k to 5k and report back, but I need to get this solved.
Pepa J
Pepa J3y ago
It could be a memory issue - your tabs may be closed by the browser when it doesn't have enough resources - or it could be an issue in the code, e.g. using an old page reference after you navigated to a different URL, so the reference is no longer valid.
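As an illustration of the stale-reference case described above (not taken from the questioner's code; the router is hypothetical):
let savedPage; // anti-pattern: keeping the page reference outside the handler
router.addDefaultHandler(async ({ page }) => {
    savedPage = page; // the crawler owns `page` and may close or reuse it after the handler returns
    // do all page work here, while the handler is still running
});
// ...calling savedPage.evaluate(...) later can fail with errors like "Target closed".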
