map maximum size exceeded
I get the following error:
The script at this point is using 11 GB of RAM (I've allowed a 40 GB max heap size).
Last status message:
How in the world can I overcome this issue?
17 Replies
@Pepa J @HonzaS, anybody?
@NeoNomade Do you have a Map with over 16 777 200 items? You should generally avoid doing this. Otherwise you could use any BigMap implementation (https://gist.github.com/josephrocca/44e4c0b63828cfc6d6155097b2efc113) to get around this.
Gist: BigMap - wrapper to get past the ~16 million key limit on JavaScript Maps - BigMap.js
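For context, a BigMap wrapper like the one in the linked gist generally works by spreading entries across several internal Maps and starting a new one whenever the current Map nears the ~16.7 million entry ceiling. A minimal sketch of that idea, not the gist's exact code, might look like this:

```ts
// Sketch of the BigMap idea: spill into a new internal Map whenever the
// current one nears the V8 limit (~16.7M entries). Illustration only.
class BigMap<K, V> {
    private maps: Map<K, V>[] = [new Map()];
    private static readonly LIMIT = 16_000_000; // stay safely under ~16.7M

    set(key: K, value: V): this {
        // Update in place if the key already lives in one of the internal Maps.
        for (const m of this.maps) {
            if (m.has(key)) {
                m.set(key, value);
                return this;
            }
        }
        // Otherwise append to the last Map, opening a new one when it is full.
        let last = this.maps[this.maps.length - 1];
        if (last.size >= BigMap.LIMIT) {
            last = new Map();
            this.maps.push(last);
        }
        last.set(key, value);
        return this;
    }

    get(key: K): V | undefined {
        for (const m of this.maps) {
            if (m.has(key)) return m.get(key);
        }
        return undefined;
    }

    has(key: K): boolean {
        return this.maps.some((m) => m.has(key));
    }

    get size(): number {
        return this.maps.reduce((sum, m) => sum + m.size, 0);
    }
}
```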
rival-black•3y ago
According to this article: https://habr.com/ru/companies/yandex/articles/666870/ (the translated version is, imho, clear enough), you need to avoid nesting anything beyond a few thousand elements, as well as huge arrays.
@Pepa J how can I implement this if the URLs are generated by enqueueLinks?
eastern-cyan•3y ago
You are scraping a page that has more than 16 777 200 links?
Anyway, you can instead use the crawler.addRequests function and add requests in batches.
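As a rough illustration of that suggestion (the selector, the 'DETAIL' label, and the batch size below are made up, and a PuppeteerCrawler is assumed), links could be collected with a plain selector and pushed to the queue in chunks via crawler.addRequests, which is exposed on the crawling context:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, crawler, log }) {
        // Collect hrefs with a plain selector instead of enqueueLinks().
        const hrefs = await page.$$eval('a.product-link', (anchors) =>
            anchors.map((a) => (a as HTMLAnchorElement).href),
        );

        // Add the requests in smaller batches so no huge array of Request
        // objects is built in memory at once.
        const BATCH_SIZE = 1000;
        for (let i = 0; i < hrefs.length; i += BATCH_SIZE) {
            const batch = hrefs
                .slice(i, i + BATCH_SIZE)
                .map((url) => ({ url, label: 'DETAIL' }));
            await crawler.addRequests(batch);
            log.info(`Added ${batch.length} requests (${i + batch.length}/${hrefs.length}).`);
        }
    },
});
```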
@NeoNomade Ah sorry, didn't know you're using enqueueLinks. enqueueLinks is basically a quite limited and simple interface for parsing links based on selectors and adding them as requests to the queue.
Just to be sure, you have a single page with over 16 million links?
Otherwise I would go for crawler.addRequests, as @HonzaS suggested. You need to parse the links in batches - this could be quite simple but also very hard depending on the page structure - I am currently not sure how to scrape, say, the first 1000 links from the page and then another batch 🤔
It's not a single page, the scraper goes through a website with maaany categories and subcategories :)) and this is the resulting amount
eastern-cyan•3y ago
But enqueueLinks enqueues links only from the single page; on another page it is called again.
The issue is that at concurrency 32, the URLs are consumed too slowly and they add up really quickly.
This is my routes file:
https://pastebin.com/ah28RBNK
eastern-cyan•3y ago
And on what line is the error occurring?
28 from what I remember
eastern-cyan•3y ago
So, about enqueueLinks: I can imagine that this could occur somewhere deep inside Crawlee when it is deduplicating requests while working with a large request queue. In that case crawler.addRequests probably would not help you.
The only thing that comes to my mind is to get around this by running the crawler multiple times, once per category. That way the queue should be much smaller.
I'm trying now to divide it into a product URL discovery spider and a product spider.
Extract all product URLs and then read them from a file and navigate those.
If this isn't enough, I will try to also divide into smaller categories.
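A minimal sketch of that two-stage split, assuming the discovery crawler stores product URLs in a Crawlee Dataset and the product crawler reads them back (the selectors and start URLs below are placeholders, not taken from the thread):

```ts
import { PuppeteerCrawler, Dataset } from 'crawlee';

// Stage 1: discovery spider - walks categories and only records product URLs.
const discoveryCrawler = new PuppeteerCrawler({
    async requestHandler({ page, enqueueLinks }) {
        // Keep walking categories/subcategories (selector is hypothetical).
        await enqueueLinks({ selector: 'a.category-link' });

        // Store product URLs instead of enqueueing them.
        const productUrls = await page.$$eval('a.product-link', (anchors) =>
            anchors.map((a) => (a as HTMLAnchorElement).href),
        );
        await Dataset.pushData(productUrls.map((url) => ({ url })));
    },
});
await discoveryCrawler.run(['https://example.com/categories']);

// Stage 2: product spider - reads the collected URLs and scrapes details.
// In practice you may want to give each crawler its own named RequestQueue.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
const productCrawler = new PuppeteerCrawler({
    async requestHandler({ request, page }) {
        // ...scrape the product page here...
    },
});
await productCrawler.run(items.map((item) => String(item.url)));
```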
Can I somehow check the length of the request list?
So I can put the enqueueLinks calls in an if block :)) to stop them when the request queue gets to 16 million items? :))
Or can I use multiple request queues in order to achieve this?
eastern-cyan•3y ago
The requestQueue has some methods like getInfo: https://crawlee.dev/api/core/class/RequestQueue#getInfo
You can create multiple crawlers with multiple requestQueues.
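A rough sketch combining both ideas, with made-up queue names, start URLs, and an arbitrary threshold below the Map limit; this only illustrates getInfo and named queues, not a recommendation from the thread:

```ts
import { PuppeteerCrawler, RequestQueue } from 'crawlee';

// Two named queues so each crawler works against its own, smaller queue.
const queueA = await RequestQueue.open('categories-a');
const queueB = await RequestQueue.open('categories-b');

const crawlerA = new PuppeteerCrawler({
    requestQueue: queueA,
    async requestHandler({ enqueueLinks }) {
        // Only keep enqueueing while the queue is comfortably under the Map limit.
        const info = await queueA.getInfo();
        if ((info?.totalRequestCount ?? 0) < 15_000_000) {
            await enqueueLinks({ selector: 'a' });
        }
    },
});

const crawlerB = new PuppeteerCrawler({
    requestQueue: queueB,
    async requestHandler({ enqueueLinks }) {
        const info = await queueB.getInfo();
        if ((info?.totalRequestCount ?? 0) < 15_000_000) {
            await enqueueLinks({ selector: 'a' });
        }
    },
});

// Run them one after the other (or split the categories between them up front).
await crawlerA.run(['https://example.com/category-a']);
await crawlerB.run(['https://example.com/category-b']);
```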
I just have to see how to divide the workload between the crawlers.
@HonzaS I'm trying to do it like this, but it only works once :))
It's a bit messy, I know, but I was thinking that if I can use multiple queues it should work.
I'm testing with small batches of 431.