map maximum size exceeded
I get the following error:
The script at this point is using 11 GB of RAM (I've allowed a 40 GB max heap size).
Last status message:
How in the world can I overcome this issue?
17 Replies
@Pepa J @HonzaS, anybody?
@NeoNomade Do you have a Map with over 16 777 200 items? You should generally avoid doing this. Otherwise you could use any BigMap implementation (https://gist.github.com/josephrocca/44e4c0b63828cfc6d6155097b2efc113) to get around this.
Gist: BigMap - wrapper to get past the ~16 million key limit on JavaScript Maps - BigMap.js
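For context, a BigMap wrapper like the one in the linked gist generally works by spreading entries across several internal Maps and starting a new one whenever the current Map nears the ~16.7 million entry ceiling. A minimal sketch of that idea, not the gist's exact code, might look like this:

```ts
// Sketch of the BigMap idea: spill into a new internal Map whenever the
// current one nears the V8 limit (~16.7M entries). Illustration only.
class BigMap<K, V> {
    private maps: Map<K, V>[] = [new Map()];
    private static readonly LIMIT = 16_000_000; // stay safely under ~16.7M

    set(key: K, value: V): this {
        // Update in place if the key already lives in one of the internal Maps.
        for (const m of this.maps) {
            if (m.has(key)) {
                m.set(key, value);
                return this;
            }
        }
        // Otherwise append to the last Map, opening a new one when it is full.
        let last = this.maps[this.maps.length - 1];
        if (last.size >= BigMap.LIMIT) {
            last = new Map();
            this.maps.push(last);
        }
        last.set(key, value);
        return this;
    }

    get(key: K): V | undefined {
        for (const m of this.maps) {
            if (m.has(key)) return m.get(key);
        }
        return undefined;
    }

    has(key: K): boolean {
        return this.maps.some((m) => m.has(key));
    }

    get size(): number {
        return this.maps.reduce((sum, m) => sum + m.size, 0);
    }
}
```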
rival-black•3y ago
According to this article: https://habr.com/ru/companies/yandex/articles/666870/ (the translated version is, imho, clear enough), you need to avoid nesting anything beyond a few thousand elements, as well as huge arrays.
@Pepa J how can I implement this if the URLs are generated by enqueueLinks?
eastern-cyan•3y ago
You are scraping a page that has more than 16 777 200 links?
Anyway, you can instead use the crawler.addRequests function and add requests in batches.
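As a rough illustration of that suggestion (the selector, the 'DETAIL' label, and the batch size below are made up, and a PuppeteerCrawler is assumed), links could be collected with a plain selector and pushed to the queue in chunks via crawler.addRequests, which is exposed on the crawling context:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, crawler, log }) {
        // Collect hrefs with a plain selector instead of enqueueLinks().
        const hrefs = await page.$$eval('a.product-link', (anchors) =>
            anchors.map((a) => (a as HTMLAnchorElement).href),
        );

        // Add the requests in smaller batches so no huge array of Request
        // objects is built in memory at once.
        const BATCH_SIZE = 1000;
        for (let i = 0; i < hrefs.length; i += BATCH_SIZE) {
            const batch = hrefs
                .slice(i, i + BATCH_SIZE)
                .map((url) => ({ url, label: 'DETAIL' }));
            await crawler.addRequests(batch);
            log.info(`Added ${batch.length} requests (${i + batch.length}/${hrefs.length}).`);
        }
    },
});
```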
@NeoNomade Ah sorry, didn't know you're using enqueueLinks. enqueueLinks is basically a quite limited and simple interface for parsing links based on selectors and adding them as requests to the queue.
Just to be sure, you have a single page with over 16 million links?
Otherwise I would go for crawler.addRequests, as @HonzaS suggested. You need to parse the links in batches - this could be quite simple but also very hard depending on the page structure - I am currently not sure how to scrape, say, the first 1000 links from the page and then another batch 🤔
It's not a single page, the scraper goes through a website with maaany categories and subcategories :)) and this is the resulting amount
eastern-cyan•3y ago
But enqueueLinks enqueues links only from the single page; on another page it is called again.
The issue is that at concurrency 32, the URLs are consumed too slowly and they add up really quickly.
This is my routes file:
https://pastebin.com/ah28RBNK
eastern-cyan•3y ago
And on what line is the error occurring?
28 from what I remember
eastern-cyan•3y ago
So, about enqueueLinks: I can imagine that this could occur somewhere deep inside Crawlee when it is deduplicating requests while working with a large request queue. In that case crawler.addRequests probably would not help you.
The only thing that comes to my mind is to get around this by running the crawler multiple times, once per category. That way the queue should be much smaller.
I'm trying now to divide it into a product URL discovery spider and a product spider.
Extract all product URLs and then read them from a file and navigate those.
If this isn't enough, I will try to also divide into smaller categories.
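A minimal sketch of that two-stage split, assuming the discovery crawler stores product URLs in a Crawlee Dataset and the product crawler reads them back (the selectors and start URLs below are placeholders, not taken from the thread):

```ts
import { PuppeteerCrawler, Dataset } from 'crawlee';

// Stage 1: discovery spider - walks categories and only records product URLs.
const discoveryCrawler = new PuppeteerCrawler({
    async requestHandler({ page, enqueueLinks }) {
        // Keep walking categories/subcategories (selector is hypothetical).
        await enqueueLinks({ selector: 'a.category-link' });

        // Store product URLs instead of enqueueing them.
        const productUrls = await page.$$eval('a.product-link', (anchors) =>
            anchors.map((a) => (a as HTMLAnchorElement).href),
        );
        await Dataset.pushData(productUrls.map((url) => ({ url })));
    },
});
await discoveryCrawler.run(['https://example.com/categories']);

// Stage 2: product spider - reads the collected URLs and scrapes details.
// In practice you may want to give each crawler its own named RequestQueue.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
const productCrawler = new PuppeteerCrawler({
    async requestHandler({ request, page }) {
        // ...scrape the product page here...
    },
});
await productCrawler.run(items.map((item) => String(item.url)));
```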
Can I somehow check the length of the request list?
So I can put the enqueueLinks calls in an if block :)) to stop them when the request queue gets to 16 million items? :))
Or can I use multiple request queues in order to achieve this?
eastern-cyan•3y ago
The requestQueue has some methods like getInfo: https://crawlee.dev/api/core/class/RequestQueue#getInfo
You can create multiple crawlers with multiple requestQueues.
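A rough sketch combining both ideas, with made-up queue names, start URLs, and an arbitrary threshold below the Map limit; this only illustrates getInfo and named queues, not a recommendation from the thread:

```ts
import { PuppeteerCrawler, RequestQueue } from 'crawlee';

// Two named queues so each crawler works against its own, smaller queue.
const queueA = await RequestQueue.open('categories-a');
const queueB = await RequestQueue.open('categories-b');

const crawlerA = new PuppeteerCrawler({
    requestQueue: queueA,
    async requestHandler({ enqueueLinks }) {
        // Only keep enqueueing while the queue is comfortably under the Map limit.
        const info = await queueA.getInfo();
        if ((info?.totalRequestCount ?? 0) < 15_000_000) {
            await enqueueLinks({ selector: 'a' });
        }
    },
});

const crawlerB = new PuppeteerCrawler({
    requestQueue: queueB,
    async requestHandler({ enqueueLinks }) {
        const info = await queueB.getInfo();
        if ((info?.totalRequestCount ?? 0) < 15_000_000) {
            await enqueueLinks({ selector: 'a' });
        }
    },
});

// Run them one after the other (or split the categories between them up front).
await crawlerA.run(['https://example.com/category-a']);
await crawlerB.run(['https://example.com/category-b']);
```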
I just have to see how to divide the workload between the crawlers.
@HonzaS I'm trying to do it like this, but it only works once :))
It's a bit messy, I know, but I was thinking that if I can use multiple queues it should work.
I'm testing with small batches of 431.