Addressing Playwright memory limitations in Crawlee

Hello, I am currently using Crawlee on a medium-sized project and I am generally happy with it. I am targeting e-commerce websites and I am interested in how various products are presented on the page, so I opted for a browser automation solution to be able to "see" the page, with Playwright as the automation tool.

Recently I noticed some of my scraping instances fail with the following error: "While handling this request, the container instance was found to be using too much memory and was terminated." I did some digging around the web and found this: https://stackoverflow.com/questions/72954376/python-playwright-memory-overlad

It seems that the Playwright context just grows over time. It is a known issue, but Playwright itself will not handle it, because it is primarily a web testing tool, not a scraping tool. The suggested workaround is to save the state of the context to disk and restart the context every once in a while.

I was wondering if Crawlee has any out-of-the-box functionality that applies this workaround. If not, did anyone else encounter this problem? How did you fix it?
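(For reference, the workaround described in that thread looks roughly like the sketch below when written against plain Playwright in Node.js. This is a minimal sketch, not Crawlee functionality; the 50-page threshold and the state.json path are illustrative.)

import { chromium } from 'playwright';

const browser = await chromium.launch();
let context = await browser.newContext();
let pagesHandled = 0;

// Persist the context's cookies/localStorage to disk, close the context to
// release its accumulated memory, then recreate it from the saved state.
async function recycleContextIfNeeded() {
    if (++pagesHandled < 50) return; // 50 is an arbitrary threshold
    await context.storageState({ path: 'state.json' }); // save cookies + storage
    await context.close(); // frees the context's memory
    context = await browser.newContext({ storageState: 'state.json' });
    pagesHandled = 0;
}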
5 Replies
yelping-magenta (3y ago)
A repro is provided in a comment on Playwright issue #6319 [1]. Does anyone know what utility was used to plot the memory graph there? [1] https://github.com/microsoft/playwright/issues/6319#issuecomment-917705023
yelping-magenta (3y ago)
You can set retireBrowserAfterPageCount [1] in the browserPoolOptions to close the browser and launch a new one after a given number of pages. Closing the browser may free the memory. [1] https://crawlee.dev/api/browser-pool/interface/BrowserPoolOptions#retireBrowserAfterPageCount
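(A minimal sketch of just that option; the target URL and the page count of 20 are placeholders:)

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Retire each browser after it has served 20 pages; a fresh browser
        // is launched in its place, releasing the old one's memory.
        retireBrowserAfterPageCount: 20,
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://example.com']);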
national-gold (OP, 3y ago)
Yes, this looks perfect for my use case. Thank you! How can I control the number of browsers at any given point in time? I would like one browser to run for 20 pages and shut down, then another browser to open, with no parallel browsing.
yelping-magenta (3y ago)
You can do this with these options:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        // A single concurrency slot means only one page runs at a time,
        // so there is never more than one browser working in parallel.
        minConcurrency: 1,
        maxConcurrency: 1,
        loggingIntervalSecs: null,
    },
    maxRequestRetries: 0,
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 20,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: false,
    browserPoolOptions: {
        // Shut the browser down after 20 pages; the next page launches a new one.
        retireBrowserAfterPageCount: 20,
    },

    async requestHandler(context) {
        context.log.info(`GET Request #${this.handledRequestsCount}: ${context.request.url}`);
    },
});
yelping-magenta (3y ago)
The benchmark below confirms that closing the browser frees the memory.
/* eslint-disable max-len */

import {
    PlaywrightCrawler, // https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler
    Request,
} from 'crawlee';
// https://github.com/uuidjs/uuid
import { v4 as uuidv4 } from 'uuid';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 2,
        maxConcurrency: 10,
        loggingIntervalSecs: null,
    },
    maxRequestsPerCrawl: 1000,
    maxRequestRetries: 0,
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 20,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: true,
    browserPoolOptions: {
        // Retire each browser after only 5 pages so retirement happens often.
        retireBrowserAfterPageCount: 5,
    },

    async requestHandler(context) {
        context.log.info(`GET Request #${this.handledRequestsCount}: ${context.request.url}`);

        // Enqueue between 0 and 3 follow-up requests; each gets a fresh
        // uniqueKey so it is not deduplicated by the request queue.
        const numRequest = Math.floor(Math.random() * 4);
        for (let i = 0; i < numRequest; i++) {
            const uuid = uuidv4();
            // await crawler.addRequests([new Request({ url: `https://httpbin.org/delay/${numRequest}#${uuid}`, uniqueKey: uuid })]);
            await crawler.addRequests([new Request({ url: `http://127.0.0.1:5000/delay/${numRequest}#${uuid}`, uniqueKey: uuid })]);
        }
    },
});

(async () => {
    const uuid = uuidv4();
    // await crawler.run([new Request({ url: `https://httpbin.org/delay/0#${uuid}`, uniqueKey: uuid })]);
    await crawler.run([new Request({ url: `http://127.0.0.1:5000/delay/0#${uuid}`, uniqueKey: uuid })]);
})();
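(For anyone who wants to reproduce such a measurement: the thread does not say which tool plotted the memory graph, but one simple approach is to sample memory once per second alongside the benchmark and chart the resulting CSV. The sketch below samples the Node process's RSS; the browser's memory lives in separate Chromium child processes, so a full picture would need an OS-level tool such as ps. The memory.csv filename is illustrative.)

import { appendFileSync, writeFileSync } from 'node:fs';

// Write a CSV header, then record this process's resident set size every
// second while the crawl runs. Plot elapsed_ms vs rss_mb with any charting tool.
writeFileSync('memory.csv', 'elapsed_ms,rss_mb\n');
const start = Date.now();
setInterval(() => {
    const rssMb = process.memoryUsage().rss / (1024 * 1024);
    appendFileSync('memory.csv', `${Date.now() - start},${rssMb.toFixed(1)}\n`);
}, 1000).unref(); // unref() so the timer does not keep the process alive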