Addressing playwright memory limitations in crawlee
Hello,
I am currently using crawlee on a medium-sized project and I am generally happy with it. I am targeting e-commerce websites and I am interested in how various products are presented on the site, so I opted for a browser automation solution, to be able to "see" the page.
I am using playwright as the browser automation tool. Recently I noticed some of my scraping instances fail with the following error:
While handling this request, the container instance was found to be using too much memory and was terminated.
I did some digging around the web and I found the following:
https://stackoverflow.com/questions/72954376/python-playwright-memory-overlad
It seems that the memory used by a playwright context just grows over time. It is a known issue, but playwright itself will not handle it because it is primarily a web testing tool, not a scraping tool.
The mentioned solution is to save the state of the context on the disk, and restart the context every once in a while. I was wondering if crawlee has any out of the box functionality that applies this solution. If not, did anyone else encounter the problem? How did you fix it?
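For reference, the "save the state and restart the context" idea can be sketched with Playwright's `context.storageState({ path })` and `browser.newContext({ storageState })`. This is only a minimal sketch, not something crawlee provides: the helper name `crawlWithContextRecycling`, the default of 20 pages per context and the `state.json` path are all assumptions made for illustration. The small interfaces list just the Playwright methods the helper calls, so the snippet compiles on its own.

```typescript
// Structural types covering only the Playwright methods used below.
// A real Playwright Browser (e.g. from chromium.launch()) satisfies them.
interface PageLike {
  goto(url: string): Promise<unknown>;
  close(): Promise<void>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  storageState(options: { path: string }): Promise<unknown>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(options?: { storageState?: string }): Promise<ContextLike>;
}

// Hypothetical helper: visits the URLs one at a time and replaces the context
// every `pagesPerContext` pages, carrying cookies and localStorage over via a file.
// Returns the number of contexts that were created.
async function crawlWithContextRecycling(
  browser: BrowserLike,
  urls: string[],
  handlePage: (page: PageLike, url: string) => Promise<void>,
  pagesPerContext = 20, // assumed value, tune for your workload
  statePath = "state.json", // assumed location for the saved state
): Promise<number> {
  let context = await browser.newContext();
  let contextsCreated = 1;
  let pagesInContext = 0;
  try {
    for (const url of urls) {
      if (pagesInContext >= pagesPerContext) {
        // Persist the session state, drop the old context, restore into a new one.
        await context.storageState({ path: statePath });
        await context.close();
        context = await browser.newContext({ storageState: statePath });
        contextsCreated++;
        pagesInContext = 0;
      }
      const page = await context.newPage();
      try {
        await page.goto(url);
        await handlePage(page, url);
      } finally {
        await page.close();
      }
      pagesInContext++;
    }
  } finally {
    await context.close();
  }
  return contextsCreated;
}
```

With the real library this would be called as `crawlWithContextRecycling(await chromium.launch(), urls, handler)`. Whether recycling only the context is enough depends on where the memory is held; restarting the whole browser is the heavier alternative.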
5 Replies
yelping-magenta•3y ago
A repro is provided in a comment on Playwright issue #6319 [1].
[1] https://github.com/microsoft/playwright/issues/6319#issuecomment-917705023
Does anyone know what utility was used to plot this graph?

yelping-magenta•3y ago
You can set retireBrowserAfterPageCount [1] in the browserPoolOptions to close the browser and launch a new one. Maybe closing the browser will free the memory.
[1] https://crawlee.dev/api/browser-pool/interface/BrowserPoolOptions#retireBrowserAfterPageCount
national-gold OP•3y ago
Yes this looks perfect for my use case. Thank you!
How can I control the number of browsers running at any given time? I would like one browser to run for 20 pages, shut down, and then have another browser open, without any parallel browsing.
yelping-magenta•3y ago
You can do this with these options:
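The options from this reply are not shown in the thread. As a minimal sketch of one combination that should give the requested behaviour: `maxConcurrency: 1` on the crawler so only one page is processed at a time, plus `retireBrowserAfterPageCount` in `browserPoolOptions` so each browser is replaced after a fixed number of pages. The helper name `sequentialBrowserOptions` and the local interfaces are illustrative only, and the old browser may still be shutting down for a moment while the next one starts.

```typescript
// Local types mirroring only the Crawlee option names used in this sketch.
interface BrowserPoolSettings {
  retireBrowserAfterPageCount: number;
}
interface SequentialCrawlerOptions {
  maxConcurrency: number;
  browserPoolOptions: BrowserPoolSettings;
}

// Hypothetical helper: builds crawler options for "one browser at a time,
// replaced after `pagesPerBrowser` pages".
function sequentialBrowserOptions(pagesPerBrowser: number): SequentialCrawlerOptions {
  if (!Number.isInteger(pagesPerBrowser) || pagesPerBrowser < 1) {
    throw new RangeError("pagesPerBrowser must be a positive integer");
  }
  return {
    // Process a single request (and therefore a single page) at a time.
    maxConcurrency: 1,
    browserPoolOptions: {
      // Retire the browser after this many pages; the pool launches a new one.
      retireBrowserAfterPageCount: pagesPerBrowser,
    },
  };
}

// Usage with Crawlee (requires the `crawlee` and `playwright` packages):
//   import { PlaywrightCrawler } from "crawlee";
//   const crawler = new PlaywrightCrawler({
//     ...sequentialBrowserOptions(20),
//     requestHandler: async ({ page, request }) => { /* inspect the page */ },
//   });
//   await crawler.run(["https://example.com"]);
```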
yelping-magenta•3y ago
This benchmark confirms that stopping the browser frees up memory.
