system design of concurrent crawlers

i have multiple crawlers (primarily Playwright), one per site, and each works completely fine when i run only one crawler per site. i have tried running these crawlers concurrently through a scrape event emitted from the server, which emits an individual scrape event for each site to run its crawler. when i do, i face a lot of memory overloads, timed-out navigations, many skipped products, and crawlers ending early. each crawler essentially takes base URLs, scrapes them to get product URLs, and then scrapes each product URL individually to get the product page info.
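For reference, a minimal sketch of that two-stage pattern in Crawlee, assuming a `DETAIL` label to tell product pages apart from listing pages (the selector and extracted fields are placeholders, not the poster's actual code):

```ts
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        if (request.label === 'DETAIL') {
            // product page: extract fields and store the result
            await Dataset.pushData({
                url: request.loadedUrl,
                title: await page.title(),
            });
            return;
        }
        // listing (base) page: enqueue each product URL to be scraped individually
        await enqueueLinks({
            selector: 'a.product-link', // placeholder selector
            label: 'DETAIL',
        });
    },
});

// base URLs for one site
await crawler.run(['https://www.tentree.com/collections/womens']);
```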
probable-pink (OP) · 2y ago
```
WARN PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 6177 MB of 4017 MB (154%). Consider increasing available memory.
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. This crawler instance is already running, you can add more requests to it via `crawler.addRequests()`.
{"id":"vI4UdrhFP5NVjsV","url":"https://www.tentree.com/collections/kids?page=10","retryCount":1}
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.05},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.tentree.com/collections/womens?page=50", waiting until "networkidle"
============================================================
{"id":"2QcFPmYcDLgjTat","url":"https://www.tentree.com/collections/womens?page=50","retryCount":2}
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"qgjcGJ50OKucpO6","url":"https://www.kleankanteen.com/collections/all/products/party-kit","retryCount":1}
```
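The first warning says the process is using 6177 MB against a 4017 MB limit, so Crawlee's autoscaler keeps throttling and reclaiming requests. A minimal sketch of tightening that per crawler, assuming Crawlee's global `Configuration` and the standard crawler options (the numbers are placeholders to tune, not recommendations):

```ts
import { PlaywrightCrawler, Configuration } from 'crawlee';

// set the memory budget Crawlee believes it can use;
// the CRAWLEE_MEMORY_MBYTES env var is the equivalent knob
Configuration.getGlobalConfig().set('memoryMbytes', 4096);

const crawler = new PlaywrightCrawler({
    maxConcurrency: 2,          // fewer parallel pages per crawler
    navigationTimeoutSecs: 120, // headroom over the 60 s seen in the logs
    maxRequestRetries: 3,       // how often reclaimed requests are retried
    async requestHandler({ page }) {
        // per-site scraping logic
    },
});
```

The "already running" warning also suggests a scrape event is calling `run()` on a crawler that has not finished yet; if the same instance is reused across events, feeding later events through `crawler.addRequests()` (as the log itself suggests) avoids that.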
how do i design the scrapers and run them: one by one, certain stages one by one, or with multiple managed at once? and is that doable in the first place? should i run them in separate terminals, or how else should i run them? and finally, how would i run each one as an Actor on Apify: what plan would i choose, how would i manage memory, storage, CUs, etc., and should i run them under the same Actor instance or on individual Actor instances? how would you recommend i effectively manage each site crawler?
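For the "one by one" option specifically, a minimal sketch, where `Site` and `buildCrawler` are hypothetical stand-ins for however each per-site crawler is currently constructed:

```ts
import { PlaywrightCrawler } from 'crawlee';

interface Site {
    name: string;
    startUrls: string[];
}

// stand-in for the existing per-site crawler setup
function buildCrawler(site: Site): PlaywrightCrawler {
    return new PlaywrightCrawler({
        async requestHandler({ page, enqueueLinks }) {
            // per-site scraping logic
        },
    });
}

// run sites strictly one after another: awaiting run() means only one
// browser pool is alive at a time, so crawlers never compete for memory
async function runSequentially(sites: Site[]) {
    for (const site of sites) {
        await buildCrawler(site).run(site.startUrls);
    }
}
```

Running each site as its own process (one terminal or child process per site) with an explicit per-process memory cap would be another way to get the same isolation.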
