Error handling w/ Playwright
I've been running into the same error pattern with scrapers for completely different sites. I thought it was a problem with one individual site, but it's clearly a repeating pattern.
My scraper collects results-page URLs and then product URLs without any problem, but when it goes through those product URLs with a Playwright crawler, it always scrapes around 30-40 URLs successfully, then suddenly hits some crash error and randomly re-scrapes a couple of old product URLs before crashing.
7 Replies
wise-whiteOP•2y ago
Here's the actual scraping logic - each site follows mostly the same pattern, with different tags and slightly different logic for the descriptions and shipping info:
Do I need to change the timeout setting?
And how do I handle errors when one or more of the data elements aren't found, without crashing the crawler, while still scraping the other available info?
stormy-gold•2y ago
@harish To resolve the first issue, I would recommend you run each crawler separately: first run the Puppeteer crawler, and once you have collected all of the product URLs, run the Cheerio crawler.
If you don't find some data elements and want to continue, you don't have to do anything; the crawler should automatically mark the request as done.
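A minimal sketch of what that can look like inside a Crawlee PlaywrightCrawler request handler (the selectors, field names, and timeout values are placeholders, not taken from this thread): missing elements simply become null instead of throwing, and the navigation/handler timeouts can be raised via the crawler options if pages are slow.
```ts
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // raise these only if product pages are genuinely slow to load or render
    navigationTimeoutSecs: 60,
    requestHandlerTimeoutSecs: 120,
    async requestHandler({ page, request, log }) {
        // return null instead of throwing when an element is missing
        const textOrNull = (selector: string) =>
            page.locator(selector).first().textContent({ timeout: 5_000 }).catch(() => null);

        const title = await textOrNull('h1.product-title');
        const price = await textOrNull('.price');
        const description = await textOrNull('#description');

        if (!title) log.warning(`no title found on ${request.loadedUrl}`);

        // push whatever was found; missing fields are just null
        await Dataset.pushData({ url: request.loadedUrl, title, price, description });
    },
});

await crawler.run(['https://example.com/product/123']);
```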
wise-whiteOP•2y ago
I've used an all-Playwright crawler before and it doesn't run into these issues. It seems to occur when the AutoscaledPool scales up, right after this message:
INFO Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5894,"requestsFinishedPerMinute":37,"requestsFailedPerMinute":0,"requestTotalDurationMillis":218083,"requestsTotal":37,"crawlerRuntimeMillis":60049,"retryHistogram":[37]}
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.085},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
How can I make sure the Cheerio crawling is independent of the Puppeteer crawling? They already use different request queues, crawlers, and routers.
And how can I make sure the Cheerio crawler is stopped after there are no more links to crawl, so it doesn't cause any of these errors?
I also tested with two Playwright crawlers and it doesn't work either, so it's an issue with two crawlers colliding with each other when the AutoscaledPool scales up.
stormy-gold•2y ago
@harish What you can do is create a named queue (or a requests array) for each crawler and keep pushing the requests into it. Once the Playwright crawler finishes, you can create the Cheerio crawler and pass it the queue/requests array, and that should solve the issue.
Note that if you want to use queues, you need to create separate named queues for both crawlers.
Something like this:
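Roughly along these lines, assuming Crawlee's named RequestQueue and the enqueueLinks helper (the selectors, queue name, and URLs below are only placeholders):
```ts
import { PlaywrightCrawler, CheerioCrawler, RequestQueue, Dataset } from 'crawlee';

// a named queue, independent of either crawler's own default queue
const productQueue = await RequestQueue.open('product-urls');

const listingCrawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        // push the product links from the results page into the named queue
        await enqueueLinks({ selector: 'a.product-link', requestQueue: productQueue });
    },
});

const productCrawler = new CheerioCrawler({
    requestQueue: productQueue, // consumes the queue the first crawler filled
    async requestHandler({ $, request }) {
        await Dataset.pushData({
            url: request.loadedUrl,
            title: $('h1.product-title').text().trim() || null,
        });
    },
});

// run them strictly one after the other; each run() resolves once its queue is drained
await listingCrawler.run(['https://example.com/search?q=widgets']);
await productCrawler.run();
```
Starting the second crawler only after the first run() resolves also means only one AutoscaledPool is ever active at a time.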
wise-whiteOP•2y ago
I already use separate request queues - the URLs scraped by the Cheerio crawler are passed to the Playwright crawler, which has its own separate request queue.
stormy-gold•2y ago
Okay, then you need to run each crawler separately.