Mix Cheerio and Playwright in the same crawler

Hi, I need to crawl a website where only a certain type of page needs JS to be scraped. For speed and resource reasons, I'm using Cheerio to scrape all the data I can and to enqueue every link, including links to the pages that require JS. After the Cheerio scrape ends, I launch a Playwright scrape, but how can I get Playwright to pick up the request queue from the first crawl and scrape data only from requests with a specific label?
6 Replies
useful-bronze, 2y ago
Hey there! Theoretically you can: by default, both crawlers use the same request queue, but it's better not to mix them. A better option is to open a second (named) request queue, populate it in the first crawler, and use it with the second one.
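A minimal sketch of that setup, assuming Crawlee v3 with `crawlee` and `playwright` installed. The queue name `js-pages`, the label `JS_PAGE`, the start URL, and the `/app/` rule in `needsJs` are all made up for illustration, so swap in whatever identifies your JS pages:

```javascript
// Sketch only, not a definitive implementation. Assumes Crawlee v3.
// 'js-pages', 'JS_PAGE', the '/app/' rule and the start URL are hypothetical.

// Hypothetical rule that decides which pages need a real browser.
function needsJs(url) {
    return new URL(url).pathname.startsWith('/app/');
}

async function main() {
    // Imported here so the helper above can be used without Crawlee installed.
    const { CheerioCrawler, PlaywrightCrawler, RequestQueue } = await import('crawlee');

    // Named queue that only the Playwright crawler will consume.
    const jsQueue = await RequestQueue.open('js-pages');

    const cheerioCrawler = new CheerioCrawler({
        async requestHandler({ $, enqueueLinks }) {
            // ... scrape the static data with $ here ...

            // Static pages stay in this crawler's own (default) queue.
            await enqueueLinks({
                transformRequestFunction: (req) => (needsJs(req.url) ? false : req),
            });

            // JS pages go to the named queue, tagged with a label.
            await enqueueLinks({
                requestQueue: jsQueue,
                label: 'JS_PAGE',
                transformRequestFunction: (req) => (needsJs(req.url) ? req : false),
            });
        },
    });
    await cheerioCrawler.run(['https://example.com']);

    // The second crawler reads only from the named queue.
    const playwrightCrawler = new PlaywrightCrawler({
        requestQueue: jsQueue,
        async requestHandler({ request, page }) {
            if (request.label === 'JS_PAGE') {
                // ... scrape the rendered page here ...
            }
        },
    });
    await playwrightCrawler.run();
}
```

Call `main()` to run both crawls back to back. Because the Cheerio crawler never handles the requests in `js-pages`, they are still pending when the Playwright crawler starts.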
deep-jade (OP), 2y ago
Thanks for your answer. So, when enqueuing links that lead to the JS pages, should I pass an argument that will populate a specific request queue? Could you show me a little snippet?
deep-jade (OP), 2y ago
Thanks, I'll check that. By the way, I don't know if it's normal, but after the Cheerio crawl I pass the default request queue to the Playwright crawler:

    const requestQueue = await RequestQueue.open();
    const playwright_crawler = new PlaywrightCrawler({
        // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
        requestHandler: playwright_router,
        requestQueue,
    });

but it just doesn't work:

    PlaywrightCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed.

I'll try your suggested way though. It's working with the named request queue.
useful-bronze, 2y ago
I assume all requests had already been processed by the first crawler, so when you start the second crawler, the queue is effectively empty.
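One way to check this yourself is to look at the queue's counters before starting the second crawler. This is a rough sketch assuming Crawlee v3, where `RequestQueue.getInfo()` reports total, handled and pending counts; `summarize` and `inspectDefaultQueue` are hypothetical helper names:

```javascript
// Sketch only. Assumes Crawlee v3; helper names are made up for illustration.

// Turn the queue counters into a short human-readable status line.
function summarize(info) {
    const { totalRequestCount, handledRequestCount, pendingRequestCount } = info;
    return pendingRequestCount === 0
        ? `nothing left to crawl (${handledRequestCount}/${totalRequestCount} already handled)`
        : `${pendingRequestCount} of ${totalRequestCount} requests still pending`;
}

async function inspectDefaultQueue() {
    const { RequestQueue } = await import('crawlee');
    const queue = await RequestQueue.open(); // no name: the default queue
    const info = await queue.getInfo();
    if (info) console.log(summarize(info));
}
```

If the pending count is zero after the Cheerio run, a crawler started on that same queue has nothing to do and finishes with 0 requests.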
deep-jade (OP), 2y ago
That's what I thought indeed ^^ Thanks for your help.