Mixing Cheerio and Playwright in the same crawler
Hi, I need to crawl a website where only a certain type of page needs JS to be scraped.
So for speed and resource reasons I'm using Cheerio to scrape all the possible data and enqueue every link, including links to the pages requiring JS. After the Cheerio scrape ends, I launch a Playwright scrape, but how can I get Playwright to pick up the request queue from the first crawl and scrape data from a specific label?
6 Replies
useful-bronze•2y ago
Hey there! Theoretically you can: by default both crawlers would use the same request queue, but it's better not to mix them up. A better option would be to open a second (named) request queue, populate it in the first crawler, and use it with the second one.
deep-jadeOP•2y ago
Thanks for your answer. So, when enqueuing links that lead to the JS pages, should I pass an argument that will populate a specific request queue?
Could you show me a little snippet?
useful-bronze•2y ago
https://crawlee.dev/api/core/class/RequestQueue#open + https://crawlee.dev/api/core/class/RequestQueue#addRequests
If you use enqueueLinks, then add this requestQueue to the options: https://crawlee.dev/api/core/interface/EnqueueLinksOptions#requestQueue
deep-jadeOP•2y ago
Thanks, I'll check that. By the way, I don't know if it's normal, but after the Cheerio crawl I pass the default request queue to the Playwright crawler:
const requestQueue = await RequestQueue.open();
const playwright_crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: playwright_router,
    requestQueue,
});
but it just doesn't work:
PlaywrightCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed.
I'll try your suggested way though.
It's working with the named request queue.
useful-bronze•2y ago
I assume all the requests were already processed by the first crawler, so when you try to start the second crawler, its queue is basically empty.
deep-jadeOP•2y ago
That's what I thought indeed ^^
Thanks for your help!