Mixing Cheerio and Playwright in the same crawler
Hi, I need to crawl a website where only a certain type of page needs JS to be scraped.
So for speed and resource reasons I'm using Cheerio to scrape all the possible data and enqueue every link, including links to the pages requiring JS. After the Cheerio scrape ends, I launch a Playwright scrape, but how can I get Playwright to pick up the request queue from the first crawl and scrape data from a specific label?
6 Replies
useful-bronze•2y ago
Hey there! Theoretically you can: by default both crawlers would use the same request queue, but it's better not to mix them up. A better option would be to open a second (named) request queue, populate it in the first crawler, and use it with the second one.
deep-jadeOP•2y ago
Thanks for your answer. So, when enqueuing links that lead to the JS pages, should I pass an argument that will populate a specific request queue?
Could you show me a little snippet?
useful-bronze•2y ago
https://crawlee.dev/api/core/class/RequestQueue#open + https://crawlee.dev/api/core/class/RequestQueue#addRequests
If you use enqueueLinks, then add this requestQueue to the options: https://crawlee.dev/api/core/interface/EnqueueLinksOptions#requestQueue
deep-jadeOP•2y ago
Thanks, I'll check that. By the way, I don't know if it's normal, but after the Cheerio crawl I pass the default request queue to the Playwright crawler:
const requestQueue = await RequestQueue.open();
const playwright_crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: playwright_router,
    requestQueue,
});
but it just doesn't work:
PlaywrightCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed.
I'll try your suggested way though.
It's working with the named request queue.
useful-bronze•2y ago
I assume all the requests were already processed by the first crawler, so when you try to start the second crawler, its queue is basically empty.
deep-jadeOP•2y ago
That's what I thought indeed ^^
Thanks for your help!