Crawlee & Apify•3mo ago

Shared external queue between multiple crawlers

Hello folks! Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead "enqueue links" to another queue service such as Redis? I want to run multiple crawlers on a single website, and they need to share the same queue so they don't process duplicate links. Thanks in advance!

3 Replies

Hall•3mo ago

This post was marked as solved by mesca4046.

unwilling-turquoise•3mo ago

Hello! The request queue is managed by Crawlee itself, not by Cheerio or Playwright directly. What you could try is creating a custom RequestQueue that extends Crawlee's class: https://crawlee.dev/api/core/class/RequestQueue. Here is the source code: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L77. Then you could pass the custom queue to the (Cheerio/Playwright) crawler through its requestQueue option: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#requestQueue.
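To make the idea a bit more concrete, here is a rough sketch of the deduplication logic such a shared queue needs. It does not use Crawlee or Redis at all: `SharedQueueStore`, `QueuedRequest`, and `toUniqueKey` are hypothetical names made up for illustration, and the in-memory `Set` and array only stand in for state that would really live in an external service so that several crawler processes see the same data. A custom RequestQueue subclass would presumably delegate its add and fetch operations to something like this.

```typescript
// Hypothetical shape of a queued item; not Crawlee's own Request type.
interface QueuedRequest {
    url: string;
    uniqueKey: string;
}

// Hypothetical in-memory stand-in for a shared store such as Redis.
// In a real setup, `seen` and `pending` would live in the external service.
class SharedQueueStore {
    private readonly seen = new Set<string>();
    private readonly pending: QueuedRequest[] = [];

    // Returns true only the first time a given unique key is added.
    add(url: string): boolean {
        const uniqueKey = toUniqueKey(url);
        if (this.seen.has(uniqueKey)) return false;
        this.seen.add(uniqueKey);
        this.pending.push({ url, uniqueKey });
        return true;
    }

    // Hands out the oldest pending request, or null when nothing is left.
    next(): QueuedRequest | null {
        const request = this.pending.shift();
        return request === undefined ? null : request;
    }
}

// Assumption: two URLs that differ only in their #fragment are the same page.
function toUniqueKey(url: string): string {
    const hashIndex = url.indexOf('#');
    return hashIndex === -1 ? url : url.slice(0, hashIndex);
}

// Two crawlers discover the same page; only the first enqueue succeeds.
const store = new SharedQueueStore();
const addedByA = store.add('https://example.com/page#top');
const addedByB = store.add('https://example.com/page');
const first = store.next();
const second = store.next();
console.log({ addedByA, addedByB, first, second });
```

With a real external store, the "check and mark as seen" step has to be atomic across processes (for example, a single set-add command whose return value says whether the key was new); otherwise two crawlers can still enqueue the same link at the same moment.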
