Shared external queue between multiple crawlers
Hello folks!
Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead "enqueue links" to another queue service such as Redis? I would like to run multiple crawlers on a single website, and they would need to share the same queue so they don't process duplicate links.
Thanks in advance!
3 Replies
This post was marked as solved by mesca4046.
unwilling-turquoise•3mo ago
Hello!
The request queue is managed by Crawlee, not by Cheerio or Playwright directly. What you could try is creating a custom RequestQueue that inherits from Crawlee's class: https://crawlee.dev/api/core/class/RequestQueue. Here is the source code: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L77.
Then, you could pass the custom queue to the (Cheerio/Playwright) crawler: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#requestQueue
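To make the idea a bit more concrete, here is a rough sketch of the deduplication logic a shared queue would need. It is not Crawlee code: SharedQueueStore, InMemoryStore, and SharedQueue are hypothetical names, and the in-memory store only stands in for Redis (the comments note which Redis commands I would expect to use instead). A real RequestQueue subclass would also have to track in-progress and handled requests, which this leaves out.

```typescript
// Hypothetical names throughout: none of these types come from Crawlee
// or from a Redis client. The store interface is what a Redis-backed
// implementation would have to provide.
interface SharedQueueStore {
    // Resolves to true only the first time a key is seen (Redis: SADD returns 1).
    markSeen(uniqueKey: string): Promise<boolean>;
    // Appends a URL to the pending list (Redis: RPUSH).
    push(url: string): Promise<void>;
    // Removes and returns the oldest pending URL, or null if empty (Redis: LPOP).
    pop(): Promise<string | null>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemoryStore implements SharedQueueStore {
    private seen = new Set<string>();
    private pending: string[] = [];

    async markSeen(uniqueKey: string): Promise<boolean> {
        if (this.seen.has(uniqueKey)) return false;
        this.seen.add(uniqueKey);
        return true;
    }

    async push(url: string): Promise<void> {
        this.pending.push(url);
    }

    async pop(): Promise<string | null> {
        const url = this.pending.shift();
        return url === undefined ? null : url;
    }
}

class SharedQueue {
    private store: SharedQueueStore;

    constructor(store: SharedQueueStore) {
        this.store = store;
    }

    // Simplistic normalisation assumed for this sketch:
    // drop the fragment and any trailing slashes.
    static uniqueKey(url: string): string {
        return url.split('#')[0].replace(/\/+$/, '');
    }

    // Enqueues the URL unless an equivalent one was already added by any crawler.
    async add(url: string): Promise<boolean> {
        const isNew = await this.store.markSeen(SharedQueue.uniqueKey(url));
        if (isNew) await this.store.push(url);
        return isNew;
    }

    async next(): Promise<string | null> {
        return this.store.pop();
    }
}

// Two "crawlers" sharing one store: the second add is rejected as a duplicate.
async function demo(): Promise<void> {
    const store = new InMemoryStore();
    const crawlerA = new SharedQueue(store);
    const crawlerB = new SharedQueue(store);
    console.log(await crawlerA.add('https://example.com/page#top')); // true
    console.log(await crawlerB.add('https://example.com/page/'));    // false
}

demo();
```

In a custom RequestQueue, I would expect logic like add() to sit behind addRequest and next() behind fetchNextRequest, but treat that mapping as an assumption and check it against the source linked above.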