Managing a queue with Redis (or something similar) and having worker nodes listen on it
I'm trying to run Crawlee in production and scale it out: a cluster of worker nodes that stand ready to crawl pages as requests come in. How can I achieve this?
The RequestQueue basically writes requests to files and doesn't use any queueing system. I couldn't find documentation on how to use a Redis queue or something similar.
6 Replies
harsh-harlequin•5mo ago
I'm not aware of such a possibility. Actually, I don't think Crawlee's queues were intended for concurrent access; they're meant for keeping track of to-do/done jobs within a single execution, or across multiple subsequent executions. You should develop your own solution to manage and scale workers, or look at existing solutions such as Apify.
rising-crimsonOP•5mo ago
If I create a custom RequestQueue that uses Redis, then this should be possible, right?
Or can I use the Apify-managed queue and still run the crawler on my own infrastructure instead of in managed actors?
@Marco
harsh-harlequin•5mo ago
To the latter question, I'd say no: Apify does not provide on-premises solutions.
Regarding implementing a RequestQueue that uses Redis, I think it would be possible! You can take a look at the code here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L55
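To make the idea concrete, here is a rough sketch of what a Redis-backed queue shared by several workers could look like. It is only an illustration under stated assumptions, not Crawlee's implementation: `RedisLike`, `InMemoryRedis`, `RedisBackedQueue` and `runWorker` are hypothetical names, the method names only loosely echo Crawlee's RequestQueue, and the class does not extend it. The four commands are modeled on Redis's SADD, LPUSH, RPOPLPUSH and LREM; the in-memory stand-in exists only so the snippet runs without a Redis server, and a real client would need a small adapter.

```typescript
// Hypothetical sketch: these names are illustrative, not Crawlee or Redis client APIs.

// The subset of Redis commands the queue relies on.
interface RedisLike {
  sadd(key: string, member: string): Promise<number>; // 1 if newly added, 0 if present
  lpush(key: string, value: string): Promise<number>; // push to head, returns new length
  rpoplpush(src: string, dst: string): Promise<string | null>; // move tail of src to head of dst
  lrem(key: string, count: number, value: string): Promise<number>; // remove matches
}

// In-memory stand-in so the example runs without a server (assumption, not a real client).
class InMemoryRedis implements RedisLike {
  private sets = new Map<string, Set<string>>();
  private lists = new Map<string, string[]>();

  private list(key: string): string[] {
    let l = this.lists.get(key);
    if (!l) { l = []; this.lists.set(key, l); }
    return l;
  }
  async sadd(key: string, member: string): Promise<number> {
    let s = this.sets.get(key);
    if (!s) { s = new Set<string>(); this.sets.set(key, s); }
    if (s.has(member)) return 0;
    s.add(member);
    return 1;
  }
  async lpush(key: string, value: string): Promise<number> {
    return this.list(key).unshift(value);
  }
  async rpoplpush(src: string, dst: string): Promise<string | null> {
    const value = this.list(src).pop();
    if (value === undefined) return null;
    this.list(dst).unshift(value);
    return value;
  }
  // Simplified: only handles a positive count (remove up to `count` matches from the head).
  async lrem(key: string, count: number, value: string): Promise<number> {
    const l = this.list(key);
    let removed = 0;
    for (let i = 0; i < l.length && removed < count; ) {
      if (l[i] === value) { l.splice(i, 1); removed++; } else { i++; }
    }
    return removed;
  }
}

class RedisBackedQueue {
  private redis: RedisLike;
  private name: string;
  constructor(redis: RedisLike, name: string) {
    this.redis = redis;
    this.name = name;
  }
  // Enqueue a URL unless it was seen before; resolves to true if it was added.
  async addRequest(url: string): Promise<boolean> {
    const isNew = (await this.redis.sadd(`${this.name}:seen`, url)) === 1;
    if (isNew) await this.redis.lpush(`${this.name}:pending`, url);
    return isNew;
  }
  // Claim the oldest pending URL by moving it to the in-progress list.
  async fetchNextRequest(): Promise<string | null> {
    return this.redis.rpoplpush(`${this.name}:pending`, `${this.name}:inProgress`);
  }
  // Drop a finished URL from the in-progress list.
  async markRequestHandled(url: string): Promise<void> {
    await this.redis.lrem(`${this.name}:inProgress`, 1, url);
  }
}

// A worker drains the queue until nothing is pending and reports how many URLs it handled.
async function runWorker(
  queue: RedisBackedQueue,
  handle: (url: string) => Promise<void>,
): Promise<number> {
  let processed = 0;
  let url = await queue.fetchNextRequest();
  while (url !== null) {
    await handle(url);
    await queue.markRequestHandled(url);
    processed++;
    url = await queue.fetchNextRequest();
  }
  return processed;
}
```

Each worker node would run something like `runWorker` against the same queue name, with the crawl logic passed in as `handle`. The sketch skips things a production setup would likely need: the SADD and LPUSH steps are not atomic, URLs left in the in-progress list by a crashed worker are never requeued, and a real worker would wait for new work instead of exiting when the queue is empty.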
rising-crimsonOP•5mo ago
Okay, I will check it out. I guess extending the RequestQueue with Redis would do the trick for me.