External request queue + external result storage, Crawlee as daemon process - how to implement it?
Hi all,
I would like to run Crawlee (actually PlaywrightCrawler) all the time, even when there are no requests in the request queue. (Crawlee will run on a small Ubuntu box in a datacenter; I can handle all the DevOps work needed for this.)
The requests/URLs should come from an external message queue (running outside of the Node.js process). A Node.js API to read from the external message queue already exists.
The scraping results should be stored in the same external message queue.
In this configuration Crawlee is controlled by the external message queue that provides the URLs to scrape,
so there is no breadth/depth crawling - no crawling at all - just scrape the provided URL and
return the result.
After reading this forum and running some experiments, my plan is as follows (a rough sketch of items 1-3 follows the list):
1. I should customise the AutoscaledPool (can I subclass the existing implementation?). Its isFinishedFunction() should return false, so the crawler will run as a daemon even when there are no messages in the Crawlee request queue.
2. Somewhere (where???) I should poll the external message queue, get the URL and call crawler.addRequests().
3. Somewhere at the end of requestHandler() - instead of Dataset.pushData() - I should write the results back into the external message queue.
4. Maybe there are some other hidden problems? It would be great to know about them in advance )))
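Here is a rough sketch of how items 1-3 could fit together. It assumes a hypothetical `externalQueue` wrapper with `getUrl()` and `putResult()` methods (the beanstalkd client would sit behind it), and it assumes the crawler honours a user-supplied `isFinishedFunction` passed via `autoscaledPoolOptions` - newer Crawlee versions also offer a `keepAlive` crawler option for exactly this purpose, which may be the simpler route.
```ts
import { PlaywrightCrawler } from 'crawlee';
// Hypothetical wrapper around the external message queue (beanstalkd in my case):
// getUrl() waits for the next URL job, putResult() writes the scraped page back.
import { externalQueue } from './external-queue.js';

const crawler = new PlaywrightCrawler({
    // Item 1: try to keep the pool alive even when the internal request queue is empty.
    // Assumption: a user-supplied isFinishedFunction is honoured here; if your Crawlee
    // version has the `keepAlive: true` crawler option, that is the supported way.
    autoscaledPoolOptions: {
        isFinishedFunction: async () => false,
    },
    // Item 3: instead of Dataset.pushData(), write the result to the external queue.
    requestHandler: async ({ request, page }) => {
        const html = await page.content();
        await externalQueue.putResult({ url: request.url, html });
    },
});

// Item 2: poll the external queue and feed URLs into the running crawler.
async function pollExternalQueue(): Promise<never> {
    for (;;) {
        const url = await externalQueue.getUrl(); // waits until a job is available
        await crawler.addRequests([url]);
    }
}

// Start both; crawler.run() should not resolve while isFinishedFunction returns false.
await Promise.all([crawler.run(), pollExternalQueue()]);
```
The polling loop and the crawler run concurrently; the only coupling between them is crawler.addRequests().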
P.S. This is my first attempt to write JS/TS code - I have a Java background,
so I might ask one or two strange JS-related questions, be prepared )))
P.P.S. It seems that what I want to do is more or less similar to this:
https://discord.com/channels/801163717915574323/1024728065651249222/1025363339041308732
6 Replies
foreign-sapphireOP•3y ago
Actually this is what I would like to implement:
https://github.com/apify/crawlee/issues/446
It could be a good idea to create a documentation page: "Crawlee with an external message queue: how-to".
correct-apricot•3y ago
Which external message queue do you use? (Redis, other...)
foreign-sapphireOP•3y ago
It is beanstalkd (see https://beanstalkd.github.io/).
beanstalkd is very simple; I use only two methods:
1. get from the queue (the URL)
2. put into another queue (the URL + HTML of the page)
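For illustration only, those two operations map to a wrapper interface roughly like the one below; the names are made up, and the actual beanstalkd client calls are left out.
```ts
// Hypothetical wrapper around beanstalkd used by the scraper (names are illustrative).
export interface ScraperQueue {
    /** Reserve the next job from the "URL" tube and return the URL to scrape. */
    getUrl(): Promise<string>;
    /** Put the scraped page into the "result" tube. */
    putResult(result: { url: string; html: string }): Promise<void>;
}
```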
correct-apricot•3y ago
Happy New Year 2023. 🍾
@new_in_town A gift for this new year: see this POC using Beanstalkd & node-beanstalk.
https://github.com/apify/crawlee/issues/446#issuecomment-1368405706
foreign-sapphireOP•3y ago
well, well, I'm not alone using Beanstalkd + Crawlee )))
@LeMoussel - thanks for sharing, I'm learning from your code )))
A few words about the design of such systems (where the message queue is the "glue" connecting different components/programs):
1. Usually in such systems there is more than one program/process. In my case:
one program writes requests/URLs into the "URL" queue and reads results (the HTML content of the pages) from the "result" queue (this program is in Java);
another program - the Crawlee scraper - reads from the URL queue and writes results into the result queue.
2. It is a good idea to create a special "error" queue where all the failing requests are stored. Later, a human can take a look at this error queue and find out why these requests are failing.
By the way, if the requests/messages ("jobs" in beanstalkd jargon) in the error queue and in the URL queue share the same format, the "retry" operation is very easy to implement - you just move messages from the error queue back into the URL queue.
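A sketch of how such an error queue could hook into Crawlee is below, assuming the same hypothetical wrapper gains a `putError()` method; the failedRequestHandler signature shown here (context, with the error as the second argument) may differ slightly between Crawlee versions.
```ts
import { PlaywrightCrawler } from 'crawlee';
import { externalQueue } from './external-queue.js'; // hypothetical wrapper, see the sketch above

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        await externalQueue.putResult({ url: request.url, html: await page.content() });
    },
    // Runs once a request has exhausted its retries inside Crawlee.
    // Writing the plain URL keeps the "error" queue in the same format as the "URL" queue,
    // so a later retry is just moving the job back.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Scraping failed for ${request.url}: ${error.message}`);
        await externalQueue.putError(request.url); // putError() is an assumed wrapper method
    },
});
```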