Deploying Crawlee on Self-hosted Servers
Hello, I'm quite new to Crawlee and to scraping with JavaScript in general. I have experience using Python for medium-scale scraping (multiple Playwright browsers orchestrated by Airflow). Is there an analogue for this in the JS/Node ecosystem, or a more Node-ish way of orchestrating multiple crawlers on a self-hosted server?
For now, the only option I can think of is wrapping the script in an Express.js app and hitting it with an API call periodically (from cron/Airflow). Is there a better way of doing this?
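A minimal sketch of that idea, assuming Crawlee's `CheerioCrawler` behind Express; the route, port, and start URL are placeholders rather than anyone's actual setup:

```js
// index.js — assumes `npm install crawlee express`
import express from 'express';
import { CheerioCrawler } from 'crawlee';

const app = express();

// Naive version: build and run a fresh crawler on every call.
app.post('/run', async (req, res) => {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $, log }) {
            log.info(`Scraped ${request.url}: ${$('title').text()}`);
        },
    });

    // run() resolves once the request queue is drained and the crawler finishes.
    await crawler.run(['https://example.com']);
    res.json({ status: 'finished' });
});

app.listen(3000, () => console.log('Scraper API listening on :3000'));
```

Something like `*/30 * * * * curl -X POST http://localhost:3000/run` in a crontab (or an Airflow HTTP task) would cover the "hit it periodically" part. Note that repeated runs against the same default request queue can deduplicate URLs that were already handled, which is essentially the issue discussed later in this thread.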
* I've tried searching the forum for scaling/deployment, but I haven't found anything I could understand and implement.
5 Replies
extended-salmon•3y ago
Yeah, there are no resources out there about self-hosting (likely since it goes against how Apify makes money) -- but what are you trying to achieve? Crawlers working together, or draining a shared queue?
passive-yellow•3y ago
I did the same: encapsulated the crawler inside an Express app. But I haven't tested its scalability.
sensitive-blue•3y ago
I'm trying to do exactly that right now. Have you figured out how to use cron with Crawlee? I have a CheerioCrawler that I want to run every 30 minutes, and I use node-cron for that. When the next job fires, it doesn't start the crawler again; it only logs that the crawler has already finished. I've made posts on this forum but haven't found a fix :/
If anyone knows, it would mean a lot. I'm stuck! I'll put a link to the CodeSandbox of the 'issue' below.
@curioussoul Were you able to call the API of your crawler periodically?
https://codesandbox.io/p/sandbox/crazy-stitch-p759zo?file=%2Findex.js&selection=%5B%7B%22endColumn%22%3A33%2C%22endLineNumber%22%3A8%2C%22startColumn%22%3A33%2C%22startLineNumber%22%3A8%7D%5D
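For reference, a minimal sketch of the pattern described above (node-cron firing a single `CheerioCrawler` every 30 minutes); the handler and URL are placeholders:

```js
// assumes `npm install crawlee node-cron`
import cron from 'node-cron';
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`${request.url} -> ${$('title').text()}`);
    },
});

// Fires at minute 0 and 30 of every hour.
cron.schedule('*/30 * * * *', async () => {
    // The first tick crawls fine; on later ticks the crawler only
    // reports that it has already finished (the behaviour described above).
    await crawler.run(['https://example.com']);
});
```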
passive-yellow•3y ago
Yes, I was able to call it periodically. I can give you an idea: do exactly what I did and make an API, then call that API via a cron job. Note that I start my crawler outside the API handler and only add requests to the queue on each API call. There is also a function in the crawler's constructor options that decides whether the crawler is finished when the request queue is empty; I return false from that function, so the crawler never dies but waits for new requests to be added.
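A rough sketch of that setup, based on the description above: the crawler is started once with an `isFinishedFunction` (passed via `autoscaledPoolOptions`) that always returns false so it idles instead of exiting, and the Express endpoint only feeds the queue. The route, port, and default URL are placeholder assumptions, not the poster's actual code:

```js
// assumes `npm install crawlee express`
import express from 'express';
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`Scraped ${request.url}: ${$('title').text()}`);
    },
    autoscaledPoolOptions: {
        // Never report "finished", so the crawler keeps idling on an
        // empty queue instead of shutting down.
        isFinishedFunction: async () => false,
    },
});

// Start the crawler once, outside the API handlers. run() will not
// resolve here because of the isFinishedFunction above.
crawler.run().catch((err) => console.error('Crawler crashed:', err));

const app = express();
app.use(express.json());

// Each call only pushes new URLs into the already-running crawler's queue.
app.post('/enqueue', async (req, res) => {
    const urls = req.body?.urls ?? ['https://example.com'];
    await crawler.addRequests(urls);
    res.json({ enqueued: urls.length });
});

app.listen(3000, () => console.log('Crawler API listening on :3000'));
```

With this shape, the cron job only has to POST to `/enqueue`, and the crawler process stays up between runs. Newer Crawlee releases also expose a `keepAlive` crawler option that, as I understand it, covers the same "never finish on an empty queue" case.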
sensitive-blue•3y ago
Great idea! Thank you so much, @curioussoul!