Suggestions to integrate Crawlee in a new cloud platform
TL;DR: I'm a developer working on Estela, a cloud web scraping platform. We want to add support for Crawlee to expand our technology options. We use Kafka to store requests, stats, logs, and items. We're exploring different solutions, such as middlewares, hooks, or a custom crawler, to make the integration work smoothly. Our goal is to require minimal code changes from users migrating their spiders from Crawlee to Crawlee + Estela. Any technical advice would be much appreciated.
-----
Hello! I'm a developer working on Estela (https://estela.bitmaker.la/docs/), a platform for web scraping in the cloud. We currently support Scrapy and Requests, and we are now focused on adding Crawlee support to the platform.
Our system relies on Kafka for queueing requests, stats, logs, and items. To update job statuses (WAITING, RUNNING, COMPLETED, etc.), we use an API endpoint. Now, we're facing some challenges in implementing a wrapper to run Crawlee within Estela.
To store relevant information in Kafka and make calls to the API, we considered a few solutions:
Middlewares: While it's possible to run middlewares in Crawlee, they don't match Scrapy's middlewares, which perfectly suit our needs in Estela. It seems Crawlee's middlewares only run before the request.
Hooks: These seem like an ideal solution, but there's limited documentation on how to use them with Crawlee crawlers. We found some information in the docs and in migrations.md, but it's unclear whether it applies to Crawlee.
Custom Crawler: Developing a custom crawler would be an extensive maintenance task and is not favored by our team.
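To make the hooks option more concrete, here is a rough sketch of what we have in mind. It assumes the crawler accepts `preNavigationHooks` and `postNavigationHooks` arrays in its options, as Crawlee's HTTP and browser crawlers do. Everything else (`EventSink`, `MemorySink`, `withEstelaHooks`, the `requests` topic, and the payload shape) is a hypothetical name made up for illustration, and the in-memory sink only stands in for a real Kafka producer.

```typescript
// Hypothetical sink interface; in Estela this would wrap a Kafka producer.
interface EventSink {
  send(topic: string, payload: Record<string, unknown>): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Kafka broker.
class MemorySink implements EventSink {
  readonly events: Array<{ topic: string; payload: Record<string, unknown> }> = [];

  async send(topic: string, payload: Record<string, unknown>): Promise<void> {
    this.events.push({ topic, payload });
  }
}

// Minimal shape of what a hook receives; real crawling contexts carry much more.
interface HookContext {
  request: { url: string };
}

type Hook = (ctx: HookContext) => Promise<void>;

// Only the options this sketch touches. The two names mirror Crawlee's options.
interface HookableOptions {
  preNavigationHooks?: Hook[];
  postNavigationHooks?: Hook[];
}

// Returns a copy of the user's options with reporting hooks appended.
// The user's own hooks are kept and run first; the input object is not mutated.
function withEstelaHooks<T extends HookableOptions>(options: T, sink: EventSink): T {
  const reportStart: Hook = async ({ request }) => {
    await sink.send("requests", { url: request.url, stage: "start" });
  };
  const reportDone: Hook = async ({ request }) => {
    await sink.send("requests", { url: request.url, stage: "done" });
  };

  return {
    ...options,
    preNavigationHooks: [...(options.preNavigationHooks ?? []), reportStart],
    postNavigationHooks: [...(options.postNavigationHooks ?? []), reportDone],
  };
}
```

With something like this, the only change in a user's spider would be wrapping the options object before passing it to the crawler constructor, e.g. `withEstelaHooks({ ...userOptions }, sink)`. Whether these hooks expose enough information for stats and items is one of the things we are unsure about.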
Another important consideration is how much code modification a user would need to adapt their existing Crawlee spider for use with Crawlee + Estela. Ideally, we want the migration process to be seamless without requiring additional code.
Any technical advice or insights on these matters would be greatly appreciated. Thank you for your time!
1 Reply
Hello @Jq, can you describe a little bit more about the use cases that you are trying to solve inside the middlewares?