Approach to storing scraped data in a database (Postgres)
(Apologies for the cross-link: https://github.com/apify/crawlee/discussions/1577)
Hi, I recently discovered Crawlee and I'm trying to figure out how I can store the scraped data in a database instead of in the local directory storage.
Is there any plugin for that? How should I proceed to implement one? Should I write my own class that implements the StorageClient interface? If so, how do I inject it later so it gets used?
Thanks!
dependent-tan•3y ago
You need to implement your own logic.
Instead of Dataset.pushData(), just call an insert on your DB.
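A minimal sketch of that suggestion, assuming CheerioCrawler and the `pg` client (the `items` table and the DATABASE_URL env var are made up for illustration):
```ts
import { CheerioCrawler } from 'crawlee';
import pg from 'pg';

// Hypothetical target table: CREATE TABLE items (url TEXT, title TEXT);
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Insert directly instead of calling Dataset.pushData()
        await pool.query(
            'INSERT INTO items (url, title) VALUES ($1, $2)',
            [request.loadedUrl, $('title').text()],
        );
    },
});

await crawler.run(['https://crawlee.dev']);
await pool.end();
```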
unwilling-turquoiseOP•3y ago
Isn't it good practice, or wouldn't there be some benefit, to implement StorageClient?
stormy-gold•3y ago
If you want your crawler to be practical and performant, I wouldn't recommend pushing into a Dataset, then into your PostgreSQL database. At that point, the Dataset would just be an unnecessary middle man.
The only way that'd be beneficial is if you'd like to validate the data with some custom scripts before actually pushing it into the production DB. Otherwise, just push directly into your DB.
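If you do want that validation step, a rough sketch of the Dataset-as-staging pattern (the validation rule and table are made up; the crawl itself stages items with pushData() as usual):
```ts
import { Dataset } from 'crawlee';
import pg from 'pg';

// After the crawl: read the staged items back, validate, then load into Postgres.
const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Hypothetical validation rule: keep only items with a non-empty title.
const valid = items.filter((item) => typeof item.title === 'string' && item.title.length > 0);

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
for (const item of valid) {
    await pool.query('INSERT INTO items (url, title) VALUES ($1, $2)', [item.url, item.title]);
}
await pool.end();
```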
unwilling-turquoiseOP•3y ago
Thanks Matt. I mean implementing a custom StorageClient, so that when you write Dataset.pushData(), the data is actually stored in Postgres instead of the local filesystem.
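Nobody in the thread posted a full StorageClient, but one hedged sketch of the idea is to start from MemoryStorage (from @crawlee/memory-storage, which already implements the interface) and redirect dataset writes to Postgres; the method names and wiring should be checked against the StorageClient/DatasetClient types in @crawlee/types for your Crawlee version:
```ts
import { MemoryStorage } from '@crawlee/memory-storage';
import { CheerioCrawler, Configuration } from 'crawlee';
import pg from 'pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Sketch only: override the dataset client so pushItems() also writes to a
// hypothetical Postgres 'items' table. Everything else falls through to
// MemoryStorage's implementation.
class PostgresStorage extends MemoryStorage {
    override dataset(id: string) {
        const client = super.dataset(id);
        const originalPush = client.pushItems.bind(client);
        client.pushItems = async (items: any) => {
            const records = typeof items === 'string' ? JSON.parse(items) : items;
            for (const record of [records].flat()) {
                await pool.query('INSERT INTO items (data) VALUES ($1)', [record]);
            }
            return originalPush(items); // keep the local copy too; drop if undesired
        };
        return client;
    }
}

const config = new Configuration({ storageClient: new PostgresStorage() });
const crawler = new CheerioCrawler({ /* ...requestHandler, etc. */ }, config);
```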
automatic-azure•3y ago
It's not a common case, so it's not covered by the SDK; imho just use an external package like https://github.com/supabase/supabase
judicial-coral•3y ago
Yeah, I'm actually using a graph database to store crawl results, and it performs very well — the only hitch has been making sure that my logic for what constitutes a "unique item" etc. meshes with Crawlee's
stormy-gold•3y ago
At that point, I'd recommend just using Sequelize to connect to your remote database and push data into it. Sequelize is (in my opinion) the best ORM.
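For illustration, a minimal Sequelize version of the direct-insert approach (the model, table, and connection string are assumptions; Sequelize needs the pg driver installed for Postgres):
```ts
import { Sequelize, DataTypes } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL!); // postgres://user:pass@host/db

// Hypothetical model for scraped items.
const Item = sequelize.define('Item', {
    url: { type: DataTypes.STRING, allowNull: false },
    title: DataTypes.TEXT,
});

await sequelize.sync(); // create the table if it doesn't exist

// Inside a request handler, instead of Dataset.pushData():
await Item.create({ url: 'https://crawlee.dev', title: 'Crawlee' });
```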
national-gold•11mo ago
Hi all, I'm looking to push straight to Postgres. Wondering if anyone would be willing to share their implementation?
@acanimal, sorry to ping, did you implement this?
sunny-green•6mo ago
Sorry to necro an older thread, but I'm looking at pushing data into Postgres as well.
Is the suggestion to skip Dataset.pushData() entirely and just save directly into the DB?
I haven't seen any examples of using Postgres (or any database, for that matter).
stormy-gold•6mo ago
This is something I am wanting to do as well
fascinating-indigo•6mo ago
I use Supabase as my Postgres platform and simply await an insert into my table within the request handler
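A minimal sketch of that pattern with supabase-js (the 'items' table and the env vars are assumptions):
```ts
import { createClient } from '@supabase/supabase-js';
import { CheerioCrawler } from 'crawlee';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Insert into a hypothetical 'items' table instead of Dataset.pushData().
        const { error } = await supabase
            .from('items')
            .insert({ url: request.loadedUrl, title: $('title').text() });
        if (error) throw error; // let Crawlee retry the request
    },
});

await crawler.run(['https://crawlee.dev']);
```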
metropolitan-bronze•6mo ago
I recently implemented a custom storage client to store request queues in Postgres, as the storage costs for request queues on Apify were too high. It reduced my costs from 500 USD per month to 25 (the 25 is for the managed Postgres service).
The same approach can also be extended to store datasets. I only did it for request queues; for datasets and key-value stores, the custom client still uses the Apify storage.
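The poster didn't share code, but the backing table for such a request-queue client might look roughly like this (the schema is entirely a guess; a real RequestQueueClient also needs logic for adding, fetching, and marking requests as handled):
```ts
import pg from 'pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// One guess at a minimal schema: unique_key deduplicates requests,
// handled_at marks completion, json holds the serialized Request object.
await pool.query(`
    CREATE TABLE IF NOT EXISTS request_queue (
        id          SERIAL PRIMARY KEY,
        unique_key  TEXT UNIQUE NOT NULL,
        url         TEXT NOT NULL,
        json        JSONB NOT NULL,
        handled_at  TIMESTAMPTZ
    )
`);
```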