Efficiently syncing large data sets from different websites
Hello everyone,
I’ve been reading the documentation and would love to get your thoughts on how to efficiently sync a large amount of data between our database and a third-party website using the Firecrawl API.
For example, the Parliament website publishes records of recent meetings between MPs. There are a few hundred thousand meetings that I’ve already crawled and stored in our database using a custom CSS crawler. Now that I’m migrating to Firecrawl, I don’t need to crawl all of them again—I just need to check the main list for recent meetings and sync any that we’ve missed.
I’d like to automate this process and run the crawl periodically—ideally once a day—to ensure our database stays up to date with the latest meetings.
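The scheduling side itself seems manageable; I'm imagining something like this minimal Python sketch, where `sync_new_meetings` is just a placeholder for the actual sync logic and the `schedule` library is one option among many (a plain cron job would work just as well):

```python
import time
import schedule

def sync_new_meetings():
    # Placeholder: crawl the meeting list via Firecrawl and store
    # any meetings we don't have yet (sketched further down).
    ...

# Run the sync once a day; a cron entry would do the same job.
schedule.every().day.at("02:00").do(sync_new_meetings)

while True:
    schedule.run_pending()
    time.sleep(60)
```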
How would you approach this problem?
I see that the API provides a skip parameter, but I doubt it’s practical to pass 100k URLs to it. There’s also no consistent URL pattern I can leverage.
1. My idea is to point Firecrawl only at the meeting list, not the whole site.
2. Assuming the list is paginated, I’d crawl through the pages until I hit the most recent meeting ID already stored in our database.
3. Then I’d extract and sync only the missing meetings (rough sketch below).
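In code, this is roughly what I have in mind. It's only a sketch: I'm assuming Firecrawl's v1 `/scrape` REST endpoint here, and the parliament URLs, the `extract_meeting_ids` parser, and the `meeting_exists`/`save_meeting` database helpers are all hypothetical placeholders for our own code:

```python
import requests

FIRECRAWL_API_KEY = "fc-..."  # our key
FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"
LIST_URL = "https://parliament.example.org/meetings?page={page}"      # hypothetical
MEETING_URL = "https://parliament.example.org/meetings/{meeting_id}"  # hypothetical

def scrape(url: str) -> dict:
    """Fetch one page through Firecrawl (assuming the v1 /scrape endpoint)."""
    resp = requests.post(
        FIRECRAWL_SCRAPE_URL,
        headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}"},
        json={"url": url, "formats": ["markdown", "links"]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["data"]

def extract_meeting_ids(list_page: dict) -> list[str]:
    """Hypothetical parser: pull meeting IDs out of the list page, newest first."""
    ...

def meeting_exists(meeting_id: str) -> bool:
    """Hypothetical DB lookup: do we already have this meeting?"""
    ...

def save_meeting(meeting_id: str, page: dict) -> None:
    """Hypothetical DB upsert of one meeting record."""
    ...

def sync_new_meetings() -> None:
    page = 1
    while True:
        list_page = scrape(LIST_URL.format(page=page))
        new_ids, reached_known = [], False
        for meeting_id in extract_meeting_ids(list_page):
            if meeting_exists(meeting_id):
                # Everything from here on is already in our database.
                reached_known = True
                break
            new_ids.append(meeting_id)
        # Scrape and store only the meetings we were missing.
        for meeting_id in new_ids:
            save_meeting(meeting_id, scrape(MEETING_URL.format(meeting_id=meeting_id)))
        if reached_known or not new_ids:
            break
        page += 1
```

Once we’re caught up, the daily run should only ever need the first page or two of the list, so the crawl volume stays small.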
I feel like I might be overlooking something, and I’m not entirely sure if this is the best approach. What do you think?