Web scraper: create one file instead of multiple output files for pagination
I used this article https://docs.apify.com/academy/advanced-web-scraping/scraping-paginated-sites#define-and-enqueue-pivot-ranges to scrape data from multiple pages. When I run
apify run
I get 20 different JSON files. How can I combine all the data into one JSON file for all the pages?
xenophobic-harlequinOP•2y ago
also very similar to this: https://docs.apify.com/sdk/js/docs/examples/add-data-to-dataset but instead of having individual files, how can I have one file?
@Andrey Bykov can you please help me with this? I can't find a solution and searched everywhere
frail-apricot•2y ago
One call to Dataset.pushData() or Actor.pushData() produces one JSON file. I believe what you're looking for is https://crawlee.dev/api/core/class/Dataset#exportToJSON
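For reference, a minimal sketch of what that call looks like in a plain Crawlee project ('results' is an arbitrary record key; exportToJSON is the Crawlee method linked above):

```js
import { Dataset } from 'crawlee';

// After the crawl finishes: take every item that pushData() stored in the
// default dataset and write it all into a single JSON record named
// 'results' in the default key-value store.
await Dataset.exportToJSON('results');
```

With apify run, that record should end up on disk under storage/key_value_stores/default/results.json.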
xenophobic-harlequinOP•2y ago
yes, I tried using await Dataset.pushData() to create a single large file. On this page https://crawlee.dev/docs/introduction/saving-data#whats-datasetpushdata it says: "If you would like to store your data in a single big file, instead of many small ones, see the Result storage guide for Key-value stores."
I want all the data from the 20 pages to be present in 1 file
if I use await Dataset.exportToJSON('results'); it just combines the JSON files; I want the data to be under a data key
right now it is like this:
I want to have:
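One way to get that { data: [...] } shape, as a sketch assuming a plain Crawlee project rather than Web Scraper ('OUTPUT' is an arbitrary key name): read the items back out of the dataset and write them yourself as one key-value-store record.

```js
import { Dataset, KeyValueStore } from 'crawlee';

// Read back everything the crawler pushed into the default dataset.
const { items } = await Dataset.getData();

// Write all items as ONE record, nested under a `data` key,
// in the default key-value store.
await KeyValueStore.setValue('OUTPUT', { data: items });
```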
frail-apricot•2y ago
btw just realised this is the Crawlee forum. Apify-related questions should go to a separate forum. As for your question: in Web Scraper there's no easy way to do it, except using globalStore, I guess; see https://apify.com/apify/web-scraper#globalstore-object
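Very roughly, the globalStore idea could look like this inside Web Scraper's pageFunction. This is only a sketch: the get/set calls follow the #globalstore-object section linked above, and the isLastPage flag is a hypothetical marker you would set yourself when enqueueing the final page.

```js
// Sketch only -- verify method names against the linked globalStore docs.
async function pageFunction(context) {
    const { request, globalStore } = context;

    // Append this page's rows to one shared array.
    const items = (await globalStore.get('items')) || [];
    items.push({ url: request.url /* ...scraped fields... */ });
    await globalStore.set('items', items);

    // Hypothetical flag set in userData when enqueueing the last page:
    // only that page returns a dataset item, holding all pages under `data`.
    if (request.userData.isLastPage) {
        return { data: items };
    }
}
```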
xenophobic-harlequinOP•2y ago
sorry for posting to the wrong forum, I didn't notice that. What about this reduce? https://crawlee.dev/api/core/class/Dataset#reduce
frail-apricot•2y ago
Web Scraper does not have full access to Crawlee; only certain methods are exposed
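So reduce won't run inside Web Scraper itself. In a plain Crawlee project, though, a reduce-based version of the same idea could look roughly like this ('OUTPUT' again being an arbitrary key name):

```js
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open();

// Fold every dataset item into one accumulator array.
const allItems = await dataset.reduce((acc, item) => {
    acc.push(item);
    return acc;
}, []);

// Persist the combined result as a single JSON record under a `data` key.
await KeyValueStore.setValue('OUTPUT', { data: allItems });
```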