Web scraper: create one file instead of multiple output files for pagination
I used this article https://docs.apify.com/academy/advanced-web-scraping/scraping-paginated-sites#define-and-enqueue-pivot-ranges to scrape data from multiple pages. When I run
apify run
I get 20 different JSON files. How can I combine all the data into one JSON file for all the pages?
xenophobic-harlequinOP•2y ago
also very similar to this: https://docs.apify.com/sdk/js/docs/examples/add-data-to-dataset but instead of having individual files, how can I have one file?
@Andrey Bykov can you please help me with this? I can't find a solution and searched everywhere
frail-apricot•2y ago
One call to Dataset.pushData() or Actor.pushData() produces one JSON file. I believe what you're looking for is https://crawlee.dev/api/core/class/Dataset#exportToJSON
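For reference, a minimal sketch of what that call looks like in a plain Crawlee project ('results' is an arbitrary record key; exportToJSON is the Crawlee method linked above):

```js
import { Dataset } from 'crawlee';

// After the crawl finishes: take every item that pushData() stored in the
// default dataset and write it all into a single JSON record named
// 'results' in the default key-value store.
await Dataset.exportToJSON('results');
```

With apify run, that record should end up on disk under storage/key_value_stores/default/results.json.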
xenophobic-harlequinOP•2y ago
yes, I tried using await Dataset.pushData() to create a single large file. On this page https://crawlee.dev/docs/introduction/saving-data#whats-datasetpushdata it says: "If you would like to store your data in a single big file, instead of many small ones, see the Result storage guide for Key-value stores."
I want all the data from the 20 pages to be present in 1 file
if I use await Dataset.exportToJSON('results'); it just combines the JSON files; I want the data to be under a data key
right now it is like this:
I want to have:
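One way to get that { data: [...] } shape, as a sketch assuming a plain Crawlee project rather than Web Scraper ('OUTPUT' is an arbitrary key name): read the items back out of the dataset and write them yourself as one key-value-store record.

```js
import { Dataset, KeyValueStore } from 'crawlee';

// Read back everything the crawler pushed into the default dataset.
const { items } = await Dataset.getData();

// Write all items as ONE record, nested under a `data` key,
// in the default key-value store.
await KeyValueStore.setValue('OUTPUT', { data: items });
```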
frail-apricot•2y ago
btw just realised this is the Crawlee forum. Apify-related questions should go to a separate forum. As for your question: in Web Scraper there's no easy way to do it, except using globalStore, I guess; see https://apify.com/apify/web-scraper#globalstore-object
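Very roughly, the globalStore idea could look like this inside Web Scraper's pageFunction. This is only a sketch: the get/set calls follow the #globalstore-object section linked above, and the isLastPage flag is a hypothetical marker you would set yourself when enqueueing the final page.

```js
// Sketch only -- verify method names against the linked globalStore docs.
async function pageFunction(context) {
    const { request, globalStore } = context;

    // Append this page's rows to one shared array.
    const items = (await globalStore.get('items')) || [];
    items.push({ url: request.url /* ...scraped fields... */ });
    await globalStore.set('items', items);

    // Hypothetical flag set in userData when enqueueing the last page:
    // only that page returns a dataset item, holding all pages under `data`.
    if (request.userData.isLastPage) {
        return { data: items };
    }
}
```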
xenophobic-harlequinOP•2y ago
sorry for posting to the wrong forum, I didn't notice that. What about this reduce? https://crawlee.dev/api/core/class/Dataset#reduce
frail-apricot•2y ago
Web Scraper does not have full access to Crawlee; only certain methods are exposed
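So reduce won't run inside Web Scraper itself. In a plain Crawlee project, though, a reduce-based version of the same idea could look roughly like this ('OUTPUT' again being an arbitrary key name):

```js
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open();

// Fold every dataset item into one accumulator array.
const allItems = await dataset.reduce((acc, item) => {
    acc.push(item);
    return acc;
}, []);

// Persist the combined result as a single JSON record under a `data` key.
await KeyValueStore.setValue('OUTPUT', { data: allItems });
```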