metropolitan-bronze

Scrape different layouts

Hi, I am just getting started with Apify web scraping. I am trying to scrape a site with different page layouts. There are many pages of listed items with links to the item pages. First I scrape the links from the list, and then I need to fetch the actual data from each item page. How can I manage this in Apify? My current solution is to wrap it in an if/else depending on the content of the URL. However, this causes problems when I try to add new requests, as apparently the await statement can't be used anywhere except in async functions and at the top level of module bodies.

11 Replies

Pepa J•3y ago

Hello @AK, generally speaking, if the logic for scraping one page differs from the others, it is implemented as another standalone actor. But even your if-else solution should work. Could you share the block of code where the await statement cannot be used? We may be able to figure it out.
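For reference, the if/else approach can be kept tidy by deciding the layout in one place and dispatching to a handler per layout. This is only a sketch: `pickLayout`, `routes`, `handlers` and the returned objects are hypothetical names made up for illustration, not part of any Apify API, and the `/category/` test is taken from the URL pattern discussed in this thread.

```javascript
// Hypothetical route table: the first entry whose test passes wins.
const routes = [
    { name: 'category', matches: (url) => url.includes('/category/') },
    { name: 'item', matches: () => true }, // fallback for everything else
];

// Decide which layout a URL belongs to.
function pickLayout(url) {
    return routes.find((route) => route.matches(url)).name;
}

// Hypothetical handlers, one per layout. Real ones would read the page
// and enqueue links (category) or extract data (item).
const handlers = {
    category: async (context) => ({ layout: 'category', url: context.request.url }),
    item: async (context) => ({ layout: 'item', url: context.request.url }),
};

// The page function stays async, so each handler is free to use await.
async function pageFunction(context) {
    const layout = pickLayout(context.request.url);
    return handlers[layout](context);
}
```

Adding a third layout then means adding one route and one handler instead of another nested branch.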

metropolitan-bronzeOP•3y ago

Does this mean it is possible to use the output from one actor as input to another actor? 😊 This is essentially what I am trying to do right now that doesn't work:

if(context.request.url.includes("https://madensverden.dk/category/")) {
        const currentPageNo = $('.page-numbers.current').text();
        const nextPageNo = parseInt(currentPageNo) + 1;

        $('.listing-item a').each((index, el) => {
            const link = $(el).attr('href');
            if (link) {
                await context.enqueueRequest({ url: link });
            }
        })

        // Print some information to actor log
        context.log.info(`URLs: ${links}, PageNo: ${currentPageNo}`);

        // Manually add a new page to the queue for scraping.
        await context.enqueueRequest({ url: context.request.url + nextPageNo });
    }

The error I get is this:

ERROR Compilation of pageFunction failed. await is only valid in async functions and the top level bodies of modules

Sorry about the shitty formatting in that code block, it got messed up when I copied it.

Pepa J•3y ago

There are several ways to deal with this. The easiest to understand is probably to handle the await outside of the (for)each:

// ...
    if (context.request.url.includes("https://madensverden.dk/category/")) {
        const currentPageNo = $('.page-numbers.current').text();
        const nextPageNo = parseInt(currentPageNo) + 1;
    
        const requests = [];
        $('.listing-item a').each((index, el) => {
            const link = $(el).attr('href');
            if (link) {
                requests.push({ url: link });
            }
        });
    
        await context.enqueueRequests(requests);
    
        // ...
    }

Actually, it would also generate fewer requests to the Apify API (it will use only one, with all the URLs at once).
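The compile error comes from where the await sits: the callback passed to each is a plain (non-async) function, so await is not allowed inside it, even though the surrounding page function is async. A self-contained sketch of the collect-then-await pattern follows. The `context` object here is a stub written for this example (it only counts calls), and whether a batch method with this name exists on a real context depends on the scraper and SDK version.

```javascript
// Stub standing in for the scraper context. It only counts how often
// the batch method is called and remembers what it received.
const calls = { batch: 0, urls: [] };
const context = {
    enqueueRequests: async (requests) => {
        calls.batch += 1;
        calls.urls.push(...requests.map((request) => request.url));
    },
};

async function enqueueAsBatch(links) {
    const requests = [];
    // This callback is a plain function, so it only collects. No await in here.
    links.forEach((link) => {
        if (link) {
            requests.push({ url: link });
        }
    });
    // Back in the async function body, a single await is valid.
    await context.enqueueRequests(requests);
    return requests.length;
}
```

With three links (one of them empty), the stub is called once and receives two URLs.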

metropolitan-bronzeOP•3y ago

That makes sense. So the issue is the foreach. Before changing it I just appended the links to a list, so I'll go back to doing that 🙂 Thanks man.

Is it possible to use the output of an actor as input in another actor?

Also, I found another issue: context.EnqueueRequests isn't a function that exists when I try it out. Is there any other way to queue multiple requests at the same time?

Pepa J•3y ago

Which version of apify do you use? (You can see it in package.json.) In the latest one it could be await context.addRequests(requests).

metropolitan-bronzeOP•3y ago

I have no idea tbh 😛 I just fired it up in my browser today using the online console

metropolitan-bronzeOP•3y ago

But addRequests doesn't work either. Again I just get the message that no such function exists. I found the Apify docs, but I still don't see anything there indicating I can add multiple requests to the queue.

Pepa J•3y ago

Oh, so are you using the puppeteer scraper actor (https://console.apify.com/actors/YJCnS9qogi9XxDgLB), or did you create a new one from the template? If so, I am not deeply familiar with the version of the Apify SDK it uses, but:

for (const request of requests) {
     await context.enqueueRequest(request);
}

should also work.
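Since it is unclear which of these methods a given context exposes, one option is to check at run time and fall back to the loop. The sketch below assumes only the three method names mentioned in this thread (addRequests, enqueueRequests, enqueueRequest); `enqueueMany` itself is a hypothetical helper, and which branch runs depends on the scraper and SDK version in use.

```javascript
// Hypothetical helper: use a batch method if the context has one,
// otherwise enqueue the requests one at a time.
async function enqueueMany(context, requests) {
    if (typeof context.addRequests === 'function') {
        await context.addRequests(requests);
        return 'addRequests';
    }
    if (typeof context.enqueueRequests === 'function') {
        await context.enqueueRequests(requests);
        return 'enqueueRequests';
    }
    // The loop body belongs to this async function, so await is valid here,
    // unlike inside a plain callback passed to each/forEach.
    for (const request of requests) {
        await context.enqueueRequest(request);
    }
    return 'enqueueRequest';
}
```

The return value names the method that was used, which makes it easy to log once and see what the current context supports.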

metropolitan-bronzeOP•3y ago

Hi again @Pepa J, I spent the rest of yesterday trying to figure out what to do from here. The for-loop seems to be the same solution as my foreach solution, with the same number of calls? What I want to do from here is move my solution into a local application and make API calls to Apify. But I still can't see anywhere in the documentation that I can add multiple requests in one call. It would greatly reduce the number of calls I have to make to the API, so it would be much appreciated if we could figure out whether this actually exists despite it not being obvious from the docs. 😊

@Pepa J you don't have to answer me anymore. I have given up on Apify and will just build my own scraper in Python. I realize the documentation for Python is quite shit and close to non-existent, and it won't take me long to build my own. Thanks for your help though. I am sorry that Apify isn't mature enough for proper Python usage and in-depth documentation in that area.

Pepa J•3y ago

@AK I am sorry to hear that. The puppeteer-scraper that you are using is meant to be a pre-made standalone actor solution. It has its limitations (as mentioned in the Readme), but it is quick and easy to set up and run. For further customization it is worth creating your own actor. With tools like apify-cli (https://docs.apify.com/cli/) it should take a few minutes to create it locally and then push it to the platform or run it locally. About the Python docs: we added official Python support recently (a few weeks ago), so the docs will be improved in the future.
