Crawlee & Apify•3y ago
metropolitan-bronze

Scrape different layouts

Hi, I am just getting started with Apify web scraping. I am trying to scrape a site with different page layouts. There are many pages of listed items with links to the item pages. First I scrape the links from the list, and then I need to fetch the actual data from each item page. How can I manage this in Apify? My current solution is to wrap the logic in an if/else depending on the content of the URL. However, this causes issues when I try to add new requests, as apparently the await statement can't be used anywhere except in async functions and at the top level of module bodies.
11 Replies
Pepa J
Pepa J•3y ago
Hello @AK, generally speaking, if the logic for scraping one page is different from the others, it is implemented as another standalone actor. But even your if/else solution should work. Could you share the block of code where the await statement cannot be used? We may be able to figure it out.
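An if/else on the URL can be kept tidy by classifying the URL first and dispatching to one handler per layout. A minimal sketch, assuming list pages have `/category/` in their path and everything else is an item page; `classifyUrl`, `handlers`, the example URLs and the mock context are hypothetical names for illustration, not part of any Apify API:

```javascript
// Hypothetical helper: decide which layout a URL belongs to.
// Assumption: list pages live under "/category/"; anything else is an item page.
function classifyUrl(url) {
    return new URL(url).pathname.startsWith('/category/') ? 'LIST' : 'DETAIL';
}

// One handler per layout. The bodies are placeholders for the real scraping logic.
const handlers = {
    LIST: async (context) => ({ type: 'list', url: context.request.url }),
    DETAIL: async (context) => ({ type: 'detail', url: context.request.url }),
};

// pageFunction itself is async, so await is valid directly in its body.
async function pageFunction(context) {
    const layout = classifyUrl(context.request.url);
    return handlers[layout](context);
}

// Local usage with a mock context (no Apify platform needed).
pageFunction({ request: { url: 'https://madensverden.dk/category/example/' } })
    .then((result) => console.log(result));
```

Keeping the classification in a plain function makes it easy to check against a few sample URLs before running the scraper.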
metropolitan-bronze
metropolitan-bronze OP•3y ago
Does this mean it is possible to use the output from one actor as input to another actor? 😊
This is essentially what I am trying to do right now that doesn't work:
if(context.request.url.includes("https://madensverden.dk/category/")) {
const currentPageNo = $('.page-numbers.current').text();
const nextPageNo = parseInt(currentPageNo) + 1;

$('.listing-item a').each((index, el) => {
const link = $(el).attr('href');
if (link) {
await context.enqueueRequest({ url: link });
}
})

// Print some information to actor log
context.log.info(`URLs: ${links}, PageNo: ${currentPageNo}`);

// Manually add a new page to the queue for scraping.
await context.enqueueRequest({ url: context.request.url + nextPageNo });
}
The error I get is this:
ERROR Compilation of pageFunction failed. await is only valid in async functions and the top level bodies of modules
Sorry about the shitty formatting in that codeblock - it fucked up when I copied it.
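The error is raised when the pageFunction is compiled, because the `await` sits inside the `.each()` callback, which is a plain (non-async) function. A small sketch that reproduces this outside Apify, assuming Node.js; `compiles` is a hypothetical helper, and `items` and `enqueue` are stand-in names that are only parsed, never executed:

```javascript
// Hypothetical helper: returns true if `body` parses as the body of an async function.
// The code is only parsed, never run.
function compiles(body) {
    try {
        new Function(`return async function pageFunction(context) { ${body} }`);
        return true;
    } catch (err) {
        if (err instanceof SyntaxError) return false;
        throw err;
    }
}

// await inside a non-async callback: rejected at parse time.
const insideCallback = 'items.forEach((item) => { await enqueue(item); });';

// await directly in the async function body (for...of loop): accepted.
const insideLoop = 'for (const item of items) { await enqueue(item); }';

console.log(compiles(insideCallback)); // false
console.log(compiles(insideLoop)); // true
```

Marking the callback `async` would make the code compile, but `.each()` does not wait for the promises the callback returns, so the enqueue calls would still not be awaited.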
Pepa J
Pepa J•3y ago
There are several ways to deal with this. The easiest to understand is probably to move the await outside of the (for)each:
// ...
if (context.request.url.includes("https://madensverden.dk/category/")) {
    const currentPageNo = $('.page-numbers.current').text();
    const nextPageNo = parseInt(currentPageNo) + 1;

    const requests = [];
    $('.listing-item a').each((index, el) => {
        const link = $(el).attr('href');
        if (link) {
            requests.push({ url: link });
        }
    });

    await context.enqueueRequests(requests);

    // ...
}
It would actually generate fewer requests to the Apify API as well (only one, with all the URLs at once).
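The collect-first step can be tried without a browser. A minimal sketch, assuming the href values have already been read from the page; `buildRequests` is a hypothetical helper, the sample hrefs are made up, and the relative-URL resolution and de-duplication are additions for illustration rather than something the answer above requires:

```javascript
// Hypothetical helper: turn raw href values into request objects.
// Skips missing hrefs, resolves relative links against the page URL, drops duplicates.
function buildRequests(hrefs, pageUrl) {
    const seen = new Set();
    const requests = [];
    for (const href of hrefs) {
        if (!href) continue;
        const url = new URL(href, pageUrl).href;
        if (seen.has(url)) continue;
        seen.add(url);
        requests.push({ url });
    }
    return requests;
}

// Made-up sample input: one relative link, one missing href, one absolute link, one repeat.
const requests = buildRequests(
    ['/item-a/', undefined, 'https://madensverden.dk/item-b/', '/item-a/'],
    'https://madensverden.dk/category/example/',
);
console.log(requests.map((r) => r.url));
```

The resulting array can then be handed to whichever enqueue call the context provides, in a single awaited step after the loop.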
metropolitan-bronze
metropolitan-bronze OP•3y ago
That makes sense. So the issue is the foreach. Before changing it I just appended it to a list. I'll go back to doing that 🙂 Thanks man
Is it possible to use the output of an actor as input in another actor?
Also I found another issue. context.EnqueueRequests isn't a function that exists when I try it out. Is there any other way to queue multiple requests at the same time?
Pepa J
Pepa J•3y ago
Which version of apify do you use? (You can see it in package.json.) In the latest version it could be await context.addRequests(requests)
metropolitan-bronze
metropolitan-bronze OP•3y ago
I have no idea tbh 😛 I just fired it up in my browser today using the online console
metropolitan-bronze
metropolitan-bronze OP•3y ago
But addRequests doesn't work either. Again I just get the message that no such function exists.
I found the Apify docs, but I still don't see anything there indicating I can add multiple requests to the queue.
Pepa J
Pepa J•3y ago
Oh, so are you using the Puppeteer Scraper actor (https://console.apify.com/actors/YJCnS9qogi9XxDgLB), or did you create a new one from a template? If so, I am not that familiar with the version of the Apify SDK it uses, but:
for (const request of requests) {
    await context.enqueueRequest(request);
}
should also work.
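Since the available method seems to depend on the actor and SDK version, one option is to check at runtime which function the context actually provides. A sketch under that assumption: `pickEnqueueMethod` and `enqueueAll` are hypothetical helpers, only the method names mentioned in this thread are probed, and their call signatures are assumed to match how they are used above:

```javascript
// Hypothetical helper: name of the first enqueue method the context exposes, or null.
function pickEnqueueMethod(context) {
    const candidates = ['addRequests', 'enqueueRequests', 'enqueueRequest'];
    return candidates.find((name) => typeof context[name] === 'function') ?? null;
}

// Enqueue all requests using whatever the context supports.
async function enqueueAll(context, requests) {
    const method = pickEnqueueMethod(context);
    if (method === null) throw new Error('context has no known enqueue method');
    if (method === 'enqueueRequest') {
        // No batch method available: fall back to one call per request.
        for (const request of requests) {
            await context.enqueueRequest(request);
        }
        return;
    }
    // Assumption: the batch methods accept an array of request objects.
    await context[method](requests);
}

// Mock context that only supports single requests, for trying this locally.
const queued = [];
const mockContext = { enqueueRequest: async (request) => { queued.push(request.url); } };
enqueueAll(mockContext, [{ url: 'https://example.com/a' }, { url: 'https://example.com/b' }])
    .then(() => console.log(queued));
```

Logging the result of `pickEnqueueMethod(context)` once at the start of a run is a quick way to see which of the suggestions in this thread applies to a given actor.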
metropolitan-bronze
metropolitan-bronze OP•3y ago
Hi again @Pepa J, I spent the rest of yesterday trying to figure out what to do from here. The for-loop seems to be the same solution as my foreach solution? With the same number of calls? What I want to do from here is to move my solution into a local application and make API calls to Apify. But I still can't see anywhere in the documentation that I can add multiple requests in one call. It would greatly reduce the number of calls I have to make to the API, so it would be much appreciated if we could figure out whether this actually exists despite it not being obvious from the docs. 😊
@Pepa J you don't have to answer me anymore. I have given up on Apify and will just build my own scraper in Python. I realize the documentation is quite shit for Python and close to non-existent, and it won't take me long to build my own. Thanks for your help though - I am sorry that Apify isn't mature enough for proper Python usage and in-depth documentation in that area.
Pepa J
Pepa J•3y ago
@AK I am sorry to hear that. The puppeteer-scraper that you are using is meant to be a pre-made standalone actor. It has its limitations (as mentioned in the Readme), but it is quick to set up and run. For further customization it is worth creating your own actor: with tools like apify-cli (https://docs.apify.com/cli/) it should take a few minutes to create it locally and push it to the platform, or run it locally. About the Python docs: we added official support for Python recently (a few weeks ago), so the docs will be improved in the future.