metropolitan-bronze
Scrape different layouts
Hi,
I am just getting started with Apify web scraping.
I am trying to scrape a site with different page layouts. There are many pages of listed items with links to the item pages. First I scrape the links from the list, and then I need to fetch the actual data from each item page.
How can I manage this in Apify? My current solution was to wrap the page logic in an if/else based on the content of the URL. However, this causes issues when I try to add new requests, because await apparently can't be used anywhere except in async functions or at the top level of module bodies.
Hello @AK,
Generally speaking, if the logic for scraping one page differs from the others, it is implemented as a separate standalone actor.
But even your if/else solution should work.
Could you share the block of code where the await statement can't be used? We might be able to figure it out.
metropolitan-bronze OP • 3y ago
Does this mean it is possible to use the output from one actor as input to another actor?
This is essentially what I am trying to do right now that doesn't work:
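The code block did not survive copying. A plausible reconstruction - the /list URL check and the a.item selector are invented for illustration - showing an await inside a non-async forEach callback, which is exactly what produces the error below:

```js
// Hypothetical reconstruction - selectors and URL pattern are made up.
async function pageFunction(context) {
    const { request, page } = context;
    if (request.url.includes('/list')) {
        // Listing page: collect links to the item pages.
        const links = await page.$$eval('a.item', (els) => els.map((el) => el.href));
        links.forEach((url) => {
            // Fails to compile: the forEach callback is not async,
            // so await is not allowed here.
            await context.enqueueRequest({ url });
        });
    } else {
        // Item page: extract the actual data.
        return { url: request.url, title: await page.title() };
    }
}
```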
The error I get is this:
ERROR Compilation of pageFunction failed.
await is only valid in async functions and the top level bodies of modules
Sorry about the messy formatting in that code block - it got mangled when I copied it.
There are several ways to deal with this; the easiest to understand is moving the await outside of the (for)each:
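The code block was lost here as well; a sketch of the approach being described, reusing the hypothetical links array from the reconstruction above (the batch call context.addRequests is the name used later in this thread; its availability depends on the SDK version):

```js
// Collect the URLs synchronously - no await inside the callback.
const requests = [];
links.forEach((url) => {
    requests.push({ url });
});
// A single await at the top level of the async pageFunction.
// The batch method name depends on the SDK version in use.
await context.addRequests(requests);
```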
Actually, it would also generate fewer requests to the Apify API (it will use only one, with all the URLs at once).
metropolitan-bronze OP • 3y ago
That makes sense. So the issue is the forEach. Before changing it, I just appended to a list. I'll go back to doing that. Thanks man
Is it possible to use the output of an actor as input in another actor?
Also, I found another issue: context.EnqueueRequests doesn't exist when I try it out. Is there any other way to queue multiple requests at the same time?
Which version of apify do you use? (You can see it in package.json.)
In the latest version it could be await context.addRequests(requests).
metropolitan-bronze OP • 3y ago
I have no idea, tbh. I just fired it up in my browser today using the online console.
metropolitan-bronze OP • 3y ago
But addRequests doesn't work either - again, I just get the message that no such function exists.
I found the Apify docs, but I still don't see anything there indicating that I can add multiple requests to the queue.
Oh, so are you using the Puppeteer Scraper actor (https://console.apify.com/actors/YJCnS9qogi9XxDgLB), or did you create a new one from the template?
If so, I am not that deeply familiar with that version of the Apify SDK, but:
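This code block was also lost; judging from the reply further down ("the for loop seems to be the same solution as my forEach"), it was presumably a plain for...of loop over the collected links, where await is valid because the enclosing pageFunction is async:

```js
// One enqueueRequest call per URL; await is allowed here because
// the surrounding pageFunction is declared async.
for (const url of links) {
    await context.enqueueRequest({ url });
}
```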
should also work.
metropolitan-bronze OP • 3y ago
Hi again @Pepa J
I spent the rest of yesterday trying to figure out what to do from here.
The for loop seems to be the same solution as my forEach solution, with the same number of calls?
What I want to do from here is move my solution into a local application and make API calls to Apify. But I still can't find anywhere in the documentation that I can add multiple requests in one call. It would greatly reduce the number of calls I have to make to the API, so it would be much appreciated if we could figure out whether this actually exists, since it isn't obvious from the docs.
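For what it's worth, the apify-client package does expose a batch method on request queues (it may cap how many requests go into one call - check the current API reference). A minimal sketch with a placeholder token and queue ID:

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });
const queue = client.requestQueue('MY_QUEUE_ID');

// Adds several requests to the queue in a single API call.
await queue.batchAddRequests([
    { url: 'https://example.com/item/1', uniqueKey: 'item-1' },
    { url: 'https://example.com/item/2', uniqueKey: 'item-2' },
]);
```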
@Pepa J you don't have to answer me anymore. I have given up on Apify and will just build my own scraper in Python. The documentation for Python is quite poor, close to non-existent, and it won't take me long to build my own. Thanks for your help though - I am sorry that Apify isn't mature enough for proper Python usage and in-depth documentation in that area.
@AK I am sorry to hear that.
The puppeteer-scraper that you are using is meant to be a pre-made, standalone actor solution. It has its limitations (as mentioned in the Readme), but it is quick to set up and run. For further customization it is worth creating your own actor - with tools like apify-cli (https://docs.apify.com/cli/) it should take a few minutes to create it locally and push it to the platform, or run it locally.
About the Python docs - we added official support for Python only recently (a few weeks ago), so the documentation will also improve over time.