xenial-black

Advice concerning run limits and cost optimization

Hi everyone, I am currently designing an application that will scrape data from social networks, and I am considering Apify to do this. However, I have some technical questions, mainly concerning the limits on Actor runs and cost optimization. After reading the documentation thoroughly, some of my questions remain unanswered. To give you a bit of context, the app will scrape Facebook and Instagram pages and posts periodically. The number of pages to scrape will be of the order of 1000 and should grow linearly over time. What I am trying to find is a good compromise on batch size for running the page scrapers. Here is a list of the questions I still have after reading the documentation:
-> In the pricing documentation (https://apify.com/pricing), it's mentioned that there is a concurrent run limit. However, I can't find any information about what happens when this limit is reached. If I request more runs than the limit, are they queued? Or will the API respond with an error message? Moreover, the documentation is not consistent about the value of this limit (it is different here: https://docs.apify.com/platform/limits).
-> In the price optimization section (https://help.apify.com/en/articles/3470975-how-to-estimate-compute-unit-usage-for-your-project), it is said that it's more price efficient to run a few big runs rather than many small runs. However, I can't find any information about what counts as 'too big'. Are there any limits on how big a run can be (such as the number of URLs passed in the input, the size of the run result, or something else)?
Thank you for your help, Maxim
4 Replies
Pepa J · 3y ago
However, I can't find any information about what happens when this limit is reached. If I request more runs than the limit, are they queued?
You will get an error from the API. There is currently no way to queue runs. Thanks for noticing the inconsistency in the limits; we will fix it. I believe this page (https://apify.com/pricing) was updated more recently and should be the source of truth. Your use case sounds like you mainly want to use public Actors, and for most of these it is not possible to set the maxConcurrency attribute. Because of that, you are limited by the Actor's implementation.
I don't think the number of URLs has much to do with the "size of the run"; it mostly affects the length of the run (in time). Yes, it will be a little bit cheaper to do a single run for 10,000 URLs, but it would be significantly (about 10 times) faster to do 10 runs of 1,000 URLs each. There are generally no hard limits on an Actor run, but to be sure, check the Information tab of each Actor. I have in mind a few Actors that can only get about 5,000 results for a single keyword, but that mostly comes from limitations of the website being scraped.
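To make the "error, not queue" behaviour concrete, here is a minimal Python sketch of what a client-side retry could look like, assuming the official apify-client package. The token placeholder, the backoff delays, and the broad exception handling are illustrative assumptions, not the platform's prescribed approach.

```python
import time

from apify_client import ApifyClient

# Assumption: in a real app the token would come from config/env, not a literal.
client = ApifyClient("MY_APIFY_TOKEN")


def start_run_with_retry(actor_id: str, run_input: dict, max_attempts: int = 5) -> dict:
    """Try to start an Actor run; if the platform rejects it (e.g. because the
    concurrent-run limit is reached), wait and try again instead of queuing."""
    for attempt in range(1, max_attempts + 1):
        try:
            # .start() kicks off the run without waiting for it to finish.
            return client.actor(actor_id).start(run_input=run_input)
        except Exception as exc:  # the exact error class depends on the client version
            print(f"Attempt {attempt} to start {actor_id} failed: {exc}")
            time.sleep(30 * attempt)  # simple linear backoff before retrying
    raise RuntimeError(f"Could not start {actor_id} after {max_attempts} attempts")
```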
xenial-black (OP) · 3y ago
Hi @Pepa J, thank you for your answer 🙂 All right, so I'll have to build a queue system on my side to ensure that I don't start too many runs at the same time. I forgot to specify this, but for now, the Actors I am interested in are the following:
--> Facebook Ads
--> Facebook Pages
--> Facebook Posts
--> Instagram Profiles
--> Instagram Posts
All right, so from what you're saying, a decent input size for a run would be 1,000 URLs? Am I right in saying the trade-off is the following:
--> If I start runs with 1 URL each, it's cost inefficient and time inefficient due to the scrapers' startup time (however, it's simpler to implement on my side, as I do not need to add any batching logic).
--> Runs on the order of 1,000 URLs seem to be a sweet spot.
--> Runs on the order of 10,000 URLs would be very cost efficient but not time efficient.
Again, thank you for your help 🙂
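Following up on the batch-size discussion, here is a hedged sketch of the batching logic, reusing the hypothetical start_run_with_retry helper from the earlier sketch. The actor ID and the startUrls input field are placeholders, since each Actor defines its own input schema.

```python
def chunked(urls: list[str], size: int):
    """Yield successive batches of at most `size` URLs."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]


page_urls: list[str] = [...]  # the ~1000 Facebook/Instagram page URLs to scrape
BATCH_SIZE = 1000  # the "sweet spot" discussed above

for batch in chunked(page_urls, BATCH_SIZE):
    # "startUrls" is a common input field name, but check each Actor's input
    # schema -- field names and formats differ between Actors.
    run = start_run_with_retry(
        "apify/facebook-pages-scraper",  # placeholder actor ID, for illustration only
        {"startUrls": [{"url": url} for url in batch]},
    )
    print(f"Started run {run['id']} with {len(batch)} URLs")
```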
Pepa J · 3y ago
@Max I only gave you a hypothetical example. I suggest you try the individual scrapers first to get a better idea of their consumption. Otherwise I agree, except:
--> Runs on the order of 10,000 URLs would be very cost efficient but not time efficient.
I would not expect this to be very cost efficient. With parallel runs you would be paying extra for about 10 cold starts of roughly 15 seconds each, so about 150 seconds in total. In other words, you would be saving a few cents in exchange for runs that take about 10 times longer. I just read the example in the article you mentioned and it is quite an edge case (a worst-case scenario), with the idea of running 1,000 runs, each with a single URL. 🙂
xenial-black (OP) · 3y ago
All right, thank you for your help @Pepa J!
