Has anyone found a solution for running Crawlee inside a REST API on demand?

I have managed to get some parts of it working: I have a Node.js API that starts my crawler. I have yet to manage the request queue so it can handle additional and concurrent API calls, so I would like to know if someone has had any luck implementing such a solution. My particular use case requires running the API in my own cloud instead of on Apify.
foreign-sapphire
foreign-sapphire•3y ago
It depends on the data and the case. I created several small Cheerio actors that run under 128 MB of RAM; for a single data request, a run finishes in 4-5 seconds. I still prefer to read data from the dataset, but it can actually be delivered as a regular API:

Apify.main(async () => {
  // ... crawling logic ...
  return finalJSONData;
});

Then you can POST to https://api.apify.com/v2/acts/your~actor/run-sync?token=... and get the data in the response.
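For context, a minimal sketch of calling that run-sync endpoint from Node.js; the actor ID and token here are placeholders you would replace with your own:

// Hypothetical client call to Apify's run-sync endpoint (Node 18+,
// which ships a global fetch). Actor ID and token are placeholders.
const response = await fetch(
  'https://api.apify.com/v2/acts/your~actor/run-sync?token=YOUR_TOKEN',
  { method: 'POST' },
);
const data = await response.json();
console.log(data);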
deep-jade
deep-jadeOP•3y ago
Interesting, but my use case requires running the Crawlee crawler in my own cloud because of some constraints, so I will try to work a bit more with it. To scale the API, I am not sure whether Crawlee supports running multiple crawlers at the same time or whether I should start a separate instance. I need on-demand crawling with PlaywrightCrawler within 5 seconds, scaling as more people hit the API, so it is a bit of an edge case.
grumpy-cyan
grumpy-cyan•3y ago
You can run more crawlers at the same time; you just need to assign a new queue or request list to each and clean up afterwards if needed. You can also keep a single crawler running and just keep filling its queue; for that you need to manipulate crawler.autoscaledPool. The first approach is sketched below.
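For illustration, a minimal sketch of one crawler per API request, each isolated by its own named queue; the queue naming and handler body are assumptions, not taken from this thread:

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// One crawler per incoming API request, isolated by a named queue.
// The queue name and handler logic are illustrative only.
async function crawlOnce(url: string, requestId: string) {
  const queue = await RequestQueue.open(`run-${requestId}`);
  await queue.addRequest({ url });

  const crawler = new PlaywrightCrawler({
    requestQueue: queue,
    async requestHandler({ page, pushData }) {
      await pushData({ url: page.url(), title: await page.title() });
    },
  });

  await crawler.run();
  await queue.drop(); // cleanup, so the named queue does not linger
}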
deep-jade
deep-jadeOP•3y ago
Thanks
foreign-sapphire
foreign-sapphire•3y ago
Hm, from past experience I would suggest creating the cheapest DigitalOcean instance for $5 a month and running the crawler endlessly inside an Express.js wrapper; otherwise the 5-second challenge will become a pretty big issue. If you feel confident about DevOps, you can get a server twice as big from contabo.com for the same price, or check other hosting options. A rough sketch of such a wrapper is below.
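To make the "Express.js wrapper" idea concrete, here is a rough sketch with an assumed /crawl endpoint and a shared queue. Note that by default crawler.run() resolves once the queue drains, which is exactly the isFinishedFunction issue discussed further down in this thread:

import express from 'express';
import { CheerioCrawler, RequestQueue } from 'crawlee';

const app = express();
const queue = await RequestQueue.open();

const crawler = new CheerioCrawler({
  requestQueue: queue,
  async requestHandler({ request, $ }) {
    // site-specific extraction goes here
  },
});

// Started in the background, not awaited. Caveat: run() resolves when
// the queue is empty, so keeping it alive needs the isFinishedFunction
// tweak mentioned later in this thread.
crawler.run();

// Hypothetical endpoint shape: clients enqueue URLs on demand.
app.get('/crawl', async (req, res) => {
  await queue.addRequest({ url: String(req.query.url) });
  res.json({ status: 'queued' });
});

app.listen(3000);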
deep-jade
deep-jadeOP•3y ago
Thanks, that is a good suggestion. I am using Azure Container Apps or Kubernetes in Azure to handle the container. Locally I get data from the site within 6-7 seconds, which is also fine. So the only thing I need to solve at the moment is keeping the crawler running as long as there are API requests, running a crawler for each request. I am quite confident with DevOps.

I tried adding the following, but it stops the crawler after the first crawl because the queue is apparently empty, even though I add a new URL to it in my GET API endpoint:
autoscaledPoolOptions: {
  minConcurrency: 1,
},
What can I pass to await crawler.run() to keep the crawler running until I explicitly say it should stop?
deep-jade
deep-jadeOP•3y ago
I currently have this code, which works and keeps the crawler running, but sometimes the crawler does not wait until the price appears for the domain on the site, it does not shut down the browser once processing is done, and the API does not finish and return data to the client. So if you have any suggestions on how the code should look to have an API that can run multiple crawlers at the same time, independently of each API call, that would be nice 🙂 https://gist.github.com/Trubador/b67a6b78cafec99f191b7aa33f2ed654
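(Not from the gist itself, but on the "does not wait till the price appears" part: one common fix is to wait for the element explicitly inside the request handler. The .price selector below is a placeholder, not the real site's selector.)

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData }) {
    // Wait until the dynamically rendered price element exists;
    // '.price' is an assumed selector for illustration.
    await page.waitForSelector('.price', { timeout: 10_000 });
    await pushData({ price: await page.locator('.price').textContent() });
  },
});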
genetic-orange
genetic-orange•3y ago
I'm also solving a similar problem right now. I used to use puppeteer-cluster (example implementation: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js). Now I've decided to migrate to Crawlee and found that it doesn't seem to have equivalent functionality built in.
deep-jade
deep-jadeOP•3y ago
@Romja Yeah, a lot of customization is needed. I have the API working, but it can only process one request right now before needing to be restarted.
genetic-orange
genetic-orange•3y ago
I don't think that's the coolest solution 😦
deep-jade
deep-jadeOP•3y ago
No, definitely not. It needs to handle multiple concurrent requests. @Romja It should be pretty evident how I want to solve this by looking at the code.
grumpy-cyan
grumpy-cyan•3y ago
You need to adjust the autoscaledPoolOptions isFinishedFunction (https://crawlee.dev/api/core/class/AutoscaledPool). This way you can keep the crawler running even if the queue is empty, as sketched below.
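As a sketch, assuming a simple shouldStop flag that your API flips when it wants the crawler to shut down (the flag and the handler body are illustrative; isFinishedFunction itself is documented on the AutoscaledPool page linked above):

import { PlaywrightCrawler } from 'crawlee';

// Flip this from your API (e.g. a /shutdown endpoint) to let the
// crawler finish; until then, an empty queue will not stop it.
let shouldStop = false;

const crawler = new PlaywrightCrawler({
  autoscaledPoolOptions: {
    // The pool calls this to decide whether it is done; returning
    // false keeps it polling the queue for new requests.
    isFinishedFunction: async () => shouldStop,
  },
  async requestHandler({ page, pushData }) {
    await pushData({ title: await page.title() });
  },
});

await crawler.run(); // resolves only after shouldStop becomes true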
foreign-sapphire
foreign-sapphire•3y ago
It's site-specific; I did not check your target site. In the past I used a crawler with a permanently opened page, and all other logic was performed against that page instance. This way you save time on opening pages, and if you can figure out how the page's internal web app works, you can mimic its data calls via fetch() inside the browser. If anything else works faster for browser-based scraping, I will be surprised 😉 A rough illustration is below.
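To illustrate the "mimic the webapp's data calls" idea: from an already-open Playwright page you can call the site's internal JSON endpoint, so cookies and headers match the real app. The /api/price endpoint here is entirely hypothetical:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData }) {
    // fetch() runs in the page context, reusing the site's session.
    // '/api/price?id=123' is a made-up endpoint for illustration.
    const data = await page.evaluate(async () => {
      const res = await fetch('/api/price?id=123');
      return res.json();
    });
    await pushData(data);
  },
});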
deep-jade
deep-jadeOP•3y ago
Thanks, I will try that 🙂 I don't completely understand your comment, though. Do you mean calling the site's APIs directly in case the client application is not hosted server-side?
