Has anyone found a solution to run Crawlee inside a REST API on demand?
I have managed to get parts of it working: I have a Node.js API that starts my crawler.
I have yet to get the request queue to handle additional and concurrent API calls, so I would just like to know if someone has had any luck implementing such a solution.
My particular use case requires running this API in my own cloud instead of on Apify.
15 Replies
foreign-sapphire•3y ago
It depends on the data/use case. I created several small Cheerio actors that run under 128 MB of RAM; a single data-request run is done in 4-5 seconds. I still prefer to read data from the dataset, but it can actually be delivered as a regular API:
Apify.main(async () => { /* ... */ return finalJSONData; });
then you can POST to https://api.apify.com/v2/acts/your~actor/run-sync?token=...
and get the data in the response.
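For example, a minimal sketch of calling that endpoint from Node 18+ (the actor name, token, and input body are placeholders for your own values):
```js
// Minimal sketch, Node 18+ (global fetch). "your~actor", the token and
// the input body are placeholders for your own values.
const res = await fetch(
    'https://api.apify.com/v2/acts/your~actor/run-sync?token=YOUR_TOKEN',
    {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ startUrl: 'https://example.com' }),
    },
);
const finalJSONData = await res.json();
```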
deep-jadeOP•3y ago
Interesting, but my use case requires running the Crawlee crawler in my own cloud because of some constraints, so I will keep working on it. To scale the API, I am not sure whether Crawlee supports running multiple crawlers at the same time or whether I should start a separate instance.
I need to do on-demand crawling with PlaywrightCrawler within 5 seconds and scale when more people hit the API, so it is a bit of an edge case.
grumpy-cyan•3y ago
You can run more crawlers at the same time; you just need to assign a new queue or list to each one and clean up afterwards if needed (a sketch of this follows below).
You can also keep a crawler running and just fill the queue; for that you need to manipulate the crawler.autoscaledPool.
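A minimal sketch of the first (crawler-per-request) approach, assuming one isolated, uniquely named queue per API call (the queue naming and handler are illustrative, not prescribed):
```js
import { randomUUID } from 'node:crypto';
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// One isolated crawler per API call: a uniquely named queue avoids
// collisions between concurrent requests and is dropped afterwards.
async function crawlOnce(url) {
    const queue = await RequestQueue.open(`req-${randomUUID()}`);
    await queue.addRequest({ url });

    const results = [];
    const crawler = new PlaywrightCrawler({
        requestQueue: queue,
        requestHandler: async ({ page }) => {
            results.push({ title: await page.title() });
        },
    });

    await crawler.run();
    await queue.drop(); // cleanup, as suggested above
    return results;
}
```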
deep-jadeOP•3y ago
Thanks
foreign-sapphire•3y ago
Hm, from past experience I would suggest creating the cheapest DigitalOcean instance for $5 a month and running the crawler endlessly in an Express.js wrapper; otherwise the 5-second target will become a pretty big issue.
If you feel confident about DevOps you can get a twice-as-big server from contabo.com for the same price, or check other hosting options.
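A minimal sketch of such an Express wrapper, assuming the hypothetical crawlOnce() helper from the sketch above (the route, port, and query parameter are arbitrary):
```js
import express from 'express';

const app = express();

// Each GET spins up an isolated crawl; errors surface as a 500.
app.get('/scrape', async (req, res) => {
    try {
        const data = await crawlOnce(req.query.url); // hypothetical helper sketched above
        res.json(data);
    } catch (err) {
        res.status(500).json({ error: String(err) });
    }
});

app.listen(3000, () => console.log('listening on :3000'));
```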
deep-jadeOP•3y ago
Thanks, that is a good suggestion. I am using Azure Container Apps or Kubernetes on Azure to handle the container. Locally I get data from the site within 6-7 seconds, which is also fine. So the only thing I need to solve at the moment is keeping the crawler running as long as there are API requests, and running a crawler for each request.
I am quite confident with DevOps
I tried adding that, but it stops the crawler after the first crawl, because the queue is apparently empty even though I add a new URL to it in my GET API endpoint.
What can I pass to await crawler.run() to keep the crawler running until I explicitly tell it to stop?
deep-jadeOP•3y ago
I currently have this code, which works and keeps the crawler running, but sometimes the crawler does not wait until the price appears for the domain on the site, the crawler does not shut down the browser once processing is done, and the API does not finish and return data to the client. So any suggestions on how the code should look to get an API that can run multiple crawlers at the same time, independently of each API call, would be nice 🙂
https://gist.github.com/Trubador/b67a6b78cafec99f191b7aa33f2ed654
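For the part where the crawler does not wait for the price, one hedged sketch is to wait explicitly for the element inside the requestHandler (the .price selector and timeout are placeholders for the real site):
```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, pushData }) => {
        // Wait for the (hypothetical) price element instead of reading it
        // immediately; tune the selector and timeout to the actual site.
        await page.waitForSelector('.price', { timeout: 10_000 });
        await pushData({ price: await page.locator('.price').first().textContent() });
    },
});
```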
genetic-orange•3y ago
I'm also solving a similar problem right now. I used to use puppeteer-cluster (example implementation: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js). Now I've decided to migrate to Crawlee and found that it doesn't seem to have any such functionality.
deep-jadeOP•3y ago
@Romja Yeah, a lot of customization is needed. I have the API working; however, it can only process one request right now before needing to be restarted.
genetic-orange•3y ago
I don't think that's the coolest solution 😦
deep-jadeOP•3y ago
No, definitely not. It needs to handle multiple concurrent requests.
@Romja It should be pretty evident from the code how I want to solve this.
grumpy-cyan•3y ago
You need to adjust the autoscaledPoolOptions.isFinishedFunction (https://crawlee.dev/api/core/class/AutoscaledPool). This way you can keep the crawler running even if the queue is empty.
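A minimal sketch of that, assuming a module-level shuttingDown flag that an explicit shutdown endpoint would flip (the flag is hypothetical):
```js
import { PlaywrightCrawler } from 'crawlee';

let shuttingDown = false; // hypothetical flag, flipped by e.g. a /shutdown endpoint

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        // Overrides the default "queue is empty" check: the pool only
        // reports finished once we explicitly ask it to stop.
        isFinishedFunction: async () => shuttingDown,
    },
    requestHandler: async ({ page, pushData }) => {
        await pushData({ title: await page.title() });
    },
});

// crawler.run() now resolves only after shuttingDown becomes true;
// meanwhile new URLs can keep being added to its queue from API handlers.
await crawler.run();
```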
foreign-sapphire•3y ago
It's site-specific; I did not check your target site. In the past I used a crawler with a permanently opened page, and all other logic was performed against that page instance. This way you save time on opening pages, and if you can figure out how the page's internal web app works, you can mimic the calls it makes for its data via fetch() in the browser. If anything else works faster for browser-based scraping, I will be surprised 😉
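For example, a sketch of issuing such a call from the long-lived page (the endpoint is a placeholder you would discover in the site's own network traffic):
```js
// Runs inside the browser context of the permanently opened page, reusing
// its cookies and session; '/internal/api/price?id=123' is a hypothetical
// endpoint discovered from the site's XHR traffic.
const data = await page.evaluate(async () => {
    const res = await fetch('/internal/api/price?id=123');
    return res.json();
});
```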
deep-jadeOP•3y ago
Thanks, I will try that 🙂
I don't completely understand your comment, though: do you mean calling APIs directly in case the client application is not hosted server-side?