The best way to scale a browser pool across multiple machines
As I understand it, there is no problem running Crawlee in a Docker container where browsers will work. But what if you need to create a cluster of machines? Is there built-in functionality for managing a browser pool running on different hosts, or do you have any ideas on how to do this?
12 Replies
wee-brown•3y ago
Crawlee's crawlers are designed with the idea of being run on a single machine, but it is definitely more than possible to run multiple machines with a crawler running in each of them. However, things like allocating requests to each container's crawler accordingly and scaling up/down will need to be handled on your end.
eager-peach•3y ago
Generally, you want to split your URLs (beforehand or dynamically) and hand the workloads to the machines. If you can avoid live synchronizing, it will save you a lot of trouble.
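A static split could look something like this (just a sketch; the env variables and the modulo split are only one possible way to assign each machine its own slice, not anything built into Crawlee):
```js
// Sketch: statically partition the start URLs across N machines.
// Each machine is started with its own index, e.g. MACHINE_COUNT=3 MACHINE_INDEX=0..2
// (these env variables are an example convention, not part of Crawlee).
const MACHINE_COUNT = Number(process.env.MACHINE_COUNT ?? 1);
const MACHINE_INDEX = Number(process.env.MACHINE_INDEX ?? 0);

const allUrls = [
    // ...the full list of start URLs...
];

// Keep only the slice that belongs to this machine.
const myUrls = allUrls.filter((_, i) => i % MACHINE_COUNT === MACHINE_INDEX);

// Feed the slice to this machine's crawler and merge the datasets afterwards,
// e.g. await crawler.run(myUrls);
```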
xenial-blackOP•3y ago
Can this help me in any way?
https://github.com/apify/devtools-server
xenial-blackOP•3y ago
What did you mean by "avoid live synchronizing"?
eager-peach•3y ago
@Romja
1. Devtools server is for debugging, not scaling
2. If you want to scale to multiple servers/machines, the best way is to split the URLs so that the machines are independent of each other. Then just merge the data.
xenial-blackOP•3y ago
Can you tell me how to add a URL to the RequestQueue using GET requests?
Like here: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js
eager-peach•3y ago
Why? It should be a POST/PUT request because you are passing in data.
xenial-blackOP•3y ago
It doesn't matter which HTTP method I use to send the URL. Is there any way to run the crawler and dynamically add URLs to the queue?
wee-brown•3y ago
If the request queue is stored in the cloud, you can add to it dynamically via the Apify API using this endpoint: https://docs.apify.com/api/v2#/reference/request-queues/request-collection/add-request
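Something like this (a sketch; the queue ID, token, and URL are placeholders you would replace with your own values):
```js
// Sketch: add a request to a cloud request queue through the Apify API.
const QUEUE_ID = '<your-queue-id>';
const token = process.env.APIFY_TOKEN;

await fetch(`https://api.apify.com/v2/request-queues/${QUEUE_ID}/requests?token=${token}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com' }),
});
```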
wee-brown•3y ago
If, for example, you have two different crawlers running in separate containers that are running off of the same request queue, the queue must be stored in the cloud. In the Apify SDK, this can be done by using the forceCloud option when opening a request queue. https://sdk.apify.com/api/apify/interface/OpenStorageOptions#forceCloud
wee-brown•3y ago
Then, you won't even need to directly interact with the Apify API at all, and you can just use the SDK.
For example, this request queue will be stored in the cloud (on your Apify account):
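(a rough sketch, assuming the Apify SDK v3 Actor API)
```js
// Sketch: open a named request queue stored on the Apify platform.
import { Actor } from 'apify';

await Actor.init();

// forceCloud stores the queue in the cloud instead of on the local disk,
// so other containers can open the same queue by name.
const requestQueue = await Actor.openRequestQueue('some-name', { forceCloud: true });
```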
And when you add a new request to it like this:
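(continuing the sketch above, with a placeholder URL)
```js
// Sketch: push a new request into the shared cloud queue.
await requestQueue.addRequest({ url: 'https://example.com' });
```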
That request will now be available for processing by any other containers that have also opened the request queue with the name some-name.