The best way to scale a browser pool across multiple machines
As I understand it, there is no problem running Crawlee in a Docker container where browsers will work. But what if you need to create a cluster of machines? Is there built-in functionality for managing a browser pool running on different hosts, or do you have any ideas on how to do this?
12 Replies
wee-brown•3y ago
Crawlee's crawlers are designed with the idea of being run on a single machine, but it is definitely more than possible to run multiple machines with a crawler running in each of them. However, things like allocating requests to each container's crawler accordingly and scaling up/down will need to be handled on your end.
eager-peach•3y ago
Generally, you want to split your URLs (beforehand or dynamically) and hand the workloads to the machines. If you can avoid live synchronizing, it will save you a lot of trouble.
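A static split could look something like this (just a sketch; the env variables and the modulo split are only one possible way to assign each machine its own slice, not anything built into Crawlee):
```js
// Sketch: statically partition the start URLs across N machines.
// Each machine is started with its own index, e.g. MACHINE_COUNT=3 MACHINE_INDEX=0..2
// (these env variables are an example convention, not part of Crawlee).
const MACHINE_COUNT = Number(process.env.MACHINE_COUNT ?? 1);
const MACHINE_INDEX = Number(process.env.MACHINE_INDEX ?? 0);

const allUrls = [
    // ...the full list of start URLs...
];

// Keep only the slice that belongs to this machine.
const myUrls = allUrls.filter((_, i) => i % MACHINE_COUNT === MACHINE_INDEX);

// Feed the slice to this machine's crawler and merge the datasets afterwards,
// e.g. await crawler.run(myUrls);
```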
xenial-blackOP•3y ago
Can this help me in any way?
https://github.com/apify/devtools-server
xenial-blackOP•3y ago
What did you mean by "avoid live synchronizing"?
eager-peach•3y ago
@Romja
1. Devtools server is for debugging, not scaling
2. If you want to scale to multiple servers/machines, the best way is to split the URLs so that the machines are independent of each other. Then just merge the data.
xenial-blackOP•3y ago
Can you tell me how to add a URL to the RequestQueue using GET requests?
Like here: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js
eager-peach•3y ago
Why? It should be a POST/PUT request because you are passing in data.
xenial-blackOP•3y ago
It doesn't matter which HTTP method I use to send the URL. Is there any way to run the crawler and dynamically add URLs to the queue?
wee-brown•3y ago
If the request queue is stored in the cloud, you can add to it dynamically via the Apify API using this endpoint: https://docs.apify.com/api/v2#/reference/request-queues/request-collection/add-request
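Something like this (a sketch; the queue ID, token, and URL are placeholders you would replace with your own values):
```js
// Sketch: add a request to a cloud request queue through the Apify API.
const QUEUE_ID = '<your-queue-id>';
const token = process.env.APIFY_TOKEN;

await fetch(`https://api.apify.com/v2/request-queues/${QUEUE_ID}/requests?token=${token}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com' }),
});
```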
wee-brown•3y ago
If, for example, you have two different crawlers running in separate containers that are running off of the same request queue, the queue must be stored in the cloud. In the Apify SDK, this can be done by using the forceCloud option when opening a request queue. https://sdk.apify.com/api/apify/interface/OpenStorageOptions#forceCloud
wee-brown•3y ago
Then, you won't even need to directly interact with the Apify API at all, and you can just use the SDK.
For example, this request queue will be stored in the cloud (on your Apify account):
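(a rough sketch, assuming the Apify SDK v3 Actor API)
```js
// Sketch: open a named request queue stored on the Apify platform.
import { Actor } from 'apify';

await Actor.init();

// forceCloud stores the queue in the cloud instead of on the local disk,
// so other containers can open the same queue by name.
const requestQueue = await Actor.openRequestQueue('some-name', { forceCloud: true });
```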
And when you add a new request to it like this:
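(continuing the sketch above, with a placeholder URL)
```js
// Sketch: push a new request into the shared cloud queue.
await requestQueue.addRequest({ url: 'https://example.com' });
```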
That request will now be available for processing by any other containers that have also opened the request queue with the name some-name.