Crawlee & Apify•3y ago

Concurrency: How to use multiple proxies / session pool IDs?

Hi, I'm using the proxy config with 100 proxies. The goal is to let the scraper run with say 4 sessions concurrently - using 4 different proxies. In each run, I see it picks one Session ID = One proxy and runs through all requests with the same one. (it's different one each time, but each time it's a single IP).

import { ProxyConfiguration } from 'crawlee';
import { SMART_PROXY_DATACENTER_IPS } from '../utils/proxies.js';

import ApplicationRouter from './ApplicationRouter.js';


export default class TestProxies extends ApplicationRouter {

  async setup() {
    this.version = 1;
    this.prefix = 'TestProxies';
    this.datasetName = `${this.prefix}_dataset_V${this.version}`;
  }

  async getInitialPages() {
    return [
      { url: "https://ifconfig.co/?a=1", label: "page" },
      { url: "https://ifconfig.co/?a=2", label: "page" },
      { url: "https://ifconfig.co/?a=3", label: "page" },
      { url: "https://ifconfig.co/?a=4", label: "page" },
    ];
  }

  getRequestQueueName() {
    return `${this.prefix}_queue`;
  }

  getPageRoot() {
    return 'https://ifconfig.co';
  }

  // This is the entry
  async visitPage() {
    const ip = await this.text({ css: "#output" })
    this.debug("Proxy IP is", ip);

    await this.sleep(4000);
  }

  async getCrawlerOptions() {
    return {
      maxRequestRetries: 3,
      maxConcurrency: 2,

      useSessionPool: true,
      sessionPoolOptions: {
        maxPoolSize: 25,

        sessionOptions: {
          maxUsageCount: 150,
          maxAgeSecs: 23*60, // IPs rotate after 30 minutes
        },

        persistStateKeyValueStoreId: `${this.prefix}_V${this.version}_sessions`,
        persistStateKey: `${this.prefix}_V${this.version}_my-session-pool`,
      },

      proxyConfiguration: new ProxyConfiguration({
        proxyUrls: SMART_PROXY_DATACENTER_IPS
      })
    }
  }

}

import { ProxyConfiguration } from 'crawlee';
import { SMART_PROXY_DATACENTER_IPS } from '../utils/proxies.js';

import ApplicationRouter from './ApplicationRouter.js';


export default class TestProxies extends ApplicationRouter {

  async setup() {
    this.version = 1;
    this.prefix = 'TestProxies';
    this.datasetName = `${this.prefix}_dataset_V${this.version}`;
  }

  async getInitialPages() {
    return [
      { url: "https://ifconfig.co/?a=1", label: "page" },
      { url: "https://ifconfig.co/?a=2", label: "page" },
      { url: "https://ifconfig.co/?a=3", label: "page" },
      { url: "https://ifconfig.co/?a=4", label: "page" },
    ];
  }

  getRequestQueueName() {
    return `${this.prefix}_queue`;
  }

  getPageRoot() {
    return 'https://ifconfig.co';
  }

  // This is the entry
  async visitPage() {
    const ip = await this.text({ css: "#output" })
    this.debug("Proxy IP is", ip);

    await this.sleep(4000);
  }

  async getCrawlerOptions() {
    return {
      maxRequestRetries: 3,
      maxConcurrency: 2,

      useSessionPool: true,
      sessionPoolOptions: {
        maxPoolSize: 25,

        sessionOptions: {
          maxUsageCount: 150,
          maxAgeSecs: 23*60, // IPs rotate after 30 minutes
        },

        persistStateKeyValueStoreId: `${this.prefix}_V${this.version}_sessions`,
        persistStateKey: `${this.prefix}_V${this.version}_my-session-pool`,
      },

      proxyConfiguration: new ProxyConfiguration({
        proxyUrls: SMART_PROXY_DATACENTER_IPS
      })
    }
  }

}

14 Replies

MEE6•3y ago

@Michal just advanced to level 3! Thanks for your contributions! 🎉

xenial-blackOP•3y ago

From the documentation, it should pick a proxy with each session ID. With session pool, I'd hope this gets automatically managed since it starts 2 scrapes concurrently, but I can mange it myself, that's okay too. How do I make it use multiple session IDs then? Just verified, It is the same session ID each time:

=== Visiting https://ifconfig.co/?a=1 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn

=== Visiting https://ifconfig.co/?a=1 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn
   Proxy IP is 185.147.143.125
   Proxy IP is 185.147.143.125
pushing data 15
pushing data 15

=== Visiting https://ifconfig.co/?a=3 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn
   Proxy IP is 185.147.143.125

=== Visiting https://ifconfig.co/?a=4 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn
   Proxy IP is 185.147.143.125

=== Visiting https://ifconfig.co/?a=1 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn

=== Visiting https://ifconfig.co/?a=1 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn
   Proxy IP is 185.147.143.125
   Proxy IP is 185.147.143.125
pushing data 15
pushing data 15

=== Visiting https://ifconfig.co/?a=3 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn
   Proxy IP is 185.147.143.125

=== Visiting https://ifconfig.co/?a=4 ===
   What is my IP address? — ifconfig.co
session_EimgHiOPKn
   Proxy IP is 185.147.143.125

It would be nice if the documentation mentioned how to control this a little. Maybe it does, just didn't find it and asking here is not super practical time-wise 🙂 From reading the posts here I guess the browser has one session only and starting a new session means a new browser, thus concurrency doesn't mean multiple sessions at once, but just multiple tabs in one browser? In that case, how do we implement concurrent scraping from multiple IPs? I'd prefer to use each IP nicely and switch to a new one with a new session before I get blocked, not looking to burn each IP in the process. So trying again to get this multiple IPs thing to work, I am running Crawlee twice to see if I can get it working that way. It could almost work, but it is picking the same URLs from the queue in the same order so it's pointless unless I can get around that.

conscious-sapphire•3y ago

You answered yourself, for browser crawlers, session is bound to a browser instance, not Request. You can have multiple requests running concurrently with the same session and you can have multiple sessions/browsers running concurrently. You can set everything in browserPoolOptions, e.g. you can have only 1 page per browser and then it will run e.g. 10 browsers in parallel.https://crawlee.dev/api/browser-pool/interface/BrowserPoolOptions

BrowserPoolOptions | API | Crawlee

xenial-blackOP•3y ago

Thanks! Not sure how I missed the browserPoolOptions setting, was fixated on the sessions somehow.

conscious-sapphire•3y ago

here is how it is passed to the crawler https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#browserPoolOptions

PlaywrightCrawlerOptions | API | Crawlee

xenial-blackOP•3y ago

Just by glancing it, I'm not sure what to focus on from the options it doesn't seem to have much relevance based on what I see documented https://crawlee.dev/api/browser-pool/interface/BrowserPoolOptions that's the same link as before How about https://crawlee.dev/api/core/interface/AutoscaledPoolOptions, that seems more relevant

conscious-sapphire•3y ago

maxOpenPagesPerBrowser: 1 - you have 10 requests in parallel, each with separate browser

xenial-blackOP•3y ago

so that's the only option to set? session pool concurrency to 10 and this to 1 will open 10 browsers?

MEE6•3y ago

@Michal just advanced to level 4! Thanks for your contributions! 🎉

xenial-blackOP•3y ago

This looks also handy:

    browserPoolOptions: {
        retireBrowserAfterPageCount: 10,

    browserPoolOptions: {
        retireBrowserAfterPageCount: 10,

As I want the "surfing" seem as natural as possible, would that also retire the session?

conscious-sapphire•3y ago

Yeah, that means how often you want to rotate. 10 looks pretty natural I think

xenial-blackOP•3y ago

Awesome, thanks! Btw is there a way to make it a calback so I can make it random?

conscious-sapphire•3y ago

You can manipulate the properties of the crawler as you go, typescript may complain but in JS world, you can do whatever 😄

exotic-emerald•3y ago

This was helpful for me as well, @Lukas Krivka , as I ran into the same issue expecting a rotation through proxy urls and instead saw all requests go through the same single one. Perhaps adding a note to the ProxyConfiguration or proxy management docs would help?

Gaming

Programming

Concurrency: How to use multiple proxies / session pool IDs?

Did you find this page helpful?