Connecting to a remote browser instance?

Is there a way we can specify a web socket endpoint in the PlaywrightCrawler config (or somewhere else) so we can connect to a remote browser?
8 Replies
quickest-silver
quickest-silver•17mo ago
Hi @tim, it looks like the solution is not straightforward. You could try writing your own PlaywrightPlugin, replacing every this.library.launch call with this.library.connectOverCDP('http://hostname:port') (e.g. http://localhost:9222), and then providing it to the PlaywrightCrawler via the browserPool option (check the code of PlaywrightCrawlerOptions for more details).
GitHub
crawlee/packages/playwright-crawler/src/internals/playwright-crawle...
Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. - apify/crawlee
quickest-silver
quickest-silver•17mo ago
BrowserType | Playwright
BrowserType provides methods to launch a specific browser instance or connect to an existing one. The following is a typical example of using Playwright to drive automation:
correct-apricot
correct-apricotOP•17mo ago
Thanks for the response! Unfortunately, when I try that I get an error: Error: browserPoolOptions.browserPlugins is disallowed. Use launchContext.launcher instead.
national-gold
national-gold•17mo ago
Hello! I have the same task with a remote browser. @tim, did you find the solution with launchContext.launcher? Could you share it?
quickest-silver
quickest-silver•17mo ago
I was thinking about another solution: you can create a BasicCrawler and manage the browser and page yourself, for example:
import { chromium } from 'playwright';
import { newInjectedContext } from 'fingerprint-injector';

const BROWSER_URL = 'http://127.0.0.1:9222'; // or something like 'ws://127.0.0.1:36775/devtools/browser/a292f96c-7332-4ce8-82a9-7411f3bd280a'

// ... inside your BasicCrawler
async requestHandler({ request, sendRequest, log }) {
    // Connect to the remote browser
    const browser = await chromium.connectOverCDP(BROWSER_URL);
    const context = await newInjectedContext(browser); // See https://github.com/apify/fingerprint-suite
    const page = await context.newPage();

    try {
        await page.goto(request.url, { timeout: 20000 });

        // ... extract data here

    } finally {
        await page.close();
        await context.close();
        await browser.close();
    }
}
Basic crawler | Crawlee
This is the most bare-bones example of using Crawlee, which demonstrates some of its building blocks such as the BasicCrawler. You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers
genetic-orange
genetic-orange•17mo ago
Yeah, I don't think this is possible with e.g. PlaywrightCrawler, but if there were bigger demand, it could technically be implemented. There is actually an issue for this: https://github.com/apify/crawlee/issues/1822
GitHub
Connect to remote browser services · Issue #1822 · apify/crawlee
Which package is the feature request for? If unsure which one to select, leave blank @crawlee/browser (BrowserCrawler) Feature There are cloud browser services like Browserless. So that we can use ...
wise-white
wise-white•5mo ago
Hi @Lukas Krivka, any feature update on this? I checked the GitHub issue; it is still open. We're building scraper functionality into our AI agent and hoping to use Crawlee for the scraping part, but we require connecting to a remote browser.