Connecting to a remote browser instance?

Is there a way we can specify a web socket endpoint in the PlaywrightCrawler config (or somewhere else) so we can connect to a remote browser?
8 Replies
quickest-silver
quickest-silver•17mo ago
Hi @tim, it looks like the solution is not straightforward. You could try writing your own PlaywrightPlugin, replacing every this.library.launch call with this.library.connectOverCDP('http://hostname:port') (e.g. http://localhost:9222), and then providing it to the PlaywrightCrawler via the browserPool option (check the code of PlaywrightCrawlerOptions for more details).
GitHub
crawlee/packages/playwright-crawler/src/internals/playwright-crawle...
Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. - apify/crawlee
quickest-silver
quickest-silver•17mo ago
BrowserType | Playwright
BrowserType provides methods to launch a specific browser instance or connect to an existing one. The following is a typical example of using Playwright to drive automation:
correct-apricot
correct-apricotOP•17mo ago
Thanks for the response! Unfortunately, when I try that I get an error: Error: browserPoolOptions.browserPlugins is disallowed. Use launchContext.launcher instead.
national-gold
national-gold•17mo ago
Hello! I have the same task with a remote browser. @tim, did you find the solution with launchContext.launcher? Could you share it?
quickest-silver
quickest-silver•17mo ago
I was thinking about another solution: you can create a BasicCrawler and manage the browser and page yourself, for example:
import { chromium } from 'playwright';
import { newInjectedContext } from 'fingerprint-injector';

const BROWSER_URL = 'http://127.0.0.1:9222'; // or something like 'ws://127.0.0.1:36775/devtools/browser/a292f96c-7332-4ce8-82a9-7411f3bd280a'

// ... inside your BasicCrawler
async requestHandler({ request, sendRequest, log }) {
    // Connect to the remote browser
    const browser = await chromium.connectOverCDP(BROWSER_URL);
    const context = await newInjectedContext(browser); // See https://github.com/apify/fingerprint-suite
    const page = await context.newPage();

    try {
        await page.goto(request.url, { timeout: 20000 });

        // ... extract data here

    } finally {
        await page.close();
        await context.close();
        await browser.close();
    }
}
Basic crawler | Crawlee
This is the most bare-bones example of using Crawlee, which demonstrates some of its building blocks such as the BasicCrawler. You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers
genetic-orange
genetic-orange•17mo ago
Yeah, I don't think this is possible with e.g. PlaywrightCrawler, but if there were bigger demand, it could technically be implemented. There is actually an issue for this: https://github.com/apify/crawlee/issues/1822
GitHub
Connect to remote browser services · Issue #1822 · apify/crawlee
Which package is the feature request for? If unsure which one to select, leave blank @crawlee/browser (BrowserCrawler) Feature There are cloud browser services like Browserless. So that we can use ...
wise-white
wise-white•5mo ago
Hi @Lukas Krivka, any feature update on this? I checked the GitHub issue; it is still open. We're building scraper functionality into our AI agent and hoping to use Crawlee for the scraping part, but we require connecting to a remote browser.