New fingerprint per new page in browser-pool
Hi all, I'm trying out crawlee's browser-pool with just one browser (only Puppeteer with Chrome for now), and setting its fingerprintGeneratorOptions (https://crawlee.dev/api/browser-pool#changing-browser-fingerprints-aka-browser-signatures) to multi-OS, multi-browser options, but each new page opened in the browser-pool has the same static fingerprint and headers. How can each new page opened via the browser-pool have a different fingerprint and headers?
flat-fuchsia•3y ago
The fingerprint is always per browser. Unfortunately, doing this per page disables the cache, but you can try the experimental containers option: https://crawlee.dev/api/browser-pool/interface/CreateLaunchContextOptions#experimentalContainers
btw: usually we use the crawlers and only update the browserPoolOptions if needed
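For reference, a minimal sketch of setting fingerprint generator options through a crawler's browserPoolOptions (the option values below are illustrative assumptions, not the original poster's exact config):

```javascript
// Sketch: multi-OS / multi-browser fingerprints via a crawler's browserPoolOptions.
// Option values are illustrative, not a confirmed working configuration.
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome', 'firefox'],
                operatingSystems: ['windows', 'linux', 'macos'],
            },
        },
    },
    async requestHandler({ request, page }) {
        // log the user agent actually presented by this page
        console.log(request.url, await page.evaluate(() => navigator.userAgent));
    },
});

await crawler.run(['https://httpbin.org/get']);
```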
constant-blueOP•3y ago
Thanks Lukas, for my use case I want the browser cache disabled anyway (no cookie, app-data, or cache sharing between pages), so is there still a way to have the fingerprint change per page?
constant-blueOP•3y ago
Can the fingerprint-suite (https://github.com/apify/fingerprint-suite) attach to the browser-pool prePageCreateHooks to achieve a unique fingerprint per page?
manual-pink•3y ago
You can use fingerprint like this:
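A minimal sketch of generating a standalone fingerprint with the fingerprint-generator package (option values are illustrative and may differ from the snippet originally posted here):

```javascript
// Sketch: generate a fingerprint plus matching HTTP headers with fingerprint-generator.
import { FingerprintGenerator } from 'fingerprint-generator';

const generator = new FingerprintGenerator({
    browsers: ['chrome'],
    operatingSystems: ['windows'],
    devices: ['desktop'],
});

// Each call produces a new fingerprint/header pair that can be injected into a page.
const { fingerprint, headers } = generator.getFingerprint();
console.log(headers['user-agent']);
```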
flat-fuchsia•3y ago
If you don't care about the cache, use https://crawlee.dev/api/browser-pool/interface/CreateLaunchContextOptions#useIncognitoPages
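A sketch of a browser pool whose pages each get their own incognito context (assuming the @crawlee/browser-pool API; untested):

```javascript
// Sketch: one incognito browser context per page, so cookies/cache are not shared.
import { BrowserPool, PuppeteerPlugin } from '@crawlee/browser-pool';
import puppeteer from 'puppeteer';

const browserPool = new BrowserPool({
    browserPlugins: [
        new PuppeteerPlugin(puppeteer, { useIncognitoPages: true }),
    ],
});

const page1 = await browserPool.newPage();
const page2 = await browserPool.newPage(); // separate incognito context from page1
```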
rare-sapphire•3y ago
It would be great to have some more information about the experimentalContainers option.
What combination of settings is required for experimentalContainers?
As far as I understand it:
Is it correct?
flat-fuchsia•3y ago
This looks correct.
experimentalContainers is an attempt to have a context per page, like incognito pages, but with the cache enabled across multiple pages. The problem with it is that it is hacky, as it is not supported by the browsers by default.
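A sketch of enabling the experimental containers through a crawler's launch context (experimental, as the option name says, so treat this as illustrative only):

```javascript
// Sketch: enabling experimental per-page containers on a crawler.
// This feature is experimental and hacky; it may not work reliably.
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        experimentalContainers: true,
    },
    async requestHandler({ page }) {
        // each page should now behave like its own container (cookies, proxy, cache)
    },
});
```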
Just keep in mind that if you still use the session pool, it will manage reusing sessions with proxies. By default it starts by accumulating new sessions, so you should see the proxy switch for each page at the start. Without experimentalContainers, the proxy only switches when the whole browser switches, which by default happens after 100 requests or 3 errors.
The cool thing about experimentalContainers is that it unifies the behavior with the Cheerio crawler. We would love to make it the default, but unfortunately it is not working 100% with Chrome.
rare-sapphire•3y ago
Just turned on experimentalContainers and I see this on the console:
Looks like Firefox checks certificates... and Crawlee sees this as "Request without proxy"?
How bad is it?
flat-fuchsia•3y ago
I have no idea what this means, to be honest; the Crawlee team will need to look into it... currently there is probably no capacity in the team to debug experimental containers.
rare-sapphire•3y ago
By the way, just tested PlaywrightCrawler with/without incognito mode and with experimental containers turned on. See here: https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192
manual-pink•3y ago
Thanks for sharing 🙂
constant-blueOP•3y ago
Thanks, useIncognitoPages disabled the cache, but the same fingerprint is still assigned to each page, so I'm guessing it alone won't help with unique fingerprints.
It looks like you're getting new fingerprints per request (i.e. per page) with incognito mode in a crawler, but I can't seem to make it work with a browser-pool instead of a crawler. How can the session settings (useSessionPool and persistCookiesPerSession) be set in a browser-pool to see if that makes a difference?
manual-pink•3y ago
Just curious: how do you test whether different fingerprints are sent per request?
Instead of a browser pool, I wanted to know: can we just use the PM2 load balancer in cluster mode and run the crawler when an API endpoint is hit? Will that work, or will the same Puppeteer crawler instance be shared among all?
rare-sapphire•3y ago
useIncognitoPages disabled the cache, but the same fingerprint is still assigned to each page, so I'm guessing it alone won't help with unique fingerprints
Well, this part is a bit mysterious... but it works )))
@voidmonk - please do YOUR experiment:
- run your program against https://fingerprint.com/demo/ with useIncognitoPages enabled and the Firefox launcher.
- Use different IPs.
- Run in headless mode and create screenshots.
- After this, compare the IDs from fingerprint.com/demo/
What I saw: despite using the same, let's say, "Firefox 107 on Windows" in Crawlee, fingerprint.com/demo/ gives you a DIFFERENT ID each time!!!
So something IS changing when useIncognitoPages is enabled. I do not know what exactly (maybe @Lukas Krivka can explain), but it is enough to fool the bot detector ))) so I did not investigate deeper.
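The suggested experiment could be sketched roughly like this (assuming PlaywrightCrawler with the Firefox launcher; proxy configuration omitted):

```javascript
// Sketch: visit fingerprint.com/demo with incognito pages and screenshot the
// reported visitor ID each time, to compare IDs across runs. Illustrative only.
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        useIncognitoPages: true,
        launchOptions: { headless: true },
    },
    async requestHandler({ page, request }) {
        // give the demo page time to compute its visitor ID
        await page.waitForTimeout(10_000);
        await page.screenshot({ path: `demo-${request.uniqueKey}.png` });
    },
});

// uniqueKey overrides let us enqueue the same URL more than once
await crawler.run([
    { url: 'https://fingerprint.com/demo/', uniqueKey: 'run-1' },
    { url: 'https://fingerprint.com/demo/', uniqueKey: 'run-2' },
]);
```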
manual-pink•3y ago
Thanks for sharing insights 🙂
constant-blueOP•3y ago
For basic fingerprint details (mainly the user-agent and related headers), I make requests to https://httpbin.org/get, which shows the request headers sent, and these fingerprint headers remain the same for all browser-pool requests/pages, with or without useIncognitoPages and useFingerprints set. I think @Lukas Krivka is right in saying that the fingerprint is always per browser, and since I'm only using one browser, the fingerprint remains the same.
manual-pink•3y ago
But you said you are using a browser pool, so multiple browser instances, right? Will each browser have its own fingerprint, or will it get shared among all the instances?
flat-fuchsia•3y ago
@voidmonk Browser pool itself doesn't rotate the browsers; that is what the crawlers are for.
rare-sapphire•3y ago
@voidmonk you are right, the test with https://httpbin.org/get is the best; you can see that the User-Agent is the same all the time, even with useIncognitoPages: true.
My idea of relying on https://fingerprint.com/demo/ (in the other thread) was wrong; fingerprint.com is easy to fool, even with the same User-Agent.
rare-sapphire•3y ago
Nevertheless - you CAN have different fingerprints with a Crawler (PlaywrightCrawler in my case). The key is the retireBrowserAfterPageCount setting!
rare-sapphire•3y ago
This code makes 8 requests to https://httpbin.org/get and uses retireBrowserAfterPageCount: 2.
Check the png files it generates and tell us how many User-Agents you see ))
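The snippet itself isn't shown above; a rough sketch of such a program (an assumed reconstruction, not the poster's exact code) could be:

```javascript
// Sketch: retire each browser after 2 pages so a new fingerprint is generated,
// then screenshot httpbin's echo of our headers. Assumed reconstruction.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        retireBrowserAfterPageCount: 2, // new browser (and fingerprint) every 2 pages
    },
    async requestHandler({ page, request }) {
        await page.screenshot({ path: `ua-${request.uniqueKey}.png` });
    },
});

// 8 visits to the same URL, distinguished by uniqueKey
await crawler.run(
    Array.from({ length: 8 }, (_, i) => ({
        url: 'https://httpbin.org/get',
        uniqueKey: `req-${i}`,
    })),
);
```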
Well, it seems this approach works. But... the side effect might be bad performance...
So either I continue to experiment blindly, or @Lukas Krivka will guide us ))))
I mean - how to have good performance AND different browser fingerprints (per request? every two or three requests?)
rare-sapphire•3y ago
...and one more thing: the documentation at https://crawlee.dev/api/browser-pool says:
"The fingerprints are by default tied to the respective proxy urls to not use the same unique fingerprint from various IP addresses. You can disable this behavior in the fingerprintOptions."
Well, if I want to "disable this behavior in the fingerprintOptions" - how can I do this? I cannot find it in the documentation...
flat-fuchsia•3y ago
@new_in_town @Lukas Krivka will not guide you, unfortunately, because this is a bit out of my expertise. We need to wait for the Crawlee gods to come down to us mortals and bring the revelation.
We are figuring out some bot integration with GitHub so these conversations can be easily referenced there. There are a lot of great points from you guys.
@new_in_town Why would you need to rotate the fingerprint so often? Think of it like a regular user; they will also go through more than 2 pages. The default rotation is 100 requests, but you can go down to 10-20.
rare-sapphire•3y ago
Why would you need to rotate fingerprint so often?
Because this is my use case: open a page from a random IP with a random User-Agent/fingerprint, and that's it! The next time I visit the site, it should be a DIFFERENT IP and a different User-Agent/fingerprint. That's it. Yes, there is a pause between my requests; I do not overload the target site.
constant-blueOP•3y ago
I was finally able to set unique fingerprints per request from a browser pool (note: I'm not testing crawlers), by following this official repo guide, 'Advanced usage with the Browser pool hooks system': https://github.com/apify/fingerprint-injector#advanced-usage-with-the-browser-pool-hooks-system
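Based on that guide, the hook wiring looks roughly like this (a sketch assuming a fresh fingerprint is generated and injected in a postPageCreateHook; not verified against the exact README example):

```javascript
// Sketch: unique fingerprint per page via browser-pool hooks + fingerprint-injector.
import { BrowserPool, PuppeteerPlugin } from '@crawlee/browser-pool';
import puppeteer from 'puppeteer';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

const fingerprintGenerator = new FingerprintGenerator({
    browsers: ['chrome'],
    operatingSystems: ['windows', 'linux', 'macos'],
});
const fingerprintInjector = new FingerprintInjector();

const browserPool = new BrowserPool({
    // incognito pages keep injected fingerprints from leaking between pages
    browserPlugins: [new PuppeteerPlugin(puppeteer, { useIncognitoPages: true })],
    postPageCreateHooks: [
        async (page) => {
            // a fresh fingerprint (and matching headers) for every new page
            const fingerprintWithHeaders = fingerprintGenerator.getFingerprint();
            await fingerprintInjector.attachFingerprintToPuppeteer(page, fingerprintWithHeaders);
        },
    ],
});

const page = await browserPool.newPage();
await page.goto('https://httpbin.org/get');
```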
manual-pink•3y ago
Where do you deploy these crawlers?
rare-sapphire•3y ago
... to set unique fingerprints per request from a browser pool...
Excellent!!! @voidmonk -- would you share an example program doing this? Without crawlers is fine... By the way: in case somebody is implementing scraping and uses an external message queue like beanstalkd, I can answer questions (this was discussed here: https://discord.com/channels/801163717915574323/1056348705407651941/1059051356171804682 - but there are a lot of interesting and non-obvious details).
Did you manage to bypass fingerprint.com detection?