New fingerprint per new page in browser-pool
Hi all, I'm trying out crawlee's browser-pool with just one browser (only Puppeteer with Chrome for now), and setting its fingerprintGeneratorOptions (https://crawlee.dev/api/browser-pool#changing-browser-fingerprints-aka-browser-signatures) to multi-OS, multi-browser options, but each new page opened in the browser-pool has the same static fingerprint and headers. How can each new page opened via the browser-pool have a different fingerprint and headers?
flat-fuchsia•3y ago
The fingerprint is always per browser. Unfortunately, doing this per page disables the cache, but you can try the experimental containers option: https://crawlee.dev/api/browser-pool/interface/CreateLaunchContextOptions#experimentalContainers
btw: usually we use the crawlers and only update the browserPoolOptions if needed
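For reference, a minimal sketch of setting fingerprint generator options through a crawler's browserPoolOptions (the option values below are illustrative assumptions, not the original poster's exact config):

```javascript
// Sketch: multi-OS / multi-browser fingerprints via a crawler's browserPoolOptions.
// Option values are illustrative, not a confirmed working configuration.
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome', 'firefox'],
                operatingSystems: ['windows', 'linux', 'macos'],
            },
        },
    },
    async requestHandler({ request, page }) {
        // log the user agent actually presented by this page
        console.log(request.url, await page.evaluate(() => navigator.userAgent));
    },
});

await crawler.run(['https://httpbin.org/get']);
```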
constant-blueOP•3y ago
Thanks Lukas, for my use case I want the browser cache disabled anyway (no cookie, app-data, or cache sharing between pages), so is there still a way to have the fingerprint change per page?
constant-blueOP•3y ago
Can the fingerprint-suite (https://github.com/apify/fingerprint-suite) attach to the browser-pool prePageCreateHooks to achieve a unique fingerprint per page?
manual-pink•3y ago
You can use fingerprint like this:
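A minimal sketch of generating a standalone fingerprint with the fingerprint-generator package (option values are illustrative and may differ from the snippet originally posted here):

```javascript
// Sketch: generate a fingerprint plus matching HTTP headers with fingerprint-generator.
import { FingerprintGenerator } from 'fingerprint-generator';

const generator = new FingerprintGenerator({
    browsers: ['chrome'],
    operatingSystems: ['windows'],
    devices: ['desktop'],
});

// Each call produces a new fingerprint/header pair that can be injected into a page.
const { fingerprint, headers } = generator.getFingerprint();
console.log(headers['user-agent']);
```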
flat-fuchsia•3y ago
If you don't care about the cache, use https://crawlee.dev/api/browser-pool/interface/CreateLaunchContextOptions#useIncognitoPages
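A sketch of a browser pool whose pages each get their own incognito context (assuming the @crawlee/browser-pool API; untested):

```javascript
// Sketch: one incognito browser context per page, so cookies/cache are not shared.
import { BrowserPool, PuppeteerPlugin } from '@crawlee/browser-pool';
import puppeteer from 'puppeteer';

const browserPool = new BrowserPool({
    browserPlugins: [
        new PuppeteerPlugin(puppeteer, { useIncognitoPages: true }),
    ],
});

const page1 = await browserPool.newPage();
const page2 = await browserPool.newPage(); // separate incognito context from page1
```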
rare-sapphire•3y ago
It would be great to have some more information about the experimentalContainers option.
What combination of settings is required for experimentalContainers?
As far as I understand it:
Is it correct?
flat-fuchsia•3y ago
This looks correct.
experimentalContainers is an attempt to have a context per page, like incognito pages, but with the cache enabled across multiple pages. The problem with it is that it is hacky, as it is not supported by the browsers by default.
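A sketch of enabling the experimental containers through a crawler's launch context (experimental, as the option name says, so treat this as illustrative only):

```javascript
// Sketch: enabling experimental per-page containers on a crawler.
// This feature is experimental and hacky; it may not work reliably.
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        experimentalContainers: true,
    },
    async requestHandler({ page }) {
        // each page should now behave like its own container (cookies, proxy, cache)
    },
});
```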
Just keep in mind that if you still use the session pool, it will manage reusing sessions with proxies. By default it starts by accumulating new sessions, so you should see the proxy switch for each page at the start. Without experimentalContainers, the proxy only switches when the whole browser switches, which by default happens after 100 requests or 3 errors.
The cool thing about experimentalContainers is that it unifies the behavior with the Cheerio crawler. We would love to make it the default, but unfortunately it is not working 100% with Chrome.
rare-sapphire•3y ago
Just turned on experimentalContainers and I see this on the console:
Looks like Firefox checks certificates... and Crawlee sees this as "Request without proxy"?
How bad is it?
flat-fuchsia•3y ago
I have no idea what this means, to be honest; the Crawlee team will need to look into it... currently there is probably no capacity in the team to debug experimental containers.
rare-sapphire•3y ago
By the way, just tested PlaywrightCrawler with/without incognito mode and with experimental containers turned on. See here: https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192
manual-pink•3y ago
Thanks for sharing 🙂
constant-blueOP•3y ago
Thanks, useIncognitoPages disabled the cache, but the same fingerprint is still assigned to each page, so I'm guessing it alone won't help with unique fingerprints.
It looks like you're getting new fingerprints per request (i.e. per page) with incognito mode in a crawler, but I can't seem to make it work with a browser-pool instead of a crawler. How can the session settings (useSessionPool and persistCookiesPerSession) be set in a browser-pool to see if that makes a difference?
manual-pink•3y ago
Just curious: how do you test whether different fingerprints are sent per request?
Instead of a browser pool, I wanted to know: can we just use the PM2 load balancer in cluster mode and run the crawler when an API endpoint is hit? Will that work, or will the same Puppeteer crawler instance be shared among all?
rare-sapphire•3y ago
useIncognitoPages disabled the cache, but the same fingerprint is still assigned to each page, so I'm guessing it alone won't help with unique fingerprints
Well, this part is a bit mysterious... but it works )))
@voidmonk - please do YOUR experiment:
- run your program against https://fingerprint.com/demo/ with useIncognitoPages enabled and the Firefox launcher.
- Use different IPs.
- Run in headless mode and create screenshots.
- After this, compare the IDs from fingerprint.com/demo/
What I saw: despite using the same, let's say, "Firefox 107 on Windows" in Crawlee, fingerprint.com/demo/ gives you a DIFFERENT ID each time!!!
So something IS changing when useIncognitoPages is enabled. I do not know what exactly (maybe @Lukas Krivka can explain), but it is enough to fool the bot detector ))) so I did not investigate deeper.
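The suggested experiment could be sketched roughly like this (assuming PlaywrightCrawler with the Firefox launcher; proxy configuration omitted):

```javascript
// Sketch: visit fingerprint.com/demo with incognito pages and screenshot the
// reported visitor ID each time, to compare IDs across runs. Illustrative only.
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        useIncognitoPages: true,
        launchOptions: { headless: true },
    },
    async requestHandler({ page, request }) {
        // give the demo page time to compute its visitor ID
        await page.waitForTimeout(10_000);
        await page.screenshot({ path: `demo-${request.uniqueKey}.png` });
    },
});

// uniqueKey overrides let us enqueue the same URL more than once
await crawler.run([
    { url: 'https://fingerprint.com/demo/', uniqueKey: 'run-1' },
    { url: 'https://fingerprint.com/demo/', uniqueKey: 'run-2' },
]);
```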
manual-pink•3y ago
Thanks for sharing insights 🙂
constant-blueOP•3y ago
For basic fingerprint details (mainly the user-agent and related headers), I make requests to https://httpbin.org/get, which shows the request headers sent, and these fingerprint headers remain the same for all browser-pool requests/pages, with or without useIncognitoPages and useFingerprints set. I think @Lukas Krivka is right in saying that the fingerprint is always per browser, and since I'm only using one browser, the fingerprint remains the same.
manual-pink•3y ago
But you said you are using a browser pool, so multiple browser instances, right? Will each browser have its own fingerprint, or will it get shared among all the instances?
flat-fuchsia•3y ago
@voidmonk Browser pool itself doesn't rotate the browsers; that is what the crawlers are for.
rare-sapphire•3y ago
@voidmonk you are right, the test with https://httpbin.org/get is the best; you can see that the User-Agent is the same all the time, even with useIncognitoPages: true.
My idea of relying on https://fingerprint.com/demo/ (in the other thread) was wrong; fingerprint.com is easy to fool, even with the same User-Agent.
rare-sapphire•3y ago
Nevertheless - you CAN have different fingerprints with a Crawler (PlaywrightCrawler in my case). The key is the retireBrowserAfterPageCount setting!
rare-sapphire•3y ago
This code makes 8 requests to https://httpbin.org/get and uses retireBrowserAfterPageCount: 2.
Check the png files it generates and tell us how many User-Agents you see ))
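The snippet itself isn't shown above; a rough sketch of such a program (an assumed reconstruction, not the poster's exact code) could be:

```javascript
// Sketch: retire each browser after 2 pages so a new fingerprint is generated,
// then screenshot httpbin's echo of our headers. Assumed reconstruction.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        retireBrowserAfterPageCount: 2, // new browser (and fingerprint) every 2 pages
    },
    async requestHandler({ page, request }) {
        await page.screenshot({ path: `ua-${request.uniqueKey}.png` });
    },
});

// 8 visits to the same URL, distinguished by uniqueKey
await crawler.run(
    Array.from({ length: 8 }, (_, i) => ({
        url: 'https://httpbin.org/get',
        uniqueKey: `req-${i}`,
    })),
);
```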
Well, it seems this approach works. But... the side effect might be bad performance...
So either I continue to experiment blindly, or @Lukas Krivka will guide us ))))
I mean - how to have good performance AND different browser fingerprints (per request? every two or three requests?)
rare-sapphire•3y ago
...and one more thing: the documentation at https://crawlee.dev/api/browser-pool says:
"The fingerprints are by default tied to the respective proxy urls to not use the same unique fingerprint from various IP addresses. You can disable this behavior in the fingerprintOptions."
Well, if I want to "disable this behavior in the fingerprintOptions" - how can I do this? I cannot find it in the documentation...
flat-fuchsia•3y ago
@new_in_town @Lukas Krivka will not guide you, unfortunately, because this is a bit out of my expertise. We need to wait for the Crawlee gods to come down to us mortals and bring the revelation.
We are figuring out some bot integration with GitHub so these conversations can be easily referenced there. There are a lot of great points from you guys.
@new_in_town Why would you need to rotate the fingerprint so often? Think of it like a regular user; they will also go through more than 2 pages. The default rotation is 100 requests, but you can go down to 10-20.
rare-sapphire•3y ago
Why would you need to rotate fingerprint so often?
Because this is my use case: open a page from a random IP with a random User-Agent/fingerprint, and that's it! The next time I visit the site, it should be a DIFFERENT IP and a different User-Agent/fingerprint. That's it. Yes, there is a pause between my requests; I do not overload the target site.
constant-blueOP•3y ago
I was finally able to set unique fingerprints per request from a browser pool (note: I'm not testing crawlers), by following this official repo guide, 'Advanced usage with the Browser pool hooks system': https://github.com/apify/fingerprint-injector#advanced-usage-with-the-browser-pool-hooks-system
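Based on that guide, the hook wiring looks roughly like this (a sketch assuming a fresh fingerprint is generated and injected in a postPageCreateHook; not verified against the exact README example):

```javascript
// Sketch: unique fingerprint per page via browser-pool hooks + fingerprint-injector.
import { BrowserPool, PuppeteerPlugin } from '@crawlee/browser-pool';
import puppeteer from 'puppeteer';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

const fingerprintGenerator = new FingerprintGenerator({
    browsers: ['chrome'],
    operatingSystems: ['windows', 'linux', 'macos'],
});
const fingerprintInjector = new FingerprintInjector();

const browserPool = new BrowserPool({
    // incognito pages keep injected fingerprints from leaking between pages
    browserPlugins: [new PuppeteerPlugin(puppeteer, { useIncognitoPages: true })],
    postPageCreateHooks: [
        async (page) => {
            // a fresh fingerprint (and matching headers) for every new page
            const fingerprintWithHeaders = fingerprintGenerator.getFingerprint();
            await fingerprintInjector.attachFingerprintToPuppeteer(page, fingerprintWithHeaders);
        },
    ],
});

const page = await browserPool.newPage();
await page.goto('https://httpbin.org/get');
```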
manual-pink•3y ago
Where do you deploy these crawlers?
rare-sapphire•3y ago
... to set unique fingerprints per request from a browser pool...
Excellent!!! @voidmonk -- would you share an example program doing this? Without crawlers is fine... By the way: in case somebody is implementing scraping and uses an external message queue like beanstalkd, I can answer questions (this was discussed here: https://discord.com/channels/801163717915574323/1056348705407651941/1059051356171804682 - but there are a lot of interesting and non-obvious details).
Did you manage to bypass fingerprint.com detection?