Crawlee vs bot detection systems - Plugins length is not OK
I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots).
In all cases these sites complain about "0 plugins" or "Plugins length".
If I open these sites with the browser I use every day (Firefox on Linux, the same browser as in my PlaywrightCrawler settings), they report "5 plugins" and the field is green.
Is it something in my code?
Can Crawlee emulate these plugin attributes?
[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html
and here - part of the PlaywrightCrawler:
Screenshots:



36 Replies
xenial-black•3y ago
On https://bot.sannysoft.com/
It's OK with my code (different from yours). I get this:
What URL do you use to test?

fair-roseOP•3y ago
attached - better screenshot [3] from https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html

fair-roseOP•3y ago
please test your program with these three URLs:
[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html
fair-roseOP•3y ago
Well, here is the code I used to get the "Plugins length" error:
What I want to achieve: a scraper that shows no "red flags" on bot detection systems like the three sites above AND passes this check: https://nowsecure.nl/ (as far as I understand, nowsecure.nl implements a variant of Cloudflare protection).
I'm using Firefox as the launcher; it seems I can pass the nowsecure.nl check only with Firefox.
inland-turquoise•3y ago
Thanks for this. Can you try with the session pool on? I'm not sure whether anything is bound to that.
@petrpatek. please look into this
fair-roseOP•3y ago
just changed useSessionPool to:
same thing - "Plugins Length: 0"
xenial-black•3y ago
@Lukas Krivka With chromium instead of firefox as the launcher, there is no "Plugins length" error.
xenial-black•3y ago
With Firefox as the launcher, I use this hook with fingerprint-injector & Playwright [1].
With it, there are no more "Plugins length" errors.
[1] https://github.com/apify/fingerprint-suite/blob/master/docs/guides/fingerprint-injector.md#usage-with-playwright
fair-roseOP•3y ago
Great, so this can be fixed!
But for somebody who is new to JS/TS (like me)... it would be better to have some example code starting with
it is possible, isn't it?
xenial-black•3y ago
Yes, it's up to you to do the job 😉
fair-roseOP•3y ago
@LeMoussel - many thanks for the code!!!
It works, it really works!!!
Even with my ugly JS code (please suggest how to improve it) -- it works!!!
I put the JS code creating plugins in the preNavigationHooks - not sure this is the optimal solution...
inland-turquoise•3y ago
@LeMoussel Thanks for the debugging. @petrpatek. and will eventually check this and see how it can best be implemented in Crawlee
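The code that was attached to the messages above is not preserved in this thread. As a rough illustration of the idea (JS code that creates a plugins list and is run from a hook), here is a minimal sketch. It is not the original code: the function name `patchPlugins` and the plugin names are assumptions chosen for illustration, and a plain object stands in for `navigator` so the snippet runs outside a browser.

```javascript
// Sketch: install a plugin-like list on a navigator-like object.
// Kept as one self-contained function so its source could be injected into a page.
function patchPlugins(nav) {
  // Illustrative plugin names (assumption, not taken from the thread).
  const names = [
    'PDF Viewer',
    'Chrome PDF Viewer',
    'Chromium PDF Viewer',
    'Microsoft Edge PDF Viewer',
    'WebKit built-in PDF',
  ];
  const plugins = names.map((name) => ({
    name,
    filename: 'internal-pdf-viewer',
    description: 'Portable Document Format',
  }));
  // Mimic the lookup helpers of a real PluginArray.
  plugins.item = (i) => plugins[i] || null;
  plugins.namedItem = (n) => plugins.find((p) => p.name === n) || null;
  // A getter defined on the object itself shadows the prototype's value.
  Object.defineProperty(nav, 'plugins', { get: () => plugins, configurable: true });
  return nav;
}

// Demo on a plain object standing in for `navigator`.
const fakeNavigator = patchPlugins({});
console.log(fakeNavigator.plugins.length); // prints 5
```

In a Crawlee hook this could presumably be injected with Playwright's `page.addInitScript`, passing the function source as a string such as `(${patchPlugins})(navigator)`; that wiring is an assumption and is not tested here. A plain array like this only satisfies length checks; a test such as `navigator.plugins instanceof PluginArray` would still fail.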
fair-roseOP•3y ago
by the way - when fixing "plugin length" - please also fix "0 mime types".
Several sites are checking "mime types length":
https://infosimples.github.io/detect-headless/ (under "Mime")
https://browserleaks.com/javascript (search for "mimeTypes")
attached - screenshot from https://browserleaks.com/javascript - made with code above, you can see "mimeTypes: 0"

xenial-black•3y ago
@new_in_town You can do it with this:
attached - screenshot from https://browserleaks.com/javascript - made with code above, you can see
mimeTypes: text/pdf, pdf, Portable Document Format

fair-roseOP•3y ago
works like a charm!
thanks @LeMoussel !!!
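The snippet that fixed "mimeTypes" is not preserved here either. A minimal sketch of the same kind of override might look like the following. The function name `patchMimeTypes` is hypothetical; the `text/pdf` entry mirrors the value shown in the screenshot description above, and the `application/pdf` entry is an added assumption.

```javascript
// Sketch: install a mimeTypes-like list on a navigator-like object.
function patchMimeTypes(nav) {
  const mimeTypes = [
    // Assumed extra entry, not taken from the thread.
    { type: 'application/pdf', suffixes: 'pdf', description: 'Portable Document Format' },
    // Matches the values reported in the thread's screenshot description.
    { type: 'text/pdf', suffixes: 'pdf', description: 'Portable Document Format' },
  ];
  // Mimic the lookup helpers of a real MimeTypeArray.
  mimeTypes.item = (i) => mimeTypes[i] || null;
  mimeTypes.namedItem = (t) => mimeTypes.find((m) => m.type === t) || null;
  Object.defineProperty(nav, 'mimeTypes', { get: () => mimeTypes, configurable: true });
  return nav;
}

// Demo on a plain object standing in for `navigator`.
const navWithMimeTypes = patchMimeTypes({});
console.log(navWithMimeTypes.mimeTypes.map((m) => m.type).join(', '));
// prints "application/pdf, text/pdf"
```

As with the plugins sketch, this only makes length and property checks pass; it does not reproduce the real browser object types.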
other-emerald•3y ago
Just curious: how are you generating those plugins? I am using Puppeteer but getting failed checks in the bot tests. See screenshot
other-emerald•3y ago

xenial-black•3y ago
With preNavigationHooks [1]. It is in BrowserCrawlerOptions, so it can be used with Puppeteer.
See the example of @new_in_town: https://discord.com/channels/801163717915574323/1059483872271798333/1060197941404508220
[1] https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#preNavigationHooks
fair-roseOP•3y ago
what is interesting: in some cases this code should be in preLaunchHooks and in some cases - in prePageCreateHooks
do not ask me what happens there, I just played a bit ))))
Anyway, attached is my super-mega-PlaywrightCrawler ))) producing 1km of logs (printf debugging, yes) but demonstrating green results for "plugin length" and "mimeTypes"
other-emerald•3y ago
Thanks a lot. I was able to make it work using Puppeteer.
code:
preload.js:
other-emerald•3y ago
I ran your script locally with proxy servers, but I still see these red flags. Any idea how you resolve them? I am trying to figure out the same thing.

other-emerald•3y ago

fair-roseOP•3y ago
Well, this JS code:
https://discord.com/channels/801163717915574323/1059483872271798333/1060501044456607774
is fixing only "Plugin length" and "Mime types".
Nothing else.
other-emerald•3y ago
I was able to resolve all the bot checks using this plugin: https://discord.com/channels/801163717915574323/1051917834290200608/1052147143508500490
Only webdriver in the fingerprint tests and the hairline feature test failed; all the rest passed.
fair-roseOP•3y ago
Well... actually, the code attached to this message https://discord.com/channels/801163717915574323/1059483872271798333/1060959263641567354
has a green "webdriver" flag, and many other bot checks are also green
Yes, hairline feature... can we ignore it?
other-emerald•3y ago
I am not sure about the hairline feature, but in the many YouTube videos and few blogs I have seen, most of them ignore it
xenial-black•3y ago
@Adi The code provided at https://intoli.com/blog/making-chrome-headless-undetectable/, which looks as follows:
returns the desired values for the renderer and vendor, like this
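The code quoted from the article and the screenshot are not preserved in this thread. The general technique it refers to, wrapping `getParameter` so that the unmasked vendor and renderer queries return chosen strings, can be sketched as follows. This is an independent sketch, not the article's code: `patchGetParameter` and `FakeContext` are hypothetical names, `FakeContext` stands in for `WebGLRenderingContext` so the snippet runs outside a browser, and the vendor/renderer strings are example values.

```javascript
// Sketch: wrap getParameter so two WebGL debug queries return chosen strings.
function patchGetParameter(proto, vendor, renderer) {
  const UNMASKED_VENDOR_WEBGL = 37445; // 0x9245
  const UNMASKED_RENDERER_WEBGL = 37446; // 0x9246
  const original = proto.getParameter;
  proto.getParameter = function (parameter) {
    if (parameter === UNMASKED_VENDOR_WEBGL) return vendor;
    if (parameter === UNMASKED_RENDERER_WEBGL) return renderer;
    // Everything else falls through to the real implementation.
    return original.call(this, parameter);
  };
}

// Stand-in for WebGLRenderingContext (assumption for the demo).
class FakeContext {
  getParameter(parameter) {
    return `real:${parameter}`;
  }
}

patchGetParameter(
  FakeContext.prototype,
  'Intel Open Source Technology Center', // example vendor string
  'Mesa DRI Intel(R) Ivybridge Mobile', // example renderer string
);

const ctx = new FakeContext();
console.log(ctx.getParameter(37445)); // prints the example vendor string
console.log(ctx.getParameter(7936)); // prints "real:7936"
```

In a browser the same wrapper would presumably be applied to `WebGLRenderingContext.prototype` (and `WebGL2RenderingContext.prototype`, if present) before any page script runs.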

xenial-black•3y ago
And as indicated in the article, you can also set the Retina/HiDPI Hairline Feature.
But as mentioned, "This is another test that doesn’t really make a ton of sense because the majority of people don’t have HiDPI screens and most users’ browsers won’t support this feature."
fair-roseOP•3y ago
const webGLContent = ...
Excellent! What we really need is a list of 100-200 such strings and a piece of JS code that randomly returns a "webGL string"... (in other words, this functionality should be in the next version of Crawlee)
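The random selection suggested above could be sketched like this. The list `WEBGL_PROFILES`, its placeholder entries, and the function `pickWebGLProfile` are all hypothetical; a real list would need vendor/renderer pairs collected from real browsers.

```javascript
// Hypothetical list; the "Example ..." entries are placeholders, not real GPU strings.
const WEBGL_PROFILES = [
  { vendor: 'Intel Open Source Technology Center', renderer: 'Mesa DRI Intel(R) Ivybridge Mobile' },
  { vendor: 'Example Vendor A', renderer: 'Example Renderer A' },
  { vendor: 'Example Vendor B', renderer: 'Example Renderer B' },
];

// Pick one profile at random. The random source is injectable so the
// choice can be made deterministic in tests.
function pickWebGLProfile(profiles, random = Math.random) {
  const index = Math.floor(random() * profiles.length);
  return profiles[index];
}

const picked = pickWebGLProfile(WEBGL_PROFILES);
console.log(`${picked.vendor} / ${picked.renderer}`);
```

Vendor and renderer are stored and picked as a pair on purpose: choosing them independently could produce combinations that no real machine reports.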
other-emerald•3y ago
Thanks a lot for sharing 🙂
inland-turquoise•3y ago
Great research, guys. Once our team gets more time, we will make sure all of this is implemented by default in Crawlee
fair-roseOP•3y ago
any news about this plugin problem?
Hi @new_in_town There is currently PR https://github.com/apify/fingerprint-suite/pull/141 for this.
I am sorry, wrong thread; this one is for https://discord.com/channels/801163717915574323/1059916802446073957