Crawlee vs bot detection systems - Plugins length is not OK

I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots). In all cases these sites complain about "0 plugins" or "Plugins length". If I open these sites with the browser I use every day (Firefox on Linux, by the way, the same as in the PlaywrightCrawler settings), they say "5 plugins" and the field is green. Is it something in my code? Can Crawlee emulate these plugin attributes?
[1] https://infosimples.github.io/detect-headless/
[2] https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html
And here is the relevant part of the PlaywrightCrawler setup:
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,

        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        launcher: firefox
    },

});
Screenshots: [three attached images]
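For checking results without a screenshot, the two values these pages flag can be read directly. This is a minimal sketch, assuming (as the Intoli article describes) that the tests simply look for a zero length on navigator.plugins and navigator.mimeTypes. `readPluginSignals` is a hypothetical helper name, and other sites may check more than this.

```javascript
// Hypothetical helper: summarises the two values the detection pages report.
// `nav` is anything shaped like the browser's `navigator` object.
function readPluginSignals(nav) {
    const pluginsLength = nav.plugins ? nav.plugins.length : 0;
    const mimeTypesLength = nav.mimeTypes ? nav.mimeTypes.length : 0;
    return {
        pluginsLength,
        mimeTypesLength,
        // Assumption: a length of 0 is what gets flagged as "headless".
        looksHeadless: pluginsLength === 0 || mimeTypesLength === 0,
    };
}

// Demo with plain objects standing in for `navigator`.
console.log(readPluginSignals({ plugins: [], mimeTypes: [] }));
console.log(readPluginSignals({ plugins: new Array(5), mimeTypes: new Array(2) }));
```

Inside a requestHandler the same values could be read from the real page, for example with `await page.evaluate(() => navigator.plugins.length)`, and written to the log.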
36 Replies
xenial-black
xenial-black•3y ago
On https://bot.sannysoft.com/ it's OK with my code (different from yours). I get this: [attached screenshot]. What URL do you use to test?
fair-rose
fair-roseOP•3y ago
Well, here is the code I used to get the "Plugins length" error:
import { firefox, webkit } from 'playwright';
import { PlaywrightCrawler, Dataset, ProxyConfiguration, Request, log, sleep } from 'crawlee';
import { launchPlaywright, playwrightUtils } from 'crawlee';
import * as crypt from 'crypto';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 2,
        maxConcurrency: 4,
        loggingIntervalSecs: null,
    },

    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: true,

    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        useIncognitoPages: true,
        launcher: firefox
    },

    async requestHandler({ request, response, page, enqueueLinks, log, proxyInfo }) {
        const uniqueKey = crypt.randomBytes(16).toString("hex");
        let url = new URL(request.url);
        let host = url.host;
        let scrFile = `${host}-${uniqueKey}.png`;

        log.info(`GET ${request.url} Wait1 ...`);
        await sleep(40 * 1000);

        log.info(`GET ${request.url} Wait2, Pressing Enter ...`);
        await page.keyboard.press('Enter');
        await sleep(40 * 1000);

        log.info(`GET ${request.url} Writing into ${scrFile} ...`);
        await page.screenshot({ path: scrFile, fullPage: true });
        log.info(`GET ${request.url} DONE`);
    },
});

await crawler.run([
    "https://infosimples.github.io/detect-headless/",
    "https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html",
    "https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html"
]);
What I want to achieve: a scraper that raises no "red flags" on bot detection systems like the three sites above AND passes this check: https://nowsecure.nl/ (as far as I understand, nowsecure.nl implements a variant of Cloudflare protection). I'm using Firefox as the launcher because it seems I can pass the nowsecure.nl check only with Firefox.
inland-turquoise
inland-turquoise•3y ago
Thanks for this. Can you try with the session pool on? I'm not sure whether something is bound to it. @petrpatek. please look into this.
fair-rose
fair-roseOP•3y ago
I just changed useSessionPool to:
useSessionPool: true,
Same thing: "Plugins Length: 0".
xenial-black
xenial-black•3y ago
@Lukas Krivka With chromium instead of firefox as the launcher, there is no "Plugins length" error.
xenial-black
xenial-black•3y ago
I use this hook, with Firefox as the launcher, together with fingerprint-injector & Playwright [1]. With it, there are no more "Plugins length" errors. [1] https://github.com/apify/fingerprint-suite/blob/master/docs/guides/fingerprint-injector.md#usage-with-playwright
fair-rose
fair-roseOP•3y ago
Great, so this can be fixed! But for somebody who is new to JS/TS (like me), it would be better to have some example code starting with
crawler = new PlaywrightCrawler({
    ...
});
It is possible, isn't it?
xenial-black
xenial-black•3y ago
Yes, it's up to you to do the job 😉
fair-rose
fair-roseOP•3y ago
@LeMoussel - many thanks for the code!!! It works, it really works!!! Even with my ugly JS code (please suggest how to improve it) -- it works!!! I put the JS code creating plugins in the preNavigationHooks - not sure this is the optimal solution...
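For readers wondering where such a script goes, here is a minimal sketch of a pre-navigation hook, assuming the override script is available as a string. `makeInitScriptHook` and the placeholder script are hypothetical names used for illustration; `page.addInitScript` is Playwright's call for running a script before the page's own scripts. The fake page at the end only demonstrates the call and is not part of Crawlee.

```javascript
// Hypothetical factory: returns a hook that registers `scriptContent`
// to run in every new document before the page's own scripts.
function makeInitScriptHook(scriptContent) {
    return async ({ page }) => {
        await page.addInitScript({ content: scriptContent });
    };
}

// Placeholder script; the real override code would go here.
const exampleScript = "window.__exampleInjected = true;";
const injectHook = makeInitScriptHook(exampleScript);

// Stand-in for a Playwright page, used only to show what the hook does.
const registered = [];
const fakePage = {
    addInitScript: async (script) => { registered.push(script.content); },
};
injectHook({ page: fakePage }).then(() => {
    console.log(`registered ${registered.length} init script(s)`);
});
```

In a crawler, the hook would be passed in the options as `preNavigationHooks: [injectHook]`. Whether this is the best place compared with a browser-pool hook is not settled in this thread.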
inland-turquoise
inland-turquoise•3y ago
@LeMoussel Thanks for the debugging. @petrpatek. will eventually check this and see how best to implement it in Crawlee.
fair-rose
fair-roseOP•3y ago
By the way, when fixing "plugin length", please also fix "0 mime types". Several sites check the "mime types length":
https://infosimples.github.io/detect-headless/ (under "Mime")
https://browserleaks.com/javascript (search for "mimeTypes")
Attached: a screenshot from https://browserleaks.com/javascript made with the code above; you can see "mimeTypes: 0".
xenial-black
xenial-black•3y ago
@new_in_town You can do it with this:
const pluginContent = `
    Object.defineProperty(navigator, 'plugins', {
        get: () => {
            const PDFPlugin = Object.create(Plugin.prototype, {
                description: { value: 'Portable Document Format', enumerable: false },
                filename: { value: 'internal-pdf-viewer', enumerable: false },
                name: { value: 'PDF Plugin', enumerable: false },
            });
            return Object.create(PluginArray.prototype, {
                length: { value: 1 },
                0: { value: PDFPlugin },
            });
        },
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: () => {
            const PDFMimeTypeTxt = Object.create(MimeType.prototype, {
                type: { value: 'text/pdf', enumerable: false },
                suffixes: { value: 'pdf', enumerable: false },
                description: { value: 'Portable Document Format', enumerable: false },
                enabledPlugin: { value: 'PDF Plugin', enumerable: false },
            });
            return Object.create(MimeTypeArray.prototype, {
                length: { value: 1 },
                0: { value: PDFMimeTypeTxt },
            });
        },
    });
`
Attached: a screenshot from https://browserleaks.com/javascript made with the code above; you can see mimeTypes: text/pdf, pdf, Portable Document Format.
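The snippet above reports a single plugin, while the desktop Firefox in the first post shows 5. Below is a sketch of a generator for a longer list, assuming the five plugin names and two MIME types that the HTML Standard hard-codes for browsers with a built-in PDF viewer. `buildPluginScript` is a hypothetical helper, and this thread does not establish whether detection sites compare the exact names.

```javascript
// Names and MIME types the HTML Standard lists for built-in PDF viewers.
const PDF_PLUGIN_NAMES = [
    'PDF Viewer',
    'Chrome PDF Viewer',
    'Chromium PDF Viewer',
    'Microsoft Edge PDF Viewer',
    'WebKit built-in PDF',
];
const PDF_MIME_TYPES = ['application/pdf', 'text/pdf'];

// Hypothetical helper: returns browser-side source code as a string.
function buildPluginScript(pluginNames = PDF_PLUGIN_NAMES, mimeTypes = PDF_MIME_TYPES) {
    return `
(() => {
    const hidden = (value) => ({ value, enumerable: false });
    const plugins = ${JSON.stringify(pluginNames)}.map((name) =>
        Object.create(Plugin.prototype, {
            name: hidden(name),
            description: hidden('Portable Document Format'),
            filename: hidden('internal-pdf-viewer'),
        }));
    const mimes = ${JSON.stringify(mimeTypes)}.map((type) =>
        Object.create(MimeType.prototype, {
            type: hidden(type),
            suffixes: hidden('pdf'),
            description: hidden('Portable Document Format'),
            // Point at a plugin object rather than a plain string.
            enabledPlugin: hidden(plugins[0]),
        }));
    // Build an array-like object with numeric keys and a length.
    const toArrayLike = (proto, items) => {
        const props = { length: { value: items.length } };
        items.forEach((item, i) => { props[i] = { value: item }; });
        return Object.create(proto, props);
    };
    const pluginArray = toArrayLike(PluginArray.prototype, plugins);
    const mimeTypeArray = toArrayLike(MimeTypeArray.prototype, mimes);
    Object.defineProperty(navigator, 'plugins', { get: () => pluginArray });
    Object.defineProperty(navigator, 'mimeTypes', { get: () => mimeTypeArray });
})();
`;
}
```

The returned string could be passed to `page.addInitScript({ content: buildPluginScript() })` in the same way as `pluginContent`. Building the arrays once, outside the getters, means repeated reads of navigator.plugins return the same object.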
fair-rose
fair-roseOP•3y ago
works like a charm! thanks @LeMoussel !!!
other-emerald
other-emerald•3y ago
Just curious, how are you generating those plugins? I am using Puppeteer but failing checks in the bot tests. See screenshot: [attached screenshot]
fair-rose
fair-roseOP•3y ago
What is interesting: in some cases this code should be in preLaunchHooks and in some cases in prePageCreateHooks. Do not ask me what happens there, I just played around a bit )))) Anyway, attached is my super-mega-PlaywrightCrawler ))) producing 1 km of logs (printf debugging, yes) but demonstrating green results for "plugin length" and "mimeTypes".
other-emerald
other-emerald•3y ago
Thanks a lot. I was able to make it work using Puppeteer. code:
preNavigationHooks: [
    async ({ page, request }) => {
        log.info(`preNavigationHook: GET=${request.url} START`);
        const preloadFile = fs.readFileSync('./preload.js', 'utf8');
        await page.evaluateOnNewDocument(preloadFile);
        log.info(`preNavigationHook: GET=${request.url} END`);
    }
],
preload.js:
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const PDFPlugin = Object.create(Plugin.prototype, {
            description: { value: 'Portable Document Format', enumerable: false },
            filename: { value: 'internal-pdf-viewer', enumerable: false },
            name: { value: 'PDF Plugin', enumerable: false },
        });
        return Object.create(PluginArray.prototype, {
            length: { value: 1 },
            0: { value: PDFPlugin },
        });
    },
});
Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        const PDFMimeTypeTxt = Object.create(MimeType.prototype, {
            type: { value: 'text/pdf', enumerable: false },
            suffixes: { value: 'pdf', enumerable: false },
            description: { value: 'Portable Document Format', enumerable: false },
            enabledPlugin: { value: 'PDF Plugin', enumerable: false },
        });
        return Object.create(MimeTypeArray.prototype, {
            length: { value: 1 },
            0: { value: PDFMimeTypeTxt },
        });
    },
});
other-emerald
other-emerald•3y ago
I ran your script locally with proxy servers but I still see these red flags. Any idea how you resolved them? I am trying to figure out the same thing. [two attached screenshots]
fair-rose
fair-roseOP•3y ago
Well, this JS code: https://discord.com/channels/801163717915574323/1059483872271798333/1060501044456607774 fixes only "Plugin length" and "Mime types". Nothing else.
other-emerald
other-emerald•3y ago
I was able to resolve all the bot checks using this plugin: https://discord.com/channels/801163717915574323/1051917834290200608/1052147143508500490 Only the webdriver check in the fingerprint tests and the hairline feature test failed; all the rest passed.
fair-rose
fair-roseOP•3y ago
Well... actually, the code attached to this message https://discord.com/channels/801163717915574323/1059483872271798333/1060959263641567354 has a green "webdriver" flag, and many other bot checks are also green. Yes, the hairline feature... can we ignore it?
other-emerald
other-emerald•3y ago
I am not sure about the hairline feature, but in many YouTube videos and a few blogs I have seen, most people ignore it.
xenial-black
xenial-black•3y ago
@Adi With the code provided at the following link, https://intoli.com/blog/making-chrome-headless-undetectable/, which looks as follows:
const webGLContent = `
    // Keep a reference to the original method, which lives on the prototype.
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function (parameter) {
        // UNMASKED_VENDOR_WEBGL
        if (parameter === 37445) {
            return 'Intel Open Source Technology Center';
        }
        // UNMASKED_RENDERER_WEBGL
        if (parameter === 37446) {
            return 'Mesa DRI Intel(R) Ivybridge Mobile ';
        }

        // Call the original with the correct receiver for all other parameters.
        return getParameter.call(this, parameter);
    };
`
......
await page.addInitScript({ content: webGLContent });
......
you get the desired values for the renderer and vendor, like this: [attached screenshot]
xenial-black
xenial-black•3y ago
And as indicated in the article, you can also set Retina/HiDPI Hairline Feature. But as mentioned, "This is another test that doesn’t really make a ton of sense because the majority of people don’t have HiDPI screens and most users’ browsers won’t support this feature. "
fair-rose
fair-roseOP•3y ago
const webGLContent = ...
Excellent! What we really need is a list of 100-200 such strings and a piece of JS code that randomly returns a "webGL string"... (in other words, this functionality should be in the next version of Crawlee)
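A rough sketch of that idea, assuming a list of vendor/renderer pairs is maintained by hand. `WEBGL_PROFILES`, `pickProfile` and `buildWebGLScript` are hypothetical names, and only the first pair comes from this thread; the second is a made-up placeholder, not a real GPU string.

```javascript
// Vendor/renderer pairs. The first pair is the one quoted in this thread;
// the second is a made-up placeholder, not a real GPU string.
const WEBGL_PROFILES = [
    { vendor: 'Intel Open Source Technology Center', renderer: 'Mesa DRI Intel(R) Ivybridge Mobile ' },
    { vendor: 'Example Vendor', renderer: 'Example Renderer' },
];

// Hypothetical helper: picks one profile; `random` is injectable for testing.
function pickProfile(profiles = WEBGL_PROFILES, random = Math.random) {
    return profiles[Math.floor(random() * profiles.length)];
}

// Hypothetical helper: returns browser-side source that patches getParameter.
function buildWebGLScript({ vendor, renderer }) {
    return `
(() => {
    const original = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function (parameter) {
        if (parameter === 37445) return ${JSON.stringify(vendor)}; // UNMASKED_VENDOR_WEBGL
        if (parameter === 37446) return ${JSON.stringify(renderer)}; // UNMASKED_RENDERER_WEBGL
        return original.call(this, parameter);
    };
})();
`;
}

// One script per crawler run (or per session), built from a random profile.
const randomWebGLContent = buildWebGLScript(pickProfile());
```

The resulting string can be injected with `page.addInitScript({ content: randomWebGLContent })`. A randomly chosen pair may not match the rest of the fingerprint (user agent, operating system), so it is probably safer to pick from pairs that fit the configured browser and OS.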
other-emerald
other-emerald•3y ago
Thanks a lot for sharing 🙂
inland-turquoise
inland-turquoise•3y ago
Great research, guys. Once our team gets more time, we will make sure all of this is implemented by default in Crawlee.
fair-rose
fair-roseOP•3y ago
any news about this plugin problem?
Pepa J
Pepa J•3y ago
Hi @new_in_town, there is currently a PR for this: https://github.com/apify/fingerprint-suite/pull/141. I am sorry, wrong thread; that one is for https://discord.com/channels/801163717915574323/1059916802446073957