Crawlee - how to set timezone?

OK, I know which country my proxies/IPs are in, so I can set the locale:
const crawler = new PlaywrightCrawler({
    ...
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            locales: [ ... ],
            ...
But how do I set the timezone to match the country? This is not a theoretical question: https://pixelscan.net checks the timezone, detects "Africa/Abidjan", compares it with my IP in a German datacenter, and says "Look like you spoofing your location". (Attached: two parts of the huge screenshot taken in headless mode with PlaywrightCrawler.) So how do I set or control the timezone?
Pixelscan
Pixelscan is a one-and-done solution to detect internet bots and manually-controlled browsers with irregular connections between browser fingerprint parameters.
9 Replies
foreign-sapphire (OP), 3y ago
Part of a screenshot from https://pixelscan.net
fascinating-indigo, 3y ago
There is a timezoneId option in Playwright's browser.newContext(...): https://playwright.dev/docs/api/class-browser#browser-new-context-option-timezone-id
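Since timezoneId takes an IANA timezone name, one way to keep it in step with the locale is to store both per country in a small table. A minimal sketch, assuming hypothetical names (GEO_PROFILES, isValidTimezone, geoProfileFor) and example entries; none of this is part of Crawlee or Playwright:

```javascript
// Hypothetical lookup table: ISO country code -> locale and IANA timezone.
// The entries are examples only; extend them for the countries your proxies use.
const GEO_PROFILES = new Map([
    ['DE', { locale: 'de-DE', timezoneId: 'Europe/Berlin' }],
    ['AU', { locale: 'en-AU', timezoneId: 'Australia/Brisbane' }],
]);

// Intl.DateTimeFormat throws a RangeError for an unknown timezone name,
// so constructing a formatter is a cheap validity check.
function isValidTimezone(timezoneId) {
    try {
        new Intl.DateTimeFormat('en-US', { timeZone: timezoneId });
        return true;
    } catch {
        return false;
    }
}

// Returns a copy of the profile for a country, or throws if it is missing or invalid.
function geoProfileFor(countryCode) {
    const profile = GEO_PROFILES.get(countryCode.toUpperCase());
    if (!profile) {
        throw new Error(`No geo profile for country "${countryCode}"`);
    }
    if (!isValidTimezone(profile.timezoneId)) {
        throw new Error(`Invalid timezone "${profile.timezoneId}"`);
    }
    return { ...profile };
}
```

The returned locale and timezoneId can then be passed wherever the crawler configuration expects them, so both values always come from the same row of the table.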
foreign-sapphire (OP), 3y ago
@LeMoussel would you please help me use timezoneId in Crawlee? (I have difficulty connecting the Playwright API with Crawlee.) This is the configuration of my PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 2,
        maxConcurrency: 4,
        loggingIntervalSecs: null,
    },

    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: true,

    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
                locales: ['de-DE', 'de'],
            },
        },
    },

    launchContext: {
        useIncognitoPages: true,
        launcher: firefox,
    },
    // ...
});
I think timezoneId should go somewhere in launchContext... but where?
fascinating-indigo, 3y ago
GitHub: crawlee/MIGRATIONS.md at master · apify/crawlee
GitHub: apify/browser-pool, a Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
fascinating-indigo, 3y ago
With this working code:
import {
    PlaywrightCrawler, // https://crawlee.dev/docs/examples/playwright-crawler
    sleep,
} from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    headless: false,
    maxConcurrency: 4,
    minConcurrency: 2,
    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
        preLaunchHooks: [
            async (pageId, launchContext) => {
                launchContext.launchOptions.locale = 'en-AU';
                launchContext.launchOptions.timezoneId = 'Australia/Brisbane';
            },
        ],
    },
    launchContext: {
        useIncognitoPages: false,
        launcher: firefox,
    },

    async requestHandler({ request, page, log }) {
        log.info(`GET ${request.url} DONE`);

        // Read the IANA timezone that JavaScript in the page reports (https://en.wikipedia.org/wiki/List_of_tz_database_time_zones)
        const timezoneFromJavascript = await page.evaluate('Intl.DateTimeFormat().resolvedOptions().timeZone');
        log.info(`Timezone from Javascript: ${timezoneFromJavascript}`);

        if (request.userData.site === 'pixelscan') {
            await sleep(10000);
        }

        const url = new URL(request.url);
        await page.screenshot({ path: `${url.host}.png`, fullPage: true });
    },
});

await crawler.run([
    { url: 'https://pixelscan.net/', userData: { site: 'pixelscan' } },
]);
Another option is to set launchOptions in launchContext like this:
launchContext: {
    launchOptions: {
        locale: 'en-AU',
        timezoneId: 'Australia/Brisbane',
    },
    useIncognitoPages: false,
    launcher: firefox,
},
IMPORTANT: In all cases, useIncognitoPages must be set to false.
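The likely reason, as far as I understand browser-pool: with useIncognitoPages: false the launch options are handed to Playwright's launchPersistentContext(), which also accepts context options such as locale and timezoneId, whereas incognito pages each get a fresh context. For the incognito case, browser-pool documents prePageCreateHooks, whose third argument (pageOptions) is used when the new context is created. A sketch of that alternative, not verified against pixelscan.net; timezonePageHook is a hypothetical helper name:

```javascript
// Builds a hook with the (pageId, browserController, pageOptions) signature
// that browser-pool documents for prePageCreateHooks.
function timezonePageHook({ locale, timezoneId }) {
    return (pageId, browserController, pageOptions) => {
        // pageOptions is undefined when the controller does not support
        // per-page options, so there is nothing to set in that case.
        if (!pageOptions) return;
        pageOptions.locale = locale;
        pageOptions.timezoneId = timezoneId;
    };
}

// Intended use inside the crawler config (assumption: useIncognitoPages is true):
// browserPoolOptions: {
//     prePageCreateHooks: [
//         timezonePageHook({ locale: 'de-DE', timezoneId: 'Europe/Berlin' }),
//     ],
// },
```

If this does not behave as expected in your Crawlee version, the preLaunchHooks approach with useIncognitoPages: false is the one confirmed to work in this thread.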
foreign-sapphire (OP), 3y ago
Awesome! This approach with preLaunchHooks, with useIncognitoPages: false, AND with these settings in fingerprintGeneratorOptions (I additionally set locales):
fingerprintGeneratorOptions: {
    browsers: ['firefox'],
    operatingSystems: ['linux'],
    locales: ['de-DE'],
},
With all this, it finally works, and pixelscan.net believes I'm in Germany (which is true; no proxies used in this test yet). See the screenshots.
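The combination described above (fingerprint locales plus a preLaunchHook) can be derived from a single profile so the two never drift apart. A sketch with a hypothetical helper, browserPoolOptionsFor; the option names are taken from the configurations posted in this thread, and useIncognitoPages: false is still assumed in launchContext:

```javascript
// Builds the browserPoolOptions fragment from one { locale, timezoneId } pair.
function browserPoolOptionsFor({ locale, timezoneId }) {
    return {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
                locales: [locale],
            },
        },
        preLaunchHooks: [
            // Same hook signature as in the working example: (pageId, launchContext).
            (pageId, launchContext) => {
                if (!launchContext.launchOptions) launchContext.launchOptions = {};
                launchContext.launchOptions.locale = locale;
                launchContext.launchOptions.timezoneId = timezoneId;
            },
        ],
    };
}

// Intended use:
// new PlaywrightCrawler({
//     browserPoolOptions: browserPoolOptionsFor({ locale: 'de-DE', timezoneId: 'Europe/Berlin' }),
//     launchContext: { useIncognitoPages: false, launcher: firefox },
//     ...
// });
```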
foreign-sapphire (OP), 3y ago
@Lukas Krivka - I think an example like this should be added to the Crawlee documentation. Without @LeMoussel I would never have figured out how it works. Yes, I tried to set "launchOptions" this way, and it does not work! No idea why...
inland-turquoise, 3y ago
The new headless Chrome passes all the Pixelscan checks as well as CreepJS. Do check it out. When we use proxies with the new headless mode, though, Pixelscan shows the "location is being spoofed" message. Any idea how to avoid it?
unwilling-turquoise, 3y ago
@Adi Is that the WebRTC leak, perhaps? As I wrote, it will be fixed soonish.
