Crawlee & Apify · 9mo ago
genetic-orange

Which browser is best for crawling?

As the title says: I'm currently using Chromium, but it is CPU-heavy. Killing the browser does not kill the process, so it's easy to hit 100% CPU usage pretty quickly. (I'm crawling thousands of websites, and on each one I'm looking for different data.) I already try to load pure HTML without CSS, images, and other assets. That helped a lot, but the issue is still there.
4 Replies
Hall · 9mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
absent-sapphire · 8mo ago
Hi @Wojciech, I recommend also blocking unnecessary network requests with the blockRequests helper. Make sure you are running in headless mode. You could also try using cheerio if the use case allows it. Regarding your question about the browser: Firefox tends to be lighter on CPU usage.
Using Firefox browser with Playwright crawler | Crawlee · Build rel...
Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.
PlaywrightCrawlingContext | API | Crawlee · Build reliable crawlers...
genetic-orange (OP) · 8mo ago
Yes, I already do that:
const launchContext: PlaywrightLaunchContext = {
    launcher: firefox,
    launchOptions: {
        headless: false,
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-dev-shm-usage',
        ],
    },
    useChrome: false, // Use Chromium instead of Chrome for better performance
    userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
}
...
launchContext,
preNavigationHooks: [
    async ({ page }) => {
        await playwrightUtils.blockRequests(page, {
            urlPatterns: [
                '.png',
                '.jpg',
                '.jpeg',
                '.gif',
                '.svg',
                '.ico',
                '.woff',
                '.woff2',
                'adsbygoogle.js',
            ],
            extraUrlPatterns: ['adsbygoogle.js'],
        })

        await playwrightUtils.closeCookieModals(page)
    },
],
Unfortunately, I receive: WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers. I didn't know that 😄
multiple-amethyst · 7mo ago
You can block requests manually (I mean without using the util function). Example:
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];

Then, within the preNavigationHooks of your crawler, add this function:
async ({ page }) => {
    await page.route('**/*', (route) => {
        if (BLOCKED.includes(route.request().resourceType())) return route.abort();
        return route.continue();
    });
};
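
For anyone who wants to keep the URL pattern list from the original blockRequests() call while running Firefox, here is a minimal sketch that combines it with the resource-type check from the reply above. The names shouldBlock, handleRoute, RequestLike, and RouteLike are hypothetical and only for illustration; the two interfaces mirror the few methods of Playwright's Route and Request that the sketch needs, so it runs without Playwright installed. Plain substring matching is an assumption here and may not match exactly what blockRequests() does on Chromium.

```typescript
// Minimal structural types (hypothetical names) covering only the
// Playwright Route/Request methods this sketch relies on.
interface RequestLike {
    resourceType(): string;
    url(): string;
}

interface RouteLike {
    request(): RequestLike;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

// Resource types from the reply above.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'media', 'font', 'other']);

// URL fragments from the original blockRequests() call.
const BLOCKED_URL_PARTS = [
    '.png', '.jpg', '.jpeg', '.gif', '.svg', '.ico',
    '.woff', '.woff2', 'adsbygoogle.js',
];

// Hypothetical helper: true when the request matches a blocked
// resource type or its URL contains one of the blocked fragments.
function shouldBlock(request: RequestLike): boolean {
    if (BLOCKED_TYPES.has(request.resourceType())) return true;
    const url = request.url();
    return BLOCKED_URL_PARTS.some((part) => url.includes(part));
}

// Hypothetical route handler: abort blocked requests, let the rest through.
async function handleRoute(route: RouteLike): Promise<void> {
    if (shouldBlock(route.request())) {
        await route.abort();
    } else {
        await route.continue();
    }
}
```

Inside a preNavigationHooks entry this would be registered with await page.route('**/*', handleRoute), the same way as in the reply above. Blocking the 'other' resource type is fairly broad, so it may be worth dropping if a site stops rendering the data you need.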
