Need help bypassing CF 403 Blocked

Hi guys, I'm new to this community and I'm trying to scrape allpeople.com, which is behind Cloudflare protection. After reading the docs I came up with two approaches: puppeteer-extra with the stealth plugin, and Playwright with Firefox. Both get a 403 Blocked from Cloudflare (I'll share code snippets inside the thread). Am I doing something wrong? If not, what else can I try to get past the CF 403?
correct-apricot (OP) · 3y ago
Puppeteer-stealth
import puppeteerVanilla from 'puppeteer';
import { addExtra } from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { PuppeteerCrawler } from 'crawlee';
import { Actor } from 'apify';

const puppeteer = addExtra(puppeteerVanilla);
puppeteer.use(StealthPlugin());

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PuppeteerCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    proxyConfiguration,
    launchContext: {
        launcher: puppeteer,
    },
    async requestHandler({ request, page }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        console.log('title', title);
    },
    async errorHandler({ session, proxyInfo }) {
        console.log('proxyInfo', proxyInfo);
        await session.retire();
    },
    maxRequestRetries: 5,
});

await crawler.run([
    { url: 'https://allpeople.com/search?ss=peter+michaek&ss-e=&ss-p=&ss-i=&where=&industry-auto=&where-auto=' },
]);

await Actor.exit();
Playwright/firefox
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';
import { Actor } from 'apify';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            headless: true,
        },
    },
    proxyConfiguration,
    async requestHandler({ request, page, log }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        log.info(`title: ${title}`);
    },
});

await crawler.addRequests([
    'https://allpeople.com/search?ss=Blanca+murillo&ss-e=&ss-p=&ss-i=&where=&industry-auto=&where-auto=',
]);

await crawler.run();

await Actor.exit();
foreign-sapphire · 3y ago
Firstly, you should find out whether the problem is the automated browser or the proxies.
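One way to check (a rough sketch, not from this thread; it assumes the got-scraping package is installed and that your Apify account has proxy access) is to hit the site with a plain HTTP client through the same proxy, with no browser at all. If that already returns 403, the proxies are the problem; if it passes but the browser runs fail, the browser fingerprint is what CF dislikes.

// Sketch: plain HTTP request through the same Apify proxy, no browser involved.
import { gotScraping } from 'got-scraping';
import { Actor } from 'apify';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl();

// throwHttpErrors: false lets us inspect the status code ourselves
const { statusCode } = await gotScraping({
    url: 'https://allpeople.com/',
    proxyUrl,
    throwHttpErrors: false,
});
console.log('status through proxy, no browser:', statusCode);

await Actor.exit();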
deep-jade · 3y ago
Hey there! I briefly checked the site and it does go through CF, but it seems CF has started returning a 403 status code for its check page itself. That means the site actually loads, yet the crawler thinks it's blocked :/ Adding sessionPoolOptions: { blockedStatusCodes: [] } to the crawler options solves the problem.
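For reference, a sketch of where that option goes, reusing the launchContext and proxyConfiguration from the Playwright snippet above (the session pool treats 401, 403 and 429 as blocked by default, so emptying the list stops it from retiring sessions on the CF check page):

// Same setup as the Playwright snippet above, plus sessionPoolOptions.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: true },
    },
    proxyConfiguration,
    useSessionPool: true,
    // Empty list = don't mark sessions as blocked on 403, which CF sends
    // for its check page even when the target page loads afterwards.
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
    async requestHandler({ page, log }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        log.info(`title: ${title}`);
    },
});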
correct-apricot (OP) · 3y ago
Thanks, @Andrey Bykov
foreign-sapphire · 3y ago
Hey, where do you find your residential proxies?
deep-jade · 3y ago
Sorry, what do you mean? Proxies are available in your Apify account.
