Need help bypassing CF 403 Blocked

Hi guys, I'm new to this community and I'm trying to scrape allpeople.com, which is behind Cloudflare protection. After reading the docs I came up with two approaches: puppeteer-extra with the stealth plugin, and Playwright with Firefox. Both get a 403 Blocked from Cloudflare (I'll share code snippets inside the thread). Am I doing something wrong? If not, what else can I try to get past the CF 403?
correct-apricot (OP) · 3y ago
Puppeteer-stealth
import puppeteerVanilla from 'puppeteer';
import { addExtra } from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { PuppeteerCrawler } from 'crawlee';
import { Actor } from 'apify';

const puppeteer = addExtra(puppeteerVanilla);
puppeteer.use(StealthPlugin());

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PuppeteerCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    proxyConfiguration,
    launchContext: {
        launcher: puppeteer,
    },
    async requestHandler({ request, page }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        console.log('title', title);
    },
    async errorHandler({ session, proxyInfo }) {
        console.log('proxyInfo', proxyInfo);
        await session.retire();
    },
    maxRequestRetries: 5,
});

await crawler.run([
    { url: 'https://allpeople.com/search?ss=peter+michaek&ss-e=&ss-p=&ss-i=&where=&industry-auto=&where-auto=' },
]);

await Actor.exit();
Playwright/firefox
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';
import { Actor } from 'apify';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            headless: true,
        },
    },
    proxyConfiguration,
    async requestHandler({ request, page, log }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        log.info(`title: ${title}`);
    },
});

await crawler.addRequests([
    'https://allpeople.com/search?ss=Blanca+murillo&ss-e=&ss-p=&ss-i=&where=&industry-auto=&where-auto=',
]);

await crawler.run();

await Actor.exit();
foreign-sapphire · 3y ago
Firstly, you should find out whether the problem is the automated browser or the proxies.
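One way to check (a rough sketch, not from this thread; it assumes the got-scraping package is installed and that your Apify account has proxy access) is to hit the site with a plain HTTP client through the same proxy, with no browser at all. If that already returns 403, the proxies are the problem; if it passes but the browser runs fail, the browser fingerprint is what CF dislikes.

// Sketch: plain HTTP request through the same Apify proxy, no browser involved.
import { gotScraping } from 'got-scraping';
import { Actor } from 'apify';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl();

// throwHttpErrors: false lets us inspect the status code ourselves
const { statusCode } = await gotScraping({
    url: 'https://allpeople.com/',
    proxyUrl,
    throwHttpErrors: false,
});
console.log('status through proxy, no browser:', statusCode);

await Actor.exit();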
deep-jade · 3y ago
Hey there! I briefly checked the site and it does go through CF, but it seems CF has started returning a 403 status code for its check page itself. That means the site actually loads, yet the crawler thinks it's blocked :/ Adding sessionPoolOptions: { blockedStatusCodes: [] } to the crawler options solves the problem.
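For reference, a sketch of where that option goes, reusing the launchContext and proxyConfiguration from the Playwright snippet above (the session pool treats 401, 403 and 429 as blocked by default, so emptying the list stops it from retiring sessions on the CF check page):

// Same setup as the Playwright snippet above, plus sessionPoolOptions.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: true },
    },
    proxyConfiguration,
    useSessionPool: true,
    // Empty list = don't mark sessions as blocked on 403, which CF sends
    // for its check page even when the target page loads afterwards.
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
    async requestHandler({ page, log }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        log.info(`title: ${title}`);
    },
});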
correct-apricot (OP) · 3y ago
Thanks, @Andrey Bykov
foreign-sapphire · 3y ago
Hey, where do you find your residential proxies?
deep-jade · 3y ago
Sorry, what do you mean? Proxies are available in your Apify account.
