There is a major problem, Crawlee is unable to bypass the cloudflare protecti...

@Helper @gahabeen There is a major problem: Crawlee is unable to bypass the Cloudflare protection (the captcha solution was tried 5 times). The useChrome method was tried and failed. Manual login was successful when done in Chrome (outside of Node; also tried with incognito mode, etc.): https://abrahamjuliot.github.io/creepjs/ Despite Crawlee receiving a higher trust score than the Chrome browser I am currently using, it is unable to pass the Cloudflare page.
28 Replies
fair-rose
fair-roseOP•3y ago
parallel-tan
parallel-tan•3y ago
1. Try it with Playwright + Firefox.
2. Make sure you have high-quality proxies. But a local IP should also be good if you can open the site normally.
3. Try with Crawler.
reduced-jade
reduced-jade•3y ago
There is a thread with some suggestions: https://discord.com/channels/801163717915574323/1039611311467810856/1041684802052562974 But as far as I know, for some pages no approach from Crawlee really works and you always get a captcha.
optimistic-gold
optimistic-gold•3y ago
Did you try with BasicCrawler (the got-scraping library)? https://crawlee.dev/docs/guides/got-scraping
Got Scraping | Crawlee
Blazing fast cURL alternative for modern web scraping
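For context, a minimal sketch of what a plain got-scraping request looks like, assuming the `got-scraping` package is installed; the URL and the `PROXY_URL` environment variable are illustrative:

```javascript
import { gotScraping } from "got-scraping";

// gotScraping generates browser-like headers and a browser-like TLS
// fingerprint, which sometimes passes checks that plain HTTP clients fail.
const { body, statusCode } = await gotScraping({
  url: "https://crawlee.dev",
  proxyUrl: process.env.PROXY_URL, // optional; hypothetical env var
});

console.log(statusCode);
```

For heavy JavaScript challenges like Cloudflare's, a plain HTTP client often will not be enough, but it is worth trying first since it is far cheaper than a full browser.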
fair-rose
fair-roseOP•3y ago
3. Fail: WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
Chromium worked perfectly with puppeteer-extra's StealthPlugin (it redirected to the main content without needing to solve the Cloudflare captcha):
const puppeteerVanilla = require("puppeteer");
const { addExtra } = require("puppeteer-extra");
const puppeteer = addExtra(puppeteerVanilla);

const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());

// Main function
puppeteer.launch({ headless: false }).then(async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://chat.openai.com/");
});
Also PuppeteerCrawler worked with puppeteer-extra's StealthPlugin:
import { PuppeteerCrawler } from "crawlee";
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
const puppeteer = addExtra(puppeteerVanilla);

import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());

// Main function
const crawler = new PuppeteerCrawler({
  launchContext: {
    launcher: puppeteer.launch({ headless: false }).then(async (browser) => {
      const page = await browser.newPage();
      await page.goto("https://chat.openai.com/");
    }),
  },
});
parallel-tan
parallel-tan•3y ago
@petrpatek Can you look into this?
reduced-jade
reduced-jade•3y ago
thanks I will try puppeteer-extra's StealthPlugin
sensitive-blue
sensitive-blue•3y ago
The stealth plugin works awesome for CF bypassing. Using vanilla Puppeteer is not a good option for scraping & crawling, since it's easy to detect that the browser is driven by a script due to its fingerprint. creepjs can be used to see the browser's trust score.
parallel-tan
parallel-tan•3y ago
Vanilla Crawlee should be better than puppeteer stealth, if it is not, we need to fix it
fascinating-indigo
fascinating-indigo•3y ago
You mean the PuppeteerCrawler from Crawlee? Is useFingerprints set by default, or should it be set explicitly?
const crawler = new PuppeteerCrawler({
  // ...
  browserPoolOptions: {
    useFingerprints: true,
  },
});
Btw, I also tried the StealthPlugin; I didn't feel it improved anything. YMMV.
fair-rose
fair-roseOP•3y ago
I also tried the StealthPlugin, I didn't feel it improved anything.
Did you say for Cloudflare? Could your IP or device information be contaminated?
MEE6
MEE6•3y ago
@eigensinnig just advanced to level 3! Thanks for your contributions! 🎉
fair-rose
fair-roseOP•3y ago
Not for Cloudflare, please test it
parallel-tan
parallel-tan•3y ago
Yeah, we need to fix it. The goal is to beat the stealth plugin. We are likely already better with Playwright and Firefox (the best combo), but we need to catch up with Puppeteer. It is on by default.
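For anyone wanting to try the Playwright + Firefox combo mentioned here, a minimal sketch, assuming `crawlee` and `playwright` are installed; the URL is illustrative:

```javascript
import { PlaywrightCrawler } from "crawlee";
import { firefox } from "playwright";

const crawler = new PlaywrightCrawler({
  launchContext: {
    // Use Firefox instead of the default Chromium.
    launcher: firefox,
  },
  async requestHandler({ page, request, log }) {
    // Fingerprint injection is enabled by default in Crawlee's browser pool.
    log.info(`${request.url}: ${await page.title()}`);
  },
});

await crawler.run(["https://chat.openai.com/"]);
```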
sensitive-blue
sensitive-blue•3y ago
I think the same
MEE6
MEE6•3y ago
@Samet just advanced to level 1! Thanks for your contributions! 🎉
reduced-jade
reduced-jade•3y ago
Thank you very much. Your solution with puppeteer-extra's StealthPlugin works like a charm (at least for the URL where Crawlee, even with Playwright + Firefox, always got a 403). I am still not sure how to incorporate it into the PuppeteerCrawler, as in your example you do not use a requestQueue but have the URL in the constructor. Can you give a hint?
parallel-tan
parallel-tan•3y ago
Just do launcher: puppeteer
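Spelled out, that advice might look like the sketch below: pass the stealth-patched puppeteer module itself as `launcher` and let Crawlee handle launching and the request queue. The URL and handler body are illustrative:

```javascript
import { PuppeteerCrawler } from "crawlee";
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

const puppeteer = addExtra(puppeteerVanilla);
puppeteer.use(StealthPlugin());

const crawler = new PuppeteerCrawler({
  launchContext: {
    // Pass the patched module, not a launch() promise;
    // Crawlee calls launch() itself when it needs a browser.
    launcher: puppeteer,
  },
  async requestHandler({ page, request, log }) {
    log.info(`Processing ${request.url}`);
  },
});

// crawler.run() enqueues the URLs into the request queue for you.
await crawler.run(["https://chat.openai.com/"]);
```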
genetic-orange
genetic-orange•3y ago
I do not know how current the problem with chat.openai.com is... Actually, a simple program with PlaywrightCrawler configured with Firefox on Linux is able to access this site. I just took a screenshot in headless mode (without a proxy! straight from a machine in a data center):
fair-rose
fair-roseOP•3y ago
It may be related to the trust score. I have a VPN on 24/7, but even then Crawlee is at fault, because another tool works.
parallel-tan
parallel-tan•3y ago
It is always a combination of IP address + browser config. You cannot really forget about one or the other when doing blocking comparisons. Your local home IP is usually as clean as it gets (residential proxies are worse, and datacenter worse still).
reduced-jade
reduced-jade•3y ago
Actually, with the stealth plugin it works even with datacenter proxies. With Crawlee's default config it did not work even with residential ones, so I think it was all about browser config, at least in my case (G2 review pages).
fair-rose
fair-roseOP•3y ago
Just a note here, I will carry out detailed tests if needed: https://stateofscraping.org/ https://github.com/unblocked-web/double-agent
State of Scraping
State of Scraping is a report about detectability of popular scraping stacks compiled by the Data Liberation Foundation.
GitHub
GitHub - unblocked-web/double-agent: A test suite of common scraper...
A test suite of common scraper detection techniques. See how detectable your scraper stack is. - GitHub - unblocked-web/double-agent: A test suite of common scraper detection techniques. See how de...
conscious-sapphire
conscious-sapphire•3y ago
How do I bypass or avoid this page (press & hold) while scraping Capterra reviews using an Apify Actor?
conscious-sapphire
conscious-sapphire•3y ago
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
2023-04-07T03:49:37.993Z {"id":"kmPcFnRhSQM8xHs","url":"https://www.capterra.com/p/107199/Medallia-Enterprise/reviews/","retryCount":3}
2023-04-07T03:49:47.490Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
Pepa J
Pepa J•3y ago
This looks like quite a specific captcha; which Actor are you using? If you are a developer, to press and hold a button you may try a solution similar to what is suggested here: https://stackoverflow.com/a/68513568 The 403 also suggests your request is being blocked; have you tried a different proxy group (e.g. RESIDENTIAL)?
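The Stack Overflow suggestion above boils down to a timed press-and-hold with the mouse API. A minimal Puppeteer-style sketch; the selector and hold duration are assumptions to adjust for the actual challenge page:

```javascript
// Sketch of a press-and-hold interaction using Puppeteer's mouse API.
// "page" is an already-open Puppeteer Page; the selector is hypothetical.
async function pressAndHold(page, selector, holdMs = 5000) {
  const element = await page.waitForSelector(selector);
  const box = await element.boundingBox();
  // Press in the middle of the element, hold for holdMs, then release.
  await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
  await page.mouse.down();
  await new Promise((resolve) => setTimeout(resolve, holdMs));
  await page.mouse.up();
}
```

Note that a realistic hold duration (and possibly a slight mouse movement beforehand) tends to matter more than the exact coordinates.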
conscious-sapphire
conscious-sapphire•3y ago
@Pepa J Yes, I'm using the proxy below:
const proxyConfiguration = await Actor.createProxyConfiguration({
  // proxyUrls: ['http://groups-RESIDENTIAL:[email protected]:8000'],
  groups: ['RESIDENTIAL'],
  countryCode: 'US',
});
MEE6
MEE6•3y ago
@ankit21090 just advanced to level 1! Thanks for your contributions! 🎉