There is a major problem, Crawlee is unable to bypass the cloudflare protecti...

@Helper @gahabeen There is a major problem: Crawlee is unable to bypass the Cloudflare protection (the captcha solution was tried 5 times). The useChrome method was tried and failed. Manual login was successful when done in Chrome (outside of Node; also tried with incognito mode, etc.): https://abrahamjuliot.github.io/creepjs/ Despite Crawlee receiving a higher trust score than the Chrome browser I am currently using, it is unable to pass the Cloudflare page.
28 Replies
fair-rose
fair-roseOP•3y ago
parallel-tan
parallel-tan•3y ago
1. Try it with Playwright + Firefox.
2. Make sure you have high-quality proxies. But a local IP should also be good if you can open the site normally.
3. Try with Crawler.
reduced-jade
reduced-jade•3y ago
There is a thread with some suggestions: https://discord.com/channels/801163717915574323/1039611311467810856/1041684802052562974 But as far as I know, for some pages no approach from Crawlee really works and you always get a captcha.
optimistic-gold
optimistic-gold•3y ago
Did you try with BasicCrawler (the got-scraping library)? https://crawlee.dev/docs/guides/got-scraping
Got Scraping | Crawlee
Blazing fast cURL alternative for modern web scraping
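For context, a minimal sketch of what a plain got-scraping request looks like, assuming the `got-scraping` package is installed; the URL and the `PROXY_URL` environment variable are illustrative:

```javascript
import { gotScraping } from "got-scraping";

// gotScraping generates browser-like headers and a browser-like TLS
// fingerprint, which sometimes passes checks that plain HTTP clients fail.
const { body, statusCode } = await gotScraping({
  url: "https://crawlee.dev",
  proxyUrl: process.env.PROXY_URL, // optional; hypothetical env var
});

console.log(statusCode);
```

For heavy JavaScript challenges like Cloudflare's, a plain HTTP client often will not be enough, but it is worth trying first since it is far cheaper than a full browser.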
fair-rose
fair-roseOP•3y ago
3. Fail: WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
Chromium worked perfectly with puppeteer-extra's StealthPlugin (it redirected to the main content without needing to solve the Cloudflare captcha):
const puppeteerVanilla = require("puppeteer");
const { addExtra } = require("puppeteer-extra");
const puppeteer = addExtra(puppeteerVanilla);

const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());

// Main function
puppeteer.launch({ headless: false }).then(async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://chat.openai.com/");
});
Also PuppeteerCrawler worked with puppeteer-extra's StealthPlugin:
import { PuppeteerCrawler } from "crawlee";
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
const puppeteer = addExtra(puppeteerVanilla);

import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());

// Main function
const crawler = new PuppeteerCrawler({
  launchContext: {
    launcher: puppeteer.launch({ headless: false }).then(async (browser) => {
      const page = await browser.newPage();
      await page.goto("https://chat.openai.com/");
    }),
  },
});
parallel-tan
parallel-tan•3y ago
@petrpatek Can you look into this?
reduced-jade
reduced-jade•3y ago
thanks I will try puppeteer-extra's StealthPlugin
sensitive-blue
sensitive-blue•3y ago
The stealth plugin works awesome for CF bypassing. Using vanilla Puppeteer is not a good option for scraping & crawling, since it's easy to detect that the browser is driven by a script due to its fingerprint. creepjs can be used to see the browser's trust score.
parallel-tan
parallel-tan•3y ago
Vanilla Crawlee should be better than puppeteer stealth, if it is not, we need to fix it
fascinating-indigo
fascinating-indigo•3y ago
You mean the PuppeteerCrawler from Crawlee? Is useFingerprints set by default, or should it be set explicitly?
const crawler = new PuppeteerCrawler({
  // ...
  browserPoolOptions: {
    useFingerprints: true,
  },
});
Btw, I also tried the StealthPlugin; I didn't feel it improved anything. YMMV.
fair-rose
fair-roseOP•3y ago
I also tried the StealthPlugin, I didn't feel it improved anything.
Did you say for Cloudflare? Could your IP or device information be contaminated?
MEE6
MEE6•3y ago
@eigensinnig just advanced to level 3! Thanks for your contributions! 🎉
fair-rose
fair-roseOP•3y ago
Not for Cloudflare, please test it
parallel-tan
parallel-tan•3y ago
Yeah, we need to fix it. The goal is to beat the stealth plugin. We are likely already better with Playwright and Firefox (the best combo), but we need to catch up with Puppeteer. It is on by default.
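For anyone wanting to try the Playwright + Firefox combo mentioned here, a minimal sketch, assuming `crawlee` and `playwright` are installed; the URL is illustrative:

```javascript
import { PlaywrightCrawler } from "crawlee";
import { firefox } from "playwright";

const crawler = new PlaywrightCrawler({
  launchContext: {
    // Use Firefox instead of the default Chromium.
    launcher: firefox,
  },
  async requestHandler({ page, request, log }) {
    // Fingerprint injection is enabled by default in Crawlee's browser pool.
    log.info(`${request.url}: ${await page.title()}`);
  },
});

await crawler.run(["https://chat.openai.com/"]);
```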
sensitive-blue
sensitive-blue•3y ago
I think the same
MEE6
MEE6•3y ago
@Samet just advanced to level 1! Thanks for your contributions! 🎉
reduced-jade
reduced-jade•3y ago
Thank you very much. Your solution with puppeteer-extra's StealthPlugin works like a charm (at least for the URL where Crawlee, even with Playwright + Firefox, always got a 403). I am still not sure how to incorporate it into the PuppeteerCrawler, as in your example you do not use a requestQueue but have the URL in the constructor. Can you give a hint?
parallel-tan
parallel-tan•3y ago
Just do launcher: puppeteer
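Spelled out, that advice might look like the sketch below: pass the stealth-patched puppeteer module itself as `launcher` and let Crawlee handle launching and the request queue. The URL and handler body are illustrative:

```javascript
import { PuppeteerCrawler } from "crawlee";
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

const puppeteer = addExtra(puppeteerVanilla);
puppeteer.use(StealthPlugin());

const crawler = new PuppeteerCrawler({
  launchContext: {
    // Pass the patched module, not a launch() promise;
    // Crawlee calls launch() itself when it needs a browser.
    launcher: puppeteer,
  },
  async requestHandler({ page, request, log }) {
    log.info(`Processing ${request.url}`);
  },
});

// crawler.run() enqueues the URLs into the request queue for you.
await crawler.run(["https://chat.openai.com/"]);
```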
genetic-orange
genetic-orange•3y ago
I do not know how current the problem with chat.openai.com is... Actually, a simple program with PlaywrightCrawler configured with Firefox on Linux is able to access this site. I just took a screenshot in headless mode (without a proxy! straight from a machine in a data center):
fair-rose
fair-roseOP•3y ago
It may be related to the trust score. I have a VPN on 24/7, but even then Crawlee is at fault, because another tool works.
parallel-tan
parallel-tan•3y ago
It is always a combination of IP address + browser config. You cannot really forget about one or the other when doing blocking comparisons. Your local home IP is usually as clean as it gets (residential proxies are worse, and datacenter worse still).
reduced-jade
reduced-jade•3y ago
Actually, with the stealth plugin it works even with datacenter proxies. With Crawlee's default config it did not work even with residential ones, so I think it was all about browser config, at least in my case (G2 review pages).
fair-rose
fair-roseOP•3y ago
Just a note here, I will carry out detailed tests if needed: https://stateofscraping.org/ https://github.com/unblocked-web/double-agent
State of Scraping
State of Scraping is a report about detectability of popular scraping stacks compiled by the Data Liberation Foundation.
GitHub
GitHub - unblocked-web/double-agent: A test suite of common scraper...
A test suite of common scraper detection techniques. See how detectable your scraper stack is. - GitHub - unblocked-web/double-agent: A test suite of common scraper detection techniques. See how de...
conscious-sapphire
conscious-sapphire•3y ago
How do I bypass or avoid this page (press & hold) while scraping Capterra reviews using an Apify Actor?
conscious-sapphire
conscious-sapphire•3y ago
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
2023-04-07T03:49:37.993Z {"id":"kmPcFnRhSQM8xHs","url":"https://www.capterra.com/p/107199/Medallia-Enterprise/reviews/","retryCount":3}
2023-04-07T03:49:47.490Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
Pepa J
Pepa J•3y ago
This looks like quite a specific captcha; which Actor are you using? If you are a developer, to press and hold a button you may try a solution similar to what is suggested here: https://stackoverflow.com/a/68513568 The 403 also suggests your request is being blocked; have you tried a different proxy group (e.g. RESIDENTIAL)?
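The Stack Overflow suggestion above boils down to a timed press-and-hold with the mouse API. A minimal Puppeteer-style sketch; the selector and hold duration are assumptions to adjust for the actual challenge page:

```javascript
// Sketch of a press-and-hold interaction using Puppeteer's mouse API.
// "page" is an already-open Puppeteer Page; the selector is hypothetical.
async function pressAndHold(page, selector, holdMs = 5000) {
  const element = await page.waitForSelector(selector);
  const box = await element.boundingBox();
  // Press in the middle of the element, hold for holdMs, then release.
  await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
  await page.mouse.down();
  await new Promise((resolve) => setTimeout(resolve, holdMs));
  await page.mouse.up();
}
```

Note that a realistic hold duration (and possibly a slight mouse movement beforehand) tends to matter more than the exact coordinates.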
conscious-sapphire
conscious-sapphire•3y ago
@Pepa J Yes, I'm using the proxy below:
const proxyConfiguration = await Actor.createProxyConfiguration({
  // proxyUrls: ['http://groups-RESIDENTIAL:[email protected]:8000'],
  groups: ['RESIDENTIAL'],
  countryCode: 'US',
});
MEE6
MEE6•3y ago
@ankit21090 just advanced to level 1! Thanks for your contributions! 🎉