There is a major problem, Crawlee is unable to bypass the Cloudflare protection
@Helper @gahabeen There is a major problem: Crawlee is unable to bypass the Cloudflare protection (captcha solving was tried 5 times).
The useChrome option was tried and failed.
Manual login was successful when done in Chrome (outside of Node, also tried with incognito mode, etc.).
https://abrahamjuliot.github.io/creepjs/ Despite Crawlee receiving a higher trust score there than the Chrome browser I am currently using, it is unable to pass the Cloudflare page.
parallel-tan•3y ago
1. Try it with Playwright + Firefox (see the sketch after this list)
2. Make sure you have high-quality proxies. But your local IP should also be fine if you can open the site normally
3. Try with Crawler
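A minimal sketch of suggestion 1, assuming Crawlee's PlaywrightCrawler with Playwright's Firefox passed via launchContext.launcher (the URL is a placeholder, not a site from this thread):

const { PlaywrightCrawler } = require('crawlee');
const { firefox } = require('playwright');

const crawler = new PlaywrightCrawler({
    launchContext: {
        // use Firefox instead of the default Chromium
        launcher: firefox,
    },
    async requestHandler({ page, request }) {
        console.log(`Loaded ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']); // placeholder URL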
reduced-jade•3y ago
there is a thread with some suggestions https://discord.com/channels/801163717915574323/1039611311467810856/1041684802052562974
but as far as I know, for some pages no approach from Crawlee really works and you always get a captcha
optimistic-gold•3y ago
Got Scraping | Crawlee — blazing fast cURL alternative for modern web scraping
fair-roseOP•3y ago
3. Fail: WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
Chromium worked perfectly with puppeteer-extra's StealthPlugin (it redirected to the main content without needing to solve the Cloudflare captcha).
PuppeteerCrawler also worked with puppeteer-extra's StealthPlugin.
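For reference, the plain puppeteer-extra + StealthPlugin setup described here looks roughly like this — a minimal sketch, with a placeholder URL:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// the stealth plugin patches many common headless-detection leaks
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com'); // placeholder URL
console.log(await page.title());
await browser.close();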
parallel-tan•3y ago
@petrpatek Can you look into this?
reduced-jade•3y ago
thanks I will try puppeteer-extra's StealthPlugin
sensitive-blue•3y ago
the stealth plugin works great for bypassing CF. Using vanilla Puppeteer is not a good option for scraping & crawling, since fingerprinting makes it easy to detect that the browser is driven by a script
CreepJS can be used to see the browser's trust score
parallel-tan•3y ago
Vanilla Crawlee should be better than puppeteer stealth, if it is not, we need to fix it
fascinating-indigo•3y ago
you mean the PuppeteerCrawler from Crawlee? Is useFingerprints set by default, or should it be set explicitly?
btw, I also tried the StealthPlugin, I didn't feel it improved anything. YMMV
fair-roseOP•3y ago
"I also tried the StealthPlugin, I didn't feel it improved anything." Did you say that for Cloudflare? Could your IP or device information be contaminated?
fair-roseOP•3y ago
Not for Cloudflare, please test it
parallel-tan•3y ago
Yeah, we need to fix it. The goal is to beat the stealth plugin. We are already likely better with Playwright and Firefox (the best combo) but need to catch up with Puppeteer
It is on by default.
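For anyone who wants to set it explicitly anyway, a minimal sketch using Crawlee's browserPoolOptions:

const { PuppeteerCrawler } = require('crawlee');

const crawler = new PuppeteerCrawler({
    browserPoolOptions: {
        // already Crawlee's default, shown here only for explicitness
        useFingerprints: true,
    },
    async requestHandler({ page }) {
        // ...
    },
});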
sensitive-blue•3y ago
I think the same
reduced-jade•3y ago
Thank you very much. Your solution with puppeteer-extra's StealthPlugin works like a charm (at least for the URL where Crawlee, even with Playwright + Firefox, always got a 403). I am still not sure how to incorporate it into the PuppeteerCrawler, as in your example you do not use a requestQueue but have the URL in the constructor. Can you give a hint?
parallel-tan•3y ago
Just do launcher: puppeteer in the launchContext.
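A minimal sketch of that, assuming the puppeteer-extra instance (with StealthPlugin) is passed to PuppeteerCrawler via launchContext.launcher; the URL is a placeholder:

const { PuppeteerCrawler } = require('crawlee');
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const crawler = new PuppeteerCrawler({
    launchContext: {
        // pass the puppeteer-extra instance instead of vanilla puppeteer
        launcher: puppeteer,
    },
    async requestHandler({ page, request }) {
        console.log(`Loaded ${request.url}`);
    },
});

await crawler.run(['https://example.com']); // placeholder URL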
genetic-orange•3y ago
I do not know how current the problem with chat.openai.com is...
Actually, a simple program with PlaywrightCrawler configured with Firefox on Linux is able to access this site. I just took a screenshot in headless mode (without a proxy, straight from a machine in a data center).
fair-roseOP•3y ago
It may be related to the trust score. I have a VPN on 24/7, but even then Crawlee is at fault, because another tool works under the same conditions.
parallel-tan•3y ago
It is always a combination of IP address + browser config. You cannot really forget about one or the other when doing blocking comparisons. Your local home IP is usually as clean as it gets (residential proxies are worse, and datacenter ones worse still).
reduced-jade•3y ago
Actually, with the stealth plugin it works even with datacenter proxies. With Crawlee's default config it did not work even with residential ones, so I think it was all about the browser config, at least in my case (G2 review pages).
fair-roseOP•3y ago
Just a note here, I will carry out detailed tests if needed:
https://stateofscraping.org/ (State of Scraping — a report about the detectability of popular scraping stacks, compiled by the Data Liberation Foundation)
https://github.com/unblocked-web/double-agent (double-agent — a test suite of common scraper detection techniques; see how detectable your scraper stack is)
conscious-sapphire•3y ago
How do I bypass or avoid this page (press & hold) while scraping Capterra reviews using an Apify Actor?
conscious-sapphire•3y ago
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
2023-04-07T03:49:37.993Z {"id":"kmPcFnRhSQM8xHs","url":"https://www.capterra.com/p/107199/Medallia-Enterprise/reviews/","retryCount":3}
2023-04-07T03:49:47.490Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
Pepa J•3y ago
This looks like a quite specific captcha. Which Actor are you using?
If you are a developer, for pressing and holding a button you may try a solution similar to the one suggested here: https://stackoverflow.com/a/68513568
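A hedged sketch of such a press-and-hold in Puppeteer, assuming a page object from the requestHandler; the selector and hold duration are placeholders, not taken from that answer:

const button = await page.waitForSelector('#challenge-button'); // hypothetical selector
const box = await button.boundingBox();
// move to the center of the button, press, hold, and release
await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
await page.mouse.down();
await new Promise((resolve) => setTimeout(resolve, 10000)); // hold ~10 s (placeholder duration)
await page.mouse.up();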
This looks like your request is being blocked. Have you tried a different proxy group (e.g. RESIDENTIAL)?
conscious-sapphire•3y ago
@Pepa J Yes, I'm using the proxy configuration below:
const { Actor } = require('apify'); // import needed for Actor.createProxyConfiguration

const proxyConfiguration = await Actor.createProxyConfiguration({
    // proxyUrls: ['http://groups-RESIDENTIAL:[email protected]:8000'],
    groups: ['RESIDENTIAL'], // Apify residential proxy group
    countryCode: 'US',
});
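Assuming that configuration resolves correctly, the remaining step is to pass it to the crawler — a minimal sketch:

const crawler = new PuppeteerCrawler({
    proxyConfiguration, // route requests through the RESIDENTIAL group
    async requestHandler({ page }) {
        // ...
    },
});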