Pass the Cloudflare browser check
Does anybody know how to pass the Cloudflare browser check with Crawlee's PlaywrightCrawler?
The site I have a problem with: https://www.g2.com/
I have tried residential proxies, no proxies, Chrome and Firefox, headful and headless, but nothing works.
My own Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows that it is an automated browser.
In the Apify Store there is a working scraper for G2. It is written in Python, but at least I know it is possible to do it.
18 Replies
absent-sapphire • 3y ago
Playwright with FF should work; I used it to bypass CF a few months ago. Please share a run or snapshot, and make sure you `await` some real content control: on page load there will be a CF check running for a while.
unwilling-turquoise OP • 3y ago
Here is the run: https://console.apify.com/view/runs/UHYyD8JIrtj68ePpW. There is not much to see, though; it just returns 403.
I got through CF many times with Playwright, but now it looks like they have improved the protection.
absent-sapphire • 3y ago
Looks like some IPs work and some do not; content was reached under Chrome after two retries: https://console.apify.com/view/runs/MWefTdPk6wfZZ3rz5
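A minimal sketch of the "await real content" advice from earlier in the thread, assuming the target page has some selector that only exists once the Cloudflare check has cleared. `PageLike` and `reachedRealContent` are hypothetical names; `PageLike` only mirrors Playwright's `page.waitForSelector`, so the snippet runs without Playwright installed.

```typescript
// Structural stand-in for the one Playwright Page method used here (assumption:
// a real Playwright `page` is passed in when this runs inside a crawler).
interface PageLike {
    waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
}

// Resolves true once `selector` appears, false if the wait rejects
// (for example because the timeout expired while the challenge was still showing).
async function reachedRealContent(
    page: PageLike,
    selector: string,
    timeoutMs = 30000,
): Promise<boolean> {
    try {
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return true;
    } catch {
        return false;
    }
}

// Possible wiring inside a PlaywrightCrawler requestHandler (not executed here;
// the selector is a placeholder you would pick for the target site):
//
//   if (!(await reachedRealContent(page, '<selector for real content>'))) {
//       session?.retire();                       // drop this session/IP
//       throw new Error('Still on the CF check'); // lets the crawler retry the request
//   }
```

Throwing from the handler hands the request back to the crawler's retry logic, which matches the observation that some IPs get through after a couple of retries.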
unwilling-turquoise OP • 3y ago
Interesting, it looks like https://www.g2.com/ works, but https://www.g2.com/products/monday-com-monday-com/reviews, for example, does not.
unwilling-turquoise OP • 3y ago
I took your config and changed only the URL, to https://www.g2.com/products/monday-com-monday-com/reviews, and the number of retries, to 20, but no luck: https://console.apify.com/view/runs/3RQyInzk9aQ0SEOJS
Any suggestions would be greatly appreciated.
frail-apricot • 3y ago
@petrpatek could add it to his repository of Cloudflare sites.
correct-apricot • 3y ago
We chatted about this in private with @HonzaS, as I encountered CF blocking too...
If I understand it correctly, CF has two modes of bot protection (with kinda confusing names TBH):
- a) Bot Management: basic
- b) Super Bot Fight Mode: advanced
The sites I'm scraping seem to use a). The solution to that seems to be pretty easy:
- use CheerioCrawler with the playwright:firefox Dockerfile
- in createSessionFunction: open Firefox (via Playwright), go to the site, let Firefox solve the JavaScript challenge, and save all the cookies and request headers to the session
- in preNavigationHooks: get the stored cookies/headers from the session and set them on gotScraping
This solution works for me both locally and on the Apify platform, without any proxies. Beware that it probably only works for sites that use the basic bot protection mode.
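The session part of the steps above could be sketched roughly like this. It is not the poster's actual code: `storeIdentity`, `applyStoredIdentity`, `cookiesToHeader` and the `cfIdentity` key are hypothetical names, and the interfaces are structural stand-ins for a Crawlee session (`userData`) and the got-scraping options object (`headers`), so the snippet runs without either library.

```typescript
// Stand-ins for the library objects this sketch touches (assumptions, see lead-in).
interface BrowserCookie { name: string; value: string; }
interface SessionLike { userData: Record<string, unknown>; }
interface GotOptionsLike { headers?: Record<string, string | undefined>; }
interface StoredIdentity { cookies: BrowserCookie[]; headers: Record<string, string>; }

// Hypothetical key under which the solved-challenge identity is kept on the session.
const IDENTITY_KEY = 'cfIdentity';

// Serialises browser cookies into a single Cookie request header value.
function cookiesToHeader(cookies: BrowserCookie[]): string {
    return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Called once per session, after the browser has passed the challenge.
function storeIdentity(
    session: SessionLike,
    cookies: BrowserCookie[],
    headers: Record<string, string>,
): void {
    const identity: StoredIdentity = { cookies, headers };
    session.userData[IDENTITY_KEY] = identity;
}

// Shaped like a pre-navigation hook: copies the stored headers and cookies
// onto the outgoing request options; does nothing if no identity was stored.
function applyStoredIdentity(
    context: { session?: SessionLike },
    gotOptions: GotOptionsLike,
): void {
    const identity = context.session?.userData[IDENTITY_KEY] as StoredIdentity | undefined;
    if (!identity) return;
    gotOptions.headers = {
        ...gotOptions.headers,
        ...identity.headers,
        cookie: cookiesToHeader(identity.cookies),
    };
}

// Possible wiring (not executed here):
//   createSessionFunction: open Firefox via Playwright, visit the site, wait for the
//     challenge to clear, then storeIdentity(session, browserCookies, capturedHeaders)
//   preNavigationHooks: [applyStoredIdentity]
```

Keeping the identity on the session means every request made with that session reuses the same cookies and headers the browser was cleared with.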
correct-apricot • 3y ago
flat-fuchsia • 3y ago
@strajk What do you use for debugging with MITM?
Is it mitmproxy?
https://mitmproxy.org/
correct-apricot • 3y ago
Proxyman
correct-apricot • 3y ago
It was crucial for discovering that it's the order of the headers that causes the issue.


frail-apricot • 3y ago
Wow, I had never heard about header order messing things up.
absent-sapphire • 3y ago
I used the same approach but with internal `fetch` calls from inside the browser. IMHO it might be more reliable, since they should be doing something logically equal to a "heartbeat" check to see if the web visitor is still online.
This should also work regardless of their internal protection mode: if the page context is reached, then `fetch` is expected to work; otherwise they (CF) would not be able to support web apps.
correct-apricot • 3y ago
@Lukas Krivka it was the first time for me too, but it's probably not too uncommon, as there's logic exactly for this in header-generator: https://github.com/apify/header-generator/blob/master/src/header-generator.ts#L208
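To illustrate what "header order" means here, a small sketch that rearranges a headers object to follow a given order. This is not header-generator's implementation: `orderHeaders` and `EXAMPLE_ORDER` are hypothetical, and the order list is only a placeholder, not a verified browser fingerprint. Whether the HTTP client then sends the headers in object key order is also an assumption to check for your client.

```typescript
type HeaderMap = Record<string, string>;

// Placeholder order for illustration only; capture the real order from
// your own browser with a debugging proxy.
const EXAMPLE_ORDER = ['host', 'user-agent', 'accept', 'accept-language', 'accept-encoding', 'cookie'];

// Returns a new object whose keys follow `order` (matched case-insensitively);
// headers not named in `order` keep their relative order and go last.
function orderHeaders(headers: HeaderMap, order: string[]): HeaderMap {
    const names = Object.keys(headers);
    const ordered: HeaderMap = {};
    for (const wanted of order) {
        const match = names.find((n) => n.toLowerCase() === wanted.toLowerCase());
        if (match !== undefined) ordered[match] = headers[match];
    }
    for (const n of names) {
        if (!(n in ordered)) ordered[n] = headers[n];
    }
    return ordered;
}

const example = orderHeaders({ cookie: 'a=1', accept: '*/*', 'user-agent': 'UA' }, EXAMPLE_ORDER);
console.log(Object.keys(example)); // [ 'user-agent', 'accept', 'cookie' ]
```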


frail-apricot • 3y ago
Thanks, good to keep in mind
unwilling-turquoise OP • 3y ago