Got captcha and HTTP 403 using PlaywrightCrawler
Got captcha and HTTP 403 when accessing wellfound.com
I get a captcha every time I access links like these (basically, any job ad on wellfound):
https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe
https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer
https://wellfound.com/company/wingspanapp/jobs/2629420-senior-software-engineer
Screenshot attached.
And this is not Cloudflare protection - it's some other anti-bot thing.
I am using:
- US residential proxies from smartproxy.com
- PlaywrightCrawler with useSessionPool: false and persistCookiesPerSession: false
- headless Firefox, both as the launcher and in fingerprintGeneratorOptions browsers
- locale en-US and timezone America/New_York (to match the US proxies)
- devices: ['desktop'] in fingerprintGeneratorOptions
- launchContext: { useIncognitoPages: true }
- pluginContent set in preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333
And still this site detects me as a robot!
Any ideas how to overcome this?
UPDATE1: the IP in the screenshot is somewhere in US/Texas...
UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...
4 Replies
extended-salmonOP•3y ago
UPDATE1: the IP in the screenshot is somewhere in US/Texas
UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...
UPDATE3: it is some variation of Cloudflare after all, I just looked into the source code of this captcha HTML... for people who like to look inside: https://wellfound.com/cdn-cgi/apps/head/JIiAUxCYLtpv-hVKsQ6mzsTHfds.js
UPDATE4: It seems that opening wellfound.com in a desktop browser, scrolling the page down to the end, and then opening one of the above links in the same window worked once or twice, with no captcha.
It looks like this thing thinks "aha, this is normal end-user behavior"
I am sure many people here have already seen this protection, so let us share the experience...
Any hints on how to fight this protection?
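The manual sequence from UPDATE4 (open the homepage, scroll down, then open the job link in the same window) could be scripted roughly as below. This is only a sketch of that observation, not a confirmed bypass: warmUpThenOpen and PageLike are hypothetical names, the step count, scroll distance, and pause are arbitrary assumptions, and it scrolls a fixed number of steps rather than detecting the end of the page. PageLike just lists the three Playwright Page members the sketch relies on (goto, mouse.wheel, waitForTimeout), so a real Playwright page can be passed in.

```typescript
// Structural subset of Playwright's Page used by this sketch;
// a real Playwright Page object satisfies it.
interface PageLike {
  goto(url: string): Promise<unknown>;
  mouse: { wheel(deltaX: number, deltaY: number): Promise<void> };
  waitForTimeout(ms: number): Promise<void>;
}

// Hypothetical helper: visit the homepage, scroll down in steps,
// then open the target URL in the same page.
async function warmUpThenOpen(
  page: PageLike,
  homeUrl: string,
  targetUrl: string,
  scrollSteps = 10, // assumption: arbitrary number of scroll steps
): Promise<void> {
  await page.goto(homeUrl);
  for (let i = 0; i < scrollSteps; i++) {
    await page.mouse.wheel(0, 800); // scroll down by 800 px (arbitrary)
    await page.waitForTimeout(300); // short pause between scrolls (arbitrary)
  }
  await page.goto(targetUrl);
}
```

Since the manual version worked only once or twice, the scripted version may be just as unreliable.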
stormy-gold•3y ago
Did you try to use headful mode?
+ modifying headers can have a positive impact
+ maybe don't use incognito mode
unwilling-turquoise•3y ago
try a higher number of retries, it can help sometimes
I think the rationale is that some IPs are burnt, so you need to go through the pool of IPs and find those that work with that site
and then ride those IPs
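The "find the IPs that work and ride them" idea can be sketched as simple bookkeeping. ProxyScoreboard below is a hypothetical helper, not part of Crawlee. It assumes that an HTTP 403 means the IP is burnt for this site and that a proxy is dropped after two consecutive blocks; both are guesses.

```typescript
// Hypothetical helper (not a Crawlee API): tracks consecutive blocks per proxy URL.
class ProxyScoreboard {
  private proxies: string[];
  private maxFailures: number;
  private failures = new Map<string, number>();

  constructor(proxies: string[], maxFailures = 2) {
    this.proxies = proxies;
    this.maxFailures = maxFailures;
    for (const p of proxies) this.failures.set(p, 0);
  }

  // Record the HTTP status seen through a proxy.
  // Assumption: 403 counts as a block; any other status resets the counter.
  record(proxy: string, status: number): void {
    const current = this.failures.get(proxy) ?? 0;
    this.failures.set(proxy, status === 403 ? current + 1 : 0);
  }

  // Proxies that have not yet reached the block limit.
  usable(): string[] {
    return this.proxies.filter((p) => (this.failures.get(p) ?? 0) < this.maxFailures);
  }

  // Usable proxy with the fewest consecutive blocks, or undefined if all are burnt.
  pick(): string | undefined {
    return this.usable().sort(
      (a, b) => (this.failures.get(a) ?? 0) - (this.failures.get(b) ?? 0),
    )[0];
  }
}
```

In Crawlee itself, the retry count is set with the maxRequestRetries crawler option. The session pool, when enabled, does similar bookkeeping by retiring sessions that get blocked, but the setup described above has useSessionPool set to false.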
extended-salmonOP•3y ago
"modifying headers can have positive impact" - can you show an example? What should be changed? I mean, there are a lot of headers...
"Did You try to use headfull mode?" - I did not. How would headful mode help? In the end it must run in headless mode anyway...
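Which headers were meant is not stated in the thread. As one guess, the extra headers could be kept consistent with the en-US locale, with a Referer pointing at the site's homepage to mimic arriving from it. buildExtraHeaders is a hypothetical helper, and the choice of Accept-Language and Referer is an assumption, not something the reply confirmed.

```typescript
// Hypothetical helper: builds extra request headers consistent with the browser locale.
// Assumption: Accept-Language and Referer are the headers worth adjusting.
function buildExtraHeaders(locale: string, referer?: string): Record<string, string> {
  const language = locale.split("-")[0]; // "en-US" -> "en"
  const headers: Record<string, string> = {
    // List the full locale first, then the bare language with a lower weight.
    "Accept-Language": language === locale ? locale : `${locale},${language};q=0.9`,
  };
  if (referer) headers["Referer"] = referer;
  return headers;
}

const extraHeaders = buildExtraHeaders("en-US", "https://wellfound.com/");
```

With Playwright, such an object can be applied through page.setExtraHTTPHeaders(extraHeaders), for example inside a preNavigationHooks function. Whether this changes the outcome on this site is untested.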