CheerioCrawler works for amazon.de but gets detected as a bot on amazon.com

Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed the online tutorial and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I use a German datacenter proxy and it works, but for the US a US datacenter proxy doesn't. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging. Below is the configuration.

    const crawler = new CheerioCrawler({
        proxyConfiguration,
        requestQueue: queue,
        useSessionPool: true,
        persistCookiesPerSession: true,
        maxRequestRetries: 20,
        maxRequestsPerMinute: 250,
        autoscaledPoolOptions: {
            maxConcurrency: 100,
            minConcurrency: 5,
            isFinishedFunction: async () => {
                // Tell the pool whether it should finish
                // or wait for more tasks to become available.
                // Return true or false
                return false;
            },
        },
        failedRequestHandler: async (context) => rebirth_requests({ ...context }),
        requestHandler: async (context) => router({ ...context, dbPool }),
        // sessionPoolOptions: { blockedStatusCodes: [] },
    });
8 Replies
optimistic-gold (OP), 3y ago
When I use this proxy on my own system with a real browser, it works. So I assume the proxy is fine and the only problem is the CheerioCrawler configuration.
following-aqua, 3y ago
Have you tried other proxies (other groups, maybe residential)? Amazon is heavily protected, and it's common for one country's site to be better protected than another's for the "same" website.
optimistic-gold (OP), 3y ago
But the point is that the same proxy works in the browser. An HTTP call made from a browser through that proxy works, but the same call from CheerioCrawler doesn't. To me it feels like the headers, cookies, etc. sent by a browser are different from what CheerioCrawler sends. Are any fingerprints used when we scrape via CheerioCrawler? I tried the same proxy in Playwright and it works, so there must be some setting in CheerioCrawler that differs.
following-aqua, 3y ago
CheerioCrawler uses got-scraping, and yes, it does use fingerprints.
optimistic-gold (OP), 3y ago
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            // ...
            gotOptions.headerGeneratorOptions = {
                // browsers: [
                //     {
                //         name: 'chrome',
                //         minVersion: 90,
                //         maxVersion: 100
                //     }
                // ],
                devices: ['mobile'],
                locales: ['en-US'],
                operatingSystems: ['ios', 'android'],
            };
        },
    ],

With these settings it's better now.
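
Since the scraper targets several marketplaces, the hook above could pick its header-generator settings per request instead of hard-coding `en-US`. The following is only a sketch under assumptions: the host-to-locale table, the `headerOptionsFor` helper, and the `marketplaceHeadersHook` name are hypothetical, and the `de-DE` locale for amazon.de is a guess rather than something tested in this thread. Only `headerGeneratorOptions` and its `devices`, `locales`, and `operatingSystems` fields come from the post above.

```javascript
// Hypothetical table: marketplace host -> header-generator overrides.
// The locale values are assumptions; adjust them to the sites you target.
const MARKETPLACE_HEADER_OPTIONS = {
    'www.amazon.com': { locales: ['en-US'] },
    'www.amazon.de': { locales: ['de-DE'] },
};

// Build `headerGeneratorOptions` for a given request URL.
// Unknown hosts fall back to en-US (also an assumption).
function headerOptionsFor(url) {
    const { hostname } = new URL(url);
    const perMarket = MARKETPLACE_HEADER_OPTIONS[hostname] ?? { locales: ['en-US'] };
    return {
        // Same mobile-only settings the OP reported as working better.
        devices: ['mobile'],
        operatingSystems: ['ios', 'android'],
        ...perMarket,
    };
}

// Pre-navigation hook with the (crawlingContext, gotOptions) shape used above.
const marketplaceHeadersHook = async ({ request }, gotOptions) => {
    gotOptions.headerGeneratorOptions = headerOptionsFor(request.url);
};

// Usage, inside the crawler config from the first post:
//   preNavigationHooks: [marketplaceHeadersHook],
```

Keeping the settings in a plain function makes it possible to check what each marketplace would receive without starting a crawl.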
adverse-sapphire, 3y ago
For me there is no problem as long as I send the user agent in the headers.
optimistic-gold (OP), 3y ago
OK, for .com or for .de?
adverse-sapphire, 3y ago
.com
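
The suggestion above (sending an explicit user agent for amazon.com) was not shown as code in the thread. A minimal sketch of one way to do it in a pre-navigation hook follows. The `explicitUserAgentHook` name and the user agent string are illustrative assumptions, and it is worth verifying that a hand-set user agent stays consistent with the other headers got-scraping generates.

```javascript
// Example desktop Chrome user agent string. This value is an assumption
// for illustration; replace it with a current, real browser string.
const EXAMPLE_USER_AGENT =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Pre-navigation hook that sets the user agent explicitly while keeping
// any headers that were already present on gotOptions.
const explicitUserAgentHook = async (_crawlingContext, gotOptions) => {
    gotOptions.headers = {
        ...gotOptions.headers,
        'user-agent': EXAMPLE_USER_AGENT,
    };
};

// Usage, inside the crawler config from the first post:
//   preNavigationHooks: [explicitUserAgentHook],
```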
