Amazon scraping stopped working suddenly last Friday
I have a scraper for Amazon pages. Everything worked fine for a month: lots of calls per actor run, and all of them ultimately succeeded. An occasional 503 came back, but retries fixed it. Since last Friday I have been getting a 503 for most calls. I was able to improve it a bit by:
- configuring proxies and retiring sessions
- residential proxies
- more headers, manually rotating user agent, etc.
- buying more IPs
Still, only 70% of my calls are working; before Friday everything was fine. Some requests have also started to return a CAPTCHA (which I hadn't seen before), and once a CAPTCHA comes back there is no way of recovering from that failure.
I use a CheerioCrawler with proxy/session initialisation roughly along the lines of the sketch below.
When a call fails (either with a 503 or because I don't find specific information on the page), I mark the session manually as bad in the errorHandler.
In the log I can see that the session is re-generated and a retry happens, but the subsequent calls fail as well.
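Roughly, the setup looks like this (a simplified sketch rather than my exact code; the proxy group, pool size, error score and selector are illustrative):

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Illustrative proxy settings - the real run uses my own groups/country.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 10,
    sessionPoolOptions: {
        maxPoolSize: 50,
        sessionOptions: {
            maxErrorScore: 3, // session is retired once its errorScore reaches this
        },
    },
    requestHandler: async ({ request, $, session }) => {
        const title = $('#productTitle').text().trim(); // illustrative selector
        if (!title) {
            // Page loaded but the expected data is missing -> treat it as a failure,
            // the errorHandler below then marks the session as bad.
            throw new Error(`No product data found on ${request.url}`);
        }
        await Actor.pushData({ url: request.url, title });
    },
    errorHandler: async ({ session }) => {
        // Runs before each retry (503s end up here too). Marking the session bad
        // increases its errorScore; at maxErrorScore it gets retired and replaced.
        session?.markBad();
    },
});

await crawler.run(['https://www.amazon.com/dp/EXAMPLEASIN']); // pre-generated URLs
await Actor.exit();
```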
My questions:
- does marking the session as bad make Apify use a new IP?
- is there a way to debug the IPs? I can get the proxy information, but that is not the same as the actual IP of the request... If it is, it seems the IPs are not rotated on failure, since a request that fails once ends up failing on all (10) subsequent retries, and all of those go to the same URL
- how can I improve my scraping in any other way? Right now it is actually unusable since 30% of my calls fail 😦
- where can I find high-level architectural documentation for Apify? Most of the docs seem to be auto-generated and rarely describe the flow/architecture of Apify/Cheerio
I would be very thankful for any help here.
Cheers,
Chris
4 Replies
robust-apricot•2y ago
Hey there! So, to answer your questions:
- marking a session as bad increases its errorScore by 1. Once the errorScore reaches maxErrorScore, the session is retired.
- you could make an additional request, e.g. to https://api.apify.com/v2/browser-info?skipHeaders=1, to get the IP address (see the snippet below this list).
- this is hard to say; it's all about trying this and that. You could try the headers provided by the fingerprint suite (enabled by default in got-scraping), a residential proxy, a real browser, retiring the session after e.g. 3 requests, etc. There is really no universal answer, but Amazon is known for rather hard blocking. You can check more here: https://docs.apify.com/academy/anti-scraping/techniques
- we hardly have any auto-generated docs. Not sure if you mean something like this https://docs.apify.com/academy/apify-scrapers/cheerio-scraper ? Or you could generally check this part of the docs: https://docs.apify.com/academy
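For the IP check, something along these lines inside the requestHandler should work (a sketch assuming Crawlee's `sendRequest` context helper, which routes through the same session and proxy as the page request itself; the product URL is a placeholder):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // proxyConfiguration and sessionPoolOptions go here, same as in your actor.
    requestHandler: async ({ request, session, proxyInfo, sendRequest, log }) => {
        // Extra request through the same session + proxy as the page request,
        // so the logged address is the IP the target site actually sees.
        const { body } = await sendRequest({
            url: 'https://api.apify.com/v2/browser-info?skipHeaders=1',
        });
        log.info(`session=${session?.id} proxy=${proxyInfo?.url} info=${body}`);
        // ...normal extraction of request.url continues here...
    },
});

await crawler.run(['https://www.amazon.com/dp/EXAMPLEASIN']);
```

And retiring a session after e.g. 3 requests should just be `maxUsageCount: 3` in `sessionPoolOptions.sessionOptions`.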
rival-blackOP•2y ago
Hey Andrey, thanks a lot for the answer. Regarding the auto-generated docs: I was actually referring to the API docs, where I would expect more than a pure description of the parameters, i.e. an example here and there as well (for instance the tip you gave me on how to inspect the IP). But thanks for pointing me to the docs, I will check them out.
I already tried using residential proxies, but it didn't improve things much...
Since I am scraping with Cheerio, I have my pre-generated URLs which I just add to the queue/list. Now, I could obviously try Puppeteer/Playwright, actually open the main page and have the browser navigate to a specific URL while pretending to be a real browser, but that sounds like a very slow approach. Alternatively, I could potentially fake it with Cheerio too, by having each product request visit the root page first and reuse the response headers/cookies to visit the product page.
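Roughly what I have in mind, as a standalone sketch with got-scraping (a shared cookie jar and one proxy URL for both requests; the URLs and the env var are just placeholders):

```js
import { gotScraping } from 'got-scraping';
import { CookieJar } from 'tough-cookie';

// One cookie jar and one proxy URL shared by the warm-up and the product request.
const cookieJar = new CookieJar();
const proxyUrl = process.env.PROXY_URL; // placeholder, e.g. from proxyConfiguration.newUrl()

// 1. Visit the root page first so its cookies land in the jar.
await gotScraping({ url: 'https://www.amazon.com/', proxyUrl, cookieJar });

// 2. Request the product page with the same cookies, proxy and generated headers.
const { statusCode, body } = await gotScraping({
    url: 'https://www.amazon.com/dp/EXAMPLEASIN',
    proxyUrl,
    cookieJar,
});
console.log(statusCode, body.length);
```

Inside the CheerioCrawler, I suppose the per-session cookies (persistCookiesPerSession) would play the same role, if I get each session to hit the root page once before its product requests.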
One last question: do you have any idea why it all seems to have stopped working last Friday? Up until then everything worked like a charm and only a few requests failed. Or is it a coincidence, and more IPs have simply been burnt by someone firing too many requests?
Thanks,
Chris
robust-apricot•2y ago
About the docs: we know they aren't perfect, and we are constantly working on improving them. I'd say we do have an examples section, but you are right, we should definitely add more small examples. Thanks for the feedback, we will definitely keep working on it 👍
About Friday: to be honest, I can't really come up with an explanation. It could be a coincidence, or some change on Amazon's side. Or maybe let it cool off for a few days; if it starts working again after 2-3 days, they probably just detected the addresses and temporarily blocked them...
fair-rose•2y ago
@CK1 did it start working again?