retryOnBlocked with HttpCrawler
Hi, I'm using HttpCrawler to scrape a static list of URLs. When I get a 403 response as a result of a Cloudflare challenge, the request is not retried even with retryOnBlocked: true. However, if I remove retryOnBlocked, I see my errorHandler being invoked and the request is retried. Am I misunderstanding retryOnBlocked?
6 Replies
Hi @triGun, can you provide us with minimal reproducible code?
wise-white•7mo ago
The errorHandler runs after every failed request; the failedRequestHandler runs only after max retries have been exhausted. Perhaps you might want to move some logic from one to the other?
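A minimal sketch of the difference (URL and handler bodies are placeholders, not your actual code):
```ts
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    maxRequestRetries: 3,
    // errorHandler runs after every failed attempt, before the retry is scheduled.
    errorHandler: async ({ request, log }, error) => {
        log.warning(`Attempt for ${request.url} failed: ${error.message}`);
    },
    // failedRequestHandler runs once, after maxRequestRetries has been exhausted.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Giving up on ${request.url}: ${error.message}`);
    },
    requestHandler: async ({ request, log }) => {
        log.info(`Succeeded: ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```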
deep-jadeOP•7mo ago
@Pepa J Not sure if this reproduces it, but in my case it led to the described result. Nothing really special: I have two proxies in my configuration, one in tier 1 and the second in tier 2.
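Roughly this shape (the proxy URLs and target URL below are placeholders, not the real ones):
```ts
import { HttpCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs, one proxy per tier as in the setup described above.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://tier1-proxy.example.com:8000'],
        ['http://tier2-proxy.example.com:8000'],
    ],
});

const crawler = new HttpCrawler({
    proxyConfiguration,
    retryOnBlocked: true, // with this present, the 403 is not retried
    errorHandler: async ({ request, log }, error) => {
        // Only gets invoked once retryOnBlocked is removed.
        log.warning(`Retrying ${request.url}: ${error.message}`);
    },
    requestHandler: async ({ request, body, log }) => {
        log.info(`Got ${body.length} bytes from ${request.url}`);
    },
});

await crawler.run(['https://example.com/static-page']);
```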
It depends on the implementation of the website.
If you experience the captcha even in a regular browser without a proxy, then you cannot get past it with HttpCrawler alone; you may need to use a browser-based solution like PuppeteerCrawler or PlaywrightCrawler.
If you don't experience the captcha in your browser, it could be about the quality of the proxies you set up: the website serves the captcha only to visitors it considers suspicious (e.g. requests coming from proxy IPs).
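If a real browser does turn out to be needed, the switch is mostly a drop-in change; a minimal sketch, reusing the same placeholder proxies and URL as above:
```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Same placeholder tiered proxies as the HttpCrawler example.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://tier1-proxy.example.com:8000'],
        ['http://tier2-proxy.example.com:8000'],
    ],
});

// Requires the 'playwright' package to be installed alongside crawlee.
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    headless: true,
    requestHandler: async ({ page, request, log }) => {
        // A real browser can execute the challenge JavaScript that plain HTTP requests cannot.
        const title = await page.title();
        log.info(`${request.url} loaded, title: ${title}`);
    },
});

await crawler.run(['https://example.com/static-page']);
```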
deep-jadeOP•6mo ago
It's OK to get a 403 as the result of the request. The problem is that while retryOnBlocked is present, the request is not retried with a different proxy tier; the retry only happens once that property is removed.