retryOnBlocked with HttpCrawler
Hi, I'm using HttpCrawler to scrape a static list of URLs. When I get a 403 response as a result of a Cloudflare challenge, the request is not retried even with retryOnBlocked: true. However, if I remove retryOnBlocked, I see my errorHandler being invoked and the request is retried. Am I misunderstanding retryOnBlocked?
6 Replies
Hi @triGun, can you provide us with minimal reproducible code?
wise-white•7mo ago
The errorHandler runs after every failed request; the failedRequestHandler runs only after max retries have been exhausted. Perhaps you might want to move some logic from one to the other?
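A minimal sketch of the difference (URL and handler bodies are placeholders, not your actual code):
```ts
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    maxRequestRetries: 3,
    // errorHandler runs after every failed attempt, before the retry is scheduled.
    errorHandler: async ({ request, log }, error) => {
        log.warning(`Attempt for ${request.url} failed: ${error.message}`);
    },
    // failedRequestHandler runs once, after maxRequestRetries has been exhausted.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Giving up on ${request.url}: ${error.message}`);
    },
    requestHandler: async ({ request, log }) => {
        log.info(`Succeeded: ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```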
deep-jadeOP•7mo ago
@Pepa J Not sure if this reproduces it, but in my case it led to the described result. Nothing really special: I have two proxies in my configuration, one in tier 1 and the second in tier 2.
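Roughly this shape (the proxy URLs and target URL below are placeholders, not the real ones):
```ts
import { HttpCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs, one proxy per tier as in the setup described above.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://tier1-proxy.example.com:8000'],
        ['http://tier2-proxy.example.com:8000'],
    ],
});

const crawler = new HttpCrawler({
    proxyConfiguration,
    retryOnBlocked: true, // with this present, the 403 is not retried
    errorHandler: async ({ request, log }, error) => {
        // Only gets invoked once retryOnBlocked is removed.
        log.warning(`Retrying ${request.url}: ${error.message}`);
    },
    requestHandler: async ({ request, body, log }) => {
        log.info(`Got ${body.length} bytes from ${request.url}`);
    },
});

await crawler.run(['https://example.com/static-page']);
```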
It depends on the implementation of the website.
If you experience the captcha even in a regular browser without a proxy, then you cannot get past it with HttpCrawler alone; you may need to use a browser-based solution like PuppeteerCrawler or PlaywrightCrawler.
If you don't experience the captcha in your browser, it could be about the quality of the proxies you set up: the website serves the captcha only to visitors it considers suspicious (e.g. requests coming from proxy IPs).
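If a real browser does turn out to be needed, the switch is mostly a drop-in change; a minimal sketch, reusing the same placeholder proxies and URL as above:
```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Same placeholder tiered proxies as the HttpCrawler example.
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        ['http://tier1-proxy.example.com:8000'],
        ['http://tier2-proxy.example.com:8000'],
    ],
});

// Requires the 'playwright' package to be installed alongside crawlee.
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    headless: true,
    requestHandler: async ({ page, request, log }) => {
        // A real browser can execute the challenge JavaScript that plain HTTP requests cannot.
        const title = await page.title();
        log.info(`${request.url} loaded, title: ${title}`);
    },
});

await crawler.run(['https://example.com/static-page']);
```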
deep-jadeOP•6mo ago
It's OK to get a 403 as the result of the request. The problem is that while retryOnBlocked is present, the request is not retried with a different proxy tier; the retry only happens once that property is removed.