How to retry failed requests after the queue has "ended"?

I just depleted my proxy quota and all the remaining requests in the queue failed. This kind of thing happens often, so how do I retry/re-enqueue the failed requests? I've been googling it for a while now and there's hardly any up-to-date info, only bits and pieces from older versions, closed GitHub issues, etc. I'm sure it's in the API docs somewhere, but they're extremely hard to navigate and usually just point to other classes/interfaces. Could such a basic thing be explained in the docs in one paragraph and a code snippet?
20 Replies
genetic-orange
genetic-orangeOP•3y ago
if (this.args['rerun-failed']) {
    // - read requests from queue
    // - re-add the failed ones
    const request = await requestQ.getRequest(requestId);
    await requestQ.reclaimRequest(request);
}
Looks like something like this would work IF I can list all the requests. This looks like it could be it:
const head = await requestQ.client.listHead({limit: 10000000});
but for some reason it comes back empty, even though I checked and the queue is not empty. It probably just filters for requests in a certain status. Or maybe the client is looking at the wrong directory, because I have a named queue, yet the client claims it has no name and is looking at bcf005dd-6b7e-4201-9196-32faba7b6d86, which isn't even there.
RequestQueueClient {
  id: 'bcf005dd-6b7e-4201-9196-32faba7b6d86',
  name: undefined,
  createdAt: 2023-03-24T13:23:04.902Z,
  accessedAt: 2023-03-24T13:23:04.902Z,
  modifiedAt: 2023-03-24T13:23:04.902Z,
  handledRequestCount: 0,
  pendingRequestCount: 0,
  requestQueueDirectory: '/home/michal/dev/crawlee-vps/storage/request_queues/bcf005dd-6b7e-4201-9196-32faba7b6d86',
  requests: Map(0) {},
  client: MemoryStorage {
    localDataDirectory: './storage',
    datasetsDirectory: '/home/michal/dev/crawlee-vps/storage/datasets',
    keyValueStoresDirectory: '/home/michal/dev/crawlee-vps/storage/key_value_stores',
    requestQueuesDirectory: '/home/michal/dev/crawlee-vps/storage/request_queues',
    writeMetadata: false,
    persistStorage: true,
    keyValueStoresHandled: [],
    datasetClientsHandled: [],
    requestQueuesHandled: [ [RequestQueueClient] ],
    __purged: true
  }
}
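For reference, opening the queue by its name first should make the client point at the right directory, and (if I understand it right) listHead only returns requests that are still pending, which would explain why a finished run where everything failed looks empty, since requests that exhaust their retries seem to get marked as handled. A rough sketch, with 'my-queue-name' as a placeholder:

import { RequestQueue } from 'crawlee';

// open the named queue explicitly, otherwise the default (unnamed) queue
// in a different storage directory gets used
const requestQ = await RequestQueue.open('my-queue-name');

// listHead should only list pending (not yet handled) requests
const { items } = await requestQ.client.listHead({ limit: 1000 });
console.log(items.map((r) => r.url));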
Or maybe only the active requests are kept in that folder, which acts like the filter I was talking about, so basically there is no way to re-enqueue failed requests right now? I ended up reading the JSONs from the storage and re-adding them manually. I tried both reclaiming the request and re-enqueueing, and neither works. Re-claiming:
const requests = this.getFailedRequests(rqName);

for (let request of requests) {
    const data = request.data; // parsed from the request.json string
    await requestQ.reclaimRequest(data); // tried running it with both request and request.data
}
Re-enqueue:
const scheduleRequests = requests.map((request) => {
    const data = request.data;

    return {
        url: data.url,
        label: data.label,
        userData: data.userData,
    };
});

// nope, they get ignored
await requestQ.addRequests(scheduleRequests);
Okay, from now on I'm using GitHub full-text search instead of the documentation. This is the solution:
await requestQ.addRequest(new Request({
    url,
    userData,
    uniqueKey: 'reclaim_' + new Date().getTime()
}));
What bugs me, though, is that I want to avoid re-adding them multiple times, so I just want to use the URL as the unique key and revive the request to allow it to be processed again. Also, since I'm reading them myself, how do I properly filter for the failed ones? I see there is __crawlee in userData where I can maybe update the state to allow it to be processed again. I guess I could 1) read the request data, 2) clear and re-create the request queue, 3) re-add the requests, but that seems unnecessary.
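One way to avoid re-adding the same URL over and over is to derive the key from the URL plus a rerun counter instead of a timestamp, so running the script twice doesn't enqueue a second copy. A rough sketch (failed is assumed to be the list of rows read back from the queue storage, each with url and userData):

import { Request } from 'crawlee';

for (const { url, userData } of failed) {
    const rerun = (userData.__rerun || 0) + 1;
    await requestQ.addRequest(new Request({
        url,
        // deterministic key, unlike 'reclaim_' + Date.now(), so re-running
        // the re-enqueue step is idempotent for a given rerun number
        uniqueKey: `rerun${rerun}_${url}`,
        userData: { ...userData, __rerun: rerun },
    }));
}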
ambitious-aqua
ambitious-aqua•3y ago
I tried to read all your messages carefully. Correct me if I'm wrong: the default reclaiming strategy with maxRetriesCount (I guess it's 3) does not work for your use case?
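For reference, the crawler option is maxRequestRetries (3 by default, if I remember right), something like:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // failed requests are reclaimed and retried automatically up to this limit
    maxRequestRetries: 5,
    async requestHandler({ page, request }) {
        // ... normal handling
    },
});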
genetic-orange
genetic-orangeOP•3y ago
Thanks for the reply. I don't know what to say, I tried everything and it doesn't re-run the request. The key point to understand here is that when the proxy died on me, all the retries were exhausted within minutes, so I have requests in a completely failed state and wanted to find a way to continue from where I left off. I ended up doing this:
if (rqName != 'default') { // this line is irrelevant here, I just don't want to reformat the whole thing. I always use a named queue.
    const requestQ = await RequestQueue.open(rqName);
    options.requestQueue = requestQ;

    // raise if trying to clear at the same time
    if (this.args['rerun-failed']) {
        const requests = this.getFailedRequests(rqName);

        // it is always the same amount, even if I let some requests finish, not sure if correct
        console.log("failed requests count:", requests.length);

        //const scheduleRequests = requests.map((request) => {
        for (let request of requests) {
            const { userData, url } = JSON.parse(request.json);

            const rerun = userData.__rerun = (userData.__rerun || 0) + 1;
            const uniqueKey = `rerun${rerun}_${url}`;

            console.log('Re-adding: ', uniqueKey);

            await requestQ.addRequest(new Request({
                url,
                uniqueKey,
                userData: {
                    ...userData,
                    __rerun: rerun,
                    // this should make it waiting
                    __crawlee: { state: 0 }
                },
            }));
        }
    }

    if (this.args.clear) {
        await options.requestQueue.drop();
        options.requestQueue = await RequestQueue.open(nameWithVersion);
    }
}

this.crawler = new PuppeteerCrawler(options);

return this.crawler;
}
getFailedRequests(dataset) {
    const dir = `./storage/request_queues/${dataset}`;
    const results = fs.readdirSync(dir);

    //console.log({dir, results})

    const data = {};

    for (let file of results) {
        const row = JSON.parse(fs.readFileSync(dir + '/' + file, 'utf8'));

        row.data = JSON.parse(row.json);

        if (data[row.url]) {
            // override with newer re-run data so we have all the fresh data
            if (row.data.userData.__rerun > data[row.url].data.userData.__rerun) {
                data[row.url] = row;
            }
            // also override when the stored row has no __rerun yet
            if (row.data.userData.__rerun && !data[row.url].data.userData.__rerun) {
                data[row.url] = row;
            }
        } else {
            data[row.url] = row;
        }
    }

    return Object.values(data).filter((row) => {
        return row.data.userData.__crawlee.state == 5;
    });
}
So this basically goes into the folder, reads the request queue records, and manually re-enqueues the ones that failed. The problem with doing this in a nice way is that there seems to be no way to get failed requests at all, and there is no official way to rerun them once they fail completely. So I end up marking my re-runs with a number, and once a re-run succeeds I know not to rerun that URL.
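In hindsight, another option would be to capture them at run time instead of digging through the queue folder. A sketch, assuming failedRequestHandler only fires once all retries are exhausted (the 'failed-requests' dataset name is just a placeholder):

import { Dataset, PuppeteerCrawler } from 'crawlee';

const failedStore = await Dataset.open('failed-requests');

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // ... normal handling
    },
    // called only after maxRequestRetries has been exhausted for a request
    async failedRequestHandler({ request }, error) {
        await failedStore.pushData({
            url: request.url,
            userData: request.userData,
            errorMessage: error.message,
        });
    },
});

A later run could then read that dataset back with getData() and re-add each entry with a fresh uniqueKey.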
MEE6
MEE6•3y ago
@Michal just advanced to level 5! Thanks for your contributions! 🎉
genetic-orange
genetic-orangeOP•3y ago
It wasn't fun burning most of the 12 hours I worked today on this, but at least I got it solved now. I can't believe this is not supported out of the box. I'm exaggerating a bit, it was more like 10, but it still sucks to get stuck on something so trivial.
ambitious-aqua
ambitious-aqua•3y ago
I've just googled what I thought you were trying to do and found this actor: https://apify.com/lukaskrivka/rebirth-failed-requests Let me see why you can't get the failed requests. I'm not using the queue directly.
Apify
Rebirth failed requests · Apify
Rebirth failed requests of past runs into a pristine state with no retries so you can rescrape them by resurrecting the run.
genetic-orange
genetic-orangeOP•3y ago
hey that's a useful resource, thanks!
ambitious-aqua
ambitious-aqua•3y ago
Yeah, completely agree on this. Happens to me quite often as well
genetic-orange
genetic-orangeOP•3y ago
I will have to go through some of those recipes; when I googled it, I didn't get this webpage.
ambitious-aqua
ambitious-aqua•3y ago
There are not that many crawlee related recipes out there
genetic-orange
genetic-orangeOP•3y ago
Gotcha. Another one I found useful was about downloading images; I ended up using only 5% of it but it saved a lot of time.
ambitious-aqua
ambitious-aqua•3y ago
GitHub
crawlee/request-queue.ts at master · apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. - crawlee/request-queue.ts at master · apify/crawlee
MEE6
MEE6•3y ago
@yellott just advanced to level 5! Thanks for your contributions! 🎉
genetic-orange
genetic-orangeOP•3y ago
Not sure what this part means:
if (request.orderNo) {
    items.push(request);
}
But from what I gathered, the request queue is not in datasets; it's only the processed requests that end up in datasets... I think. Anyway, I was able to put together a solution I'm happy with for now. It's simpler than what I saw in Lukas's actor repo.
ambitious-aqua
ambitious-aqua•3y ago
This means it won't get into the queue, as far as I can see from the code, since it was either completed or failed completely.
genetic-orange
genetic-orangeOP•3y ago
Yep, those requests failed completely. The key to re-entering them into the queue is to pick a new uniqueKey.
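Just to illustrate the dedup behaviour (a sketch, assuming addRequest silently ignores a uniqueKey the queue has already seen; requestQ and url come from the earlier snippets):

import { Request } from 'crawlee';

// the original key was already handled, so this is a no-op
await requestQ.addRequest(new Request({ url, uniqueKey: url }));

// a key the queue has never seen gets enqueued again
await requestQ.addRequest(new Request({ url, uniqueKey: `rerun1_${url}` }));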
ambitious-aqua
ambitious-aqua•3y ago
The actor is just a more universal solution, I think, and can be used from the outside.
genetic-orange
genetic-orangeOP•3y ago
Yeah, but he also uses labels in a different way in that example. I'm using them to identify different page types, so I can't stop using them for that. But we both ended up tracking a retry/rebirth count 😄 I'll revisit his solution if mine stops working.
sensitive-blue
sensitive-blue•3y ago
https://apify.com/lukaskrivka/rebirth-failed-requests should work for Apify platform actors. For local runs there isn't an option to list all requests from the queue, so it would have to work with the filesystem, but the approach will be similar.
genetic-orange
genetic-orangeOP•3y ago
Yep, that's what I ended up doing, thanks for the link.
