Scrape the subpages of a website: depth variable possible?

Hi guys, I searched for this but could not find an answer. I am building a web crawler, which I have implemented so far using Puppeteer only. There I can use custom JS to control the depth of a query, basically how often the recursive function is called. I tried Crawlee and it is so good! But unfortunately it collects way too many links, even some I don't need. Hence my question: is it possible to set the depth of the crawler? E.g. it should only scrape the links of the website and not explore those links further. Thanks!!
5 Replies
genetic-orange · 2y ago
Yes, it is possible. Just enqueue the links with some label, then create a special handler function for that label, where you know not to enqueue anything more.
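A minimal sketch of this label approach, assuming a Puppeteer-based Crawlee crawler (the DETAIL label and the start URL are placeholders):

import { PuppeteerCrawler } from "crawlee";

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);

        // Requests that carry the DETAIL label are leaf pages:
        // scrape them, but do not enqueue anything further.
        if (request.label === "DETAIL") return;

        // Start page: enqueue its links one level deep, tagged as leaves.
        await enqueueLinks({ label: "DETAIL" });
    },
});

await crawler.run(["http://www.url.de"]);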
Pepa J · 2y ago
Hi @pzraraz, you may also use transformRequestFunction to set up userData, which is passed to the next-level requests. This way you may set userData.level = 0 when it is not set, otherwise userData.level = request.userData.level + 1, and then you may set up a condition for enqueueing:
if (!request.userData.level || request.userData.level < 5) {
    await enqueueLinks({...});
}
So it will not continue queueing new requests past that level. I am not sure which version of the SDK the scraper uses, but you may check https://docs.apify.com/sdk/js/docs/2.3/api/utils to at least get the idea.
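Put together, a runnable sketch of this level-counting approach might look as follows (MAX_DEPTH is a name introduced here; the value 5 mirrors the condition in the snippet above):

import { PuppeteerCrawler } from "crawlee";

const MAX_DEPTH = 5; // mirrors the `< 5` check in the snippet above

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        // Depth of the current page; 0 for the start URL.
        const level = request.userData.level ?? 0;
        log.info(`${request.url} (level ${level})`);

        if (level < MAX_DEPTH) {
            await enqueueLinks({
                // Stamp every discovered link with the next depth level.
                transformRequestFunction: (req) => {
                    req.userData = { level: level + 1 };
                    return req;
                },
            });
        }
    },
});

await crawler.run([{ url: "http://www.url.de", userData: { level: 0 } }]);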
other-emerald (OP) · 2y ago
Thank you both @HonzaS @Pepa J for the provided sources, I'll check them out. Big thx! I worked out something like this, but it's not working for the depth; it still gives me way too many links. Any ideas?
import { PuppeteerCrawler } from "crawlee";

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);

        if (!request.userData.level || request.userData.level < 0) {
            await enqueueLinks({
                transformRequestFunction: (request) => {
                    return {
                        ...request,
                        userData: {
                            level: request.userData.level + 1,
                        },
                    };
                },
            });
        }
    },
});

await crawler.run([
    {
        url: "http://www.url.de",
        userData: {
            level: 0,
        },
    },
]);
Pepa J · 2y ago
@pzraraz This should work. Does it occur for other websites too?
variable-lime · 2y ago
@pzraraz It might not be working because you are accessing the request instance that transformRequestFunction receives: that parameter shadows the outer request, and its userData does not contain the parent's level, so request.userData.level is undefined inside it. Please try the following:
const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        const { url, userData } = request;

        log.info(url);

        if (!userData.level || userData.level < 0) {
            await enqueueLinks({
                transformRequestFunction: (request) => {
                    return {
                        ...request,
                        userData: {
                            level: userData.level + 1,
                        },
                    };
                },
            });
        }
    },
});
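Note on the condition: with `!userData.level || userData.level < 0`, links are enqueued only when level is 0 (falsy), because `level < 0` is never true for the positive levels set by transformRequestFunction. The crawl therefore stops one level below the start URL, which is exactly what the question asked for. For a deeper crawl, compare level against an explicit maximum instead, as in the `< 5` snippet earlier in the thread.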
