Scrape the subpages of a website: depth variable possible?

Hi guys, I searched for this but could not find an answer. I am building a web crawler, which I have implemented so far using Puppeteer only. There I can use custom JS to control the depth of a query, basically how often the recursive function is called. I tried Crawlee and it is so good! But unfortunately it collects way too many links, even some I don't need. Hence my question: is it possible to set the depth of the crawler? E.g. it should only scrape the links of the website and not explore those links further. Thanks!!
5 Replies
genetic-orange · 2y ago
Yes, it is possible. Just enqueue the links with some label, then create a special handler function for that label, where you know not to enqueue anything more.
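A minimal sketch of this label approach, assuming a Puppeteer-based Crawlee crawler (the DETAIL label and the start URL are placeholders):

import { PuppeteerCrawler } from "crawlee";

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);

        // Requests that carry the DETAIL label are leaf pages:
        // scrape them, but do not enqueue anything further.
        if (request.label === "DETAIL") return;

        // Start page: enqueue its links one level deep, tagged as leaves.
        await enqueueLinks({ label: "DETAIL" });
    },
});

await crawler.run(["http://www.url.de"]);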
Pepa J · 2y ago
Hi @pzraraz, you may also use transformRequestFunction to set up userData, which is passed to the next-level requests. This way you may set userData.level = 0 when it is not set, otherwise userData.level = request.userData.level + 1, and then you may set up a condition for enqueueing:
if (!request.userData.level || request.userData.level < 5) {
    await enqueueLinks({...});
}
So it will not continue queueing new requests past that level. I am not sure which version of the SDK the scraper uses, but you may check https://docs.apify.com/sdk/js/docs/2.3/api/utils to at least get the idea.
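Put together, a runnable sketch of this level-counting approach might look as follows (MAX_DEPTH is a name introduced here; the value 5 mirrors the condition in the snippet above):

import { PuppeteerCrawler } from "crawlee";

const MAX_DEPTH = 5; // mirrors the `< 5` check in the snippet above

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        // Depth of the current page; 0 for the start URL.
        const level = request.userData.level ?? 0;
        log.info(`${request.url} (level ${level})`);

        if (level < MAX_DEPTH) {
            await enqueueLinks({
                // Stamp every discovered link with the next depth level.
                transformRequestFunction: (req) => {
                    req.userData = { level: level + 1 };
                    return req;
                },
            });
        }
    },
});

await crawler.run([{ url: "http://www.url.de", userData: { level: 0 } }]);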
other-emerald (OP) · 2y ago
Thank you both @HonzaS @Pepa J for the provided sources, I'll check them out. Big thx! I worked out something like this, but it's not working for the depth; it still gives me way too many links. Any ideas?
import { PuppeteerCrawler } from "crawlee";

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);

        if (!request.userData.level || request.userData.level < 0) {
            await enqueueLinks({
                transformRequestFunction: (request) => {
                    return {
                        ...request,
                        userData: {
                            level: request.userData.level + 1,
                        },
                    };
                },
            });
        }
    },
});

await crawler.run([
    {
        url: "http://www.url.de",
        userData: {
            level: 0,
        },
    },
]);
Pepa J · 2y ago
@pzraraz This should work. Does it occur for other websites too?
variable-lime · 2y ago
@pzraraz It might not be working because you are accessing the request instance that transformRequestFunction receives: that parameter shadows the outer request, and its userData does not contain the parent's level, so request.userData.level is undefined inside it. Please try the following:
const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        const { url, userData } = request;

        log.info(url);

        if (!userData.level || userData.level < 0) {
            await enqueueLinks({
                transformRequestFunction: (request) => {
                    return {
                        ...request,
                        userData: {
                            level: userData.level + 1,
                        },
                    };
                },
            });
        }
    },
});
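Note on the condition: with `!userData.level || userData.level < 0`, links are enqueued only when level is 0 (falsy), because `level < 0` is never true for the positive levels set by transformRequestFunction. The crawl therefore stops one level below the start URL, which is exactly what the question asked for. For a deeper crawl, compare level against an explicit maximum instead, as in the `< 5` snippet earlier in the thread.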
