Scrape the subpages of a website: depth variable possible?
Hi guys, I searched for this but could not find an answer. I am building a web crawler, which I have so far implemented using only Puppeteer. There I can use custom JS to control the depth of a query, i.e. how often the recursive function is called.
I tried Crawlee and it is really good! But unfortunately it collects far too many links, including some I don't need. Hence my question:
Is it possible to set the depth of the crawler? E.g. it should only scrape the links on the start page and not explore those links any further.
Thanks!!
5 Replies
genetic-orange•2y ago
Yes, it is possible. Just enqueue the links with a label, then create a special handler function for that label; in that handler you know you do not want to enqueue anything more.
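A minimal sketch of this label approach. In Crawlee you would register the handlers with a router (`router.addHandler(label, handler)`) and call `enqueueLinks({ label })`; here the routing and queue are modeled as plain functions so the idea is runnable on its own, and the URLs are illustrative:

```javascript
// Links found on the start page are enqueued with the label 'DETAIL'.
// The 'DETAIL' handler scrapes but enqueues nothing, so the crawl
// stops one level deep.

function handleStartPage(request, enqueue) {
  // Pretend these links were extracted from the start page.
  const found = ['https://example.com/a', 'https://example.com/b'];
  for (const url of found) {
    enqueue({ url, label: 'DETAIL' }); // mark next-level requests
  }
}

function handleDetailPage(request, enqueue) {
  // Scrape data here, but deliberately enqueue nothing:
  // requests labeled 'DETAIL' are leaves of the crawl.
}

function route(request, enqueue) {
  if (request.label === 'DETAIL') return handleDetailPage(request, enqueue);
  return handleStartPage(request, enqueue);
}

// Simulated crawl loop: process the queue until it is empty.
function crawl(startUrl) {
  const queue = [{ url: startUrl }];
  const visited = [];
  while (queue.length > 0) {
    const req = queue.shift();
    visited.push(req.url);
    route(req, (r) => queue.push(r));
  }
  return visited;
}
```

Running `crawl('https://example.com')` visits the start page plus its two labeled links and then stops, because the `DETAIL` handler never enqueues anything.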
Hi @pzraraz, you may also use transformRequestFunction to set up userData, which is passed on to the next-level requests. This way you can set userData.level = 0 when it is not set yet, and otherwise userData.level = request.userData.level + 1, and then add a condition on enqueueing so it will not continue queueing new requests.
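A sketch of that level-tracking condition, assuming Crawlee's convention that a transformRequestFunction receives each candidate request and returns it (possibly modified) or a falsy value to skip it. MAX_DEPTH and makeTransform are illustrative names, not Crawlee APIs:

```javascript
const MAX_DEPTH = 1; // illustrative depth limit

// Build a transform for the page currently being processed;
// currentLevel would come from that page's request.userData.level
// (undefined for the start request).
function makeTransform(currentLevel) {
  return (req) => {
    const nextLevel = (currentLevel ?? 0) + 1;
    if (nextLevel > MAX_DEPTH) return false; // too deep: do not enqueue
    return { ...req, userData: { ...(req.userData ?? {}), level: nextLevel } };
  };
}
```

From the start page (level undefined, treated as 0) new requests get level 1 and are enqueued; from a level-1 page the transform returns false, so nothing deeper is queued.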
I am not sure which version of the SDK the scraper uses, but you may check https://docs.apify.com/sdk/js/docs/2.3/api/utils to at least get the idea.
other-emeraldOP•2y ago
Thank you both @HonzaS @Pepa J for the provided sources, I'll check them out. Big thx!
Something like this is what I worked out, but it's not limiting the depth; it still gives me way too many links. Any ideas?
@pzraraz This should work, does it occur for other websites as well?
variable-lime•2y ago
@pzraraz it might not be working because of how you are accessing the request instance inside the transformRequestFunction. Please try it the following way:
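The snippet that followed is not preserved in this thread, but the suggested correction was presumably along these lines: read the parent's level from the handler's own request before building the transform, and only use that captured value inside it. All names here (MAX_DEPTH, enqueueDecisionsFor) are illustrative:

```javascript
const MAX_DEPTH = 1; // illustrative limit

// handlerRequest simulates the `request` object of the page being
// processed; candidateUrls simulates the links found on that page.
function enqueueDecisionsFor(handlerRequest, candidateUrls) {
  // Compute the parent's level ONCE, outside the transform function.
  const parentLevel = handlerRequest.userData?.level ?? 0;

  const transformRequestFunction = (req) => {
    const level = parentLevel + 1;
    if (level > MAX_DEPTH) return false; // drop requests beyond the limit
    return { ...req, userData: { level } };
  };

  // In a real Crawlee handler this function would be passed to
  // enqueueLinks({ transformRequestFunction }); here we just apply it.
  return candidateUrls.map((url) => transformRequestFunction({ url }));
}
```

The key design point is the closure: transformRequestFunction references only the captured parentLevel, never the in-flight handler request, so each candidate request gets a consistent level and anything past the limit is dropped.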