Enqueue_links only on match in url path? Cancel request in pre_navigation_hook?
I have set up my handler so that it only enqueues links that match certain keywords. The problem is that I want the code to check only the URL path and not the full URL.
To give an example:
Let's say I only want to enqueue links where the keyword "team" or "about" is part of the URL path.
When crawling www.example.com, if it finds a URL such as www.example.com/team, I want that URL to be queued.
When crawling www.my-team.com, it would match every URL on that website because "team" is part of the domain name. That is not the behaviour I want.
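For illustration, here is a minimal sketch of the path-only check described above, using only Python's standard `urllib.parse`. The helper name `path_has_keyword` and the `KEYWORDS` tuple are hypothetical and are not part of Crawlee.

```python
from urllib.parse import urlsplit

KEYWORDS = ("team", "about")  # illustrative keywords taken from the example above


def path_has_keyword(url: str, keywords=KEYWORDS) -> bool:
    """Return True if any keyword occurs in the URL path (host and query are ignored)."""
    # A bare host such as "www.example.com/team" has no scheme, so urlsplit would
    # treat the whole string as a path. Prefixing "//" makes the host parse as netloc.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    path = urlsplit(url).path.lower()
    return any(keyword in path for keyword in keywords)
```

With this check, www.example.com/team matches, while www.my-team.com and www.my-team.com/contact do not, because the domain name is never inspected.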
I thought of using a pre_navigation_hook and check there again with the following code, but I don't think it's possible to cancel a request that is already queued?
In the docs I found something like
await request_list.mark_request_as_handled(request)
but I don't think I have access to a request_list or something similar in the PlaywrightPreNavCrawlingContext.
It would be great if someone could point me in the right direction!
This post was marked as solved by ROYOSTI.
optimistic-gold•4mo ago
Hey @ROYOSTI
A PR is now in the works that will let you easily customize this behavior: https://github.com/apify/crawlee-python/pull/923
Until it is released, there are several ways to solve it.
1. You can try setting up a selector that selects only the links you need.
2. You do not necessarily need to use enqueue_links.
I noticed you're using Playwright. You can use route so you don't have to make a real request.
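As a rough sketch of the second option (skipping enqueue_links and filtering the links yourself), something like the following could work. The helper `select_links` and the `KEYWORDS` tuple are hypothetical names for illustration. The commented handler lines at the end assume the Crawlee context exposes `add_requests` and a Playwright `page`, which you should verify against your installed version.

```python
from urllib.parse import urljoin, urlsplit

KEYWORDS = ("team", "about")  # illustrative keywords from the question


def select_links(base_url: str, hrefs, keywords=KEYWORDS) -> list[str]:
    """Resolve hrefs against base_url and keep those whose path contains a keyword."""
    selected: list[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        # Only the path is checked, so a keyword in the domain name is ignored.
        if any(k in parts.path.lower() for k in keywords) and absolute not in selected:
            selected.append(absolute)
    return selected


# Assumed usage inside a request handler (not verified against a specific release):
#     hrefs = await context.page.eval_on_selector_all(
#         "a[href]", "els => els.map(e => e.getAttribute('href'))"
#     )
#     await context.add_requests(select_links(context.request.url, hrefs))
```

Filtering before the requests are added avoids having to cancel anything later in a pre_navigation_hook.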