Keeping track of the parent page with PlaywrightCrawler
Hi! I'm using Crawlee as an e2e test for broken links and generated diagrams in our documentation website. So far it's been successful and the only thing I'm missing is figuring out what page actually contained the broken link.
For example, this is the snippet I use to find pages that display the 404 message:
This will log the actual URL that does not exist, but I can't tell which page contained that URL. What's the easiest way to find this information? Sort of like a History API? Thanks!
6 Replies
unwilling-turquoise•3y ago
so you want to know from which url this url was enqueued?
stormy-goldOP•3y ago
Exactly!
This is what I came up with in the meantime:
unwilling-turquoise•3y ago
Yeah, this is how I would do it. Just put the URL in the request's userData.
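The snippet from the post above did not survive in this archive, so what follows is only a minimal sketch of the userData idea, not the original poster's code. The helper names `withParent` and `describeBrokenLink` and the `parentUrl` key are hypothetical. The request shape (`url` plus `userData`) is the one Crawlee accepts in `crawler.addRequests()`, and `request.userData` is readable again inside the request handler.

```typescript
// Shape of a request carrying its parent page; `parentUrl` is a hypothetical key.
interface ChildRequest {
  url: string;
  userData: { parentUrl: string };
}

// Build requests for every link found on a page, tagging each with that page's URL.
// Relative links are resolved against the parent so the stored URLs are absolute.
function withParent(parentUrl: string, links: string[]): ChildRequest[] {
  return links.map((link) => ({
    url: new URL(link, parentUrl).href,
    userData: { parentUrl },
  }));
}

// Format the log line for a request whose page showed the 404 message.
// Start URLs have no parent, so fall back to a placeholder.
function describeBrokenLink(request: {
  url: string;
  userData: { parentUrl?: string };
}): string {
  const parent = request.userData.parentUrl ?? '(start URL)';
  return `Broken link ${request.url} found on ${parent}`;
}

// Assumed usage inside a PlaywrightCrawler requestHandler:
//   const hrefs = await page.$$eval('a[href]', (as) => as.map((a) => a.getAttribute('href') ?? ''));
//   await crawler.addRequests(withParent(request.loadedUrl ?? request.url, hrefs.filter(Boolean)));
//   ...and when the 404 message is detected:
//   log.error(describeBrokenLink(request));
```

`enqueueLinks()` also takes a `userData` option, so `enqueueLinks({ userData: { parentUrl: request.url } })` should give the same result with less code if you don't need to process the links yourself.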
stormy-goldOP•3y ago
Thanks for validating the idea! I just wanted to check if there was an existing method/property I could use out of the box, but this works too.
xenial-black•3y ago
@kobeljic We've been doing some similar things. You might consider maintaining some kind of master list of which pages point to which other pages, even if it's lightweight, so that if multiple pages link to the broken URL you catch them all.
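A rough sketch of such a master list, assuming it is kept in memory for the duration of the crawl. The reason it helps: the request queue deduplicates requests by URL, so a value stored in userData only remembers the first page that enqueued a link. The `LinkGraph` class and its method names are hypothetical, not part of Crawlee.

```typescript
// Maps each link target to the set of pages that link to it.
class LinkGraph {
  private referrers = new Map<string, Set<string>>();

  // Call this for every link seen on a page, before enqueueing,
  // so links the queue later skips as duplicates are still recorded.
  record(parentUrl: string, targetUrl: string): void {
    let parents = this.referrers.get(targetUrl);
    if (!parents) {
      parents = new Set<string>();
      this.referrers.set(targetUrl, parents);
    }
    parents.add(parentUrl);
  }

  // All pages known to link to the given URL, sorted for stable output.
  referrersOf(targetUrl: string): string[] {
    return [...(this.referrers.get(targetUrl) ?? [])].sort();
  }

  // Build a report: broken URL -> every page that contains it.
  report(brokenUrls: Iterable<string>): Record<string, string[]> {
    const out: Record<string, string[]> = {};
    for (const url of brokenUrls) {
      out[url] = this.referrersOf(url);
    }
    return out;
  }
}

// Assumed usage: create one LinkGraph next to the crawler, call
// graph.record(request.url, href) for each link in the request handler,
// collect the URLs that show the 404 message, and print
// graph.report(brokenUrls) once crawler.run() resolves.
```

Keeping this separate from userData means the 404 check and the "who links here" lookup stay independent, at the cost of holding the whole link map in memory.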