Keeping track of the parent page with PlaywrightCrawler

Hi! I'm using Crawlee as an e2e test for broken links and generated diagrams on our documentation website. So far it's been successful, and the only thing I'm missing is a way to tell which page actually contained the broken link. For example, this is the snippet I use to find pages that display the 404 message:
async requestHandler({ request, page, enqueueLinks, log }) {
    // check if Docusaurus handled 404
    const isDocusaurus404 = await page
        .locator(".terminal-body")
        .getByText("404")
        .count();

    if (isDocusaurus404) {
        console.log({ url: page.url() });
    }

    await enqueueLinks();
},
This will log the actual URL that does not exist, but I can't tell which page contained that URL. What's the easiest way to find this information? Sort of like a History API? Thanks!
unwilling-turquoise•3y ago
So you want to know which URL this URL was enqueued from?
stormy-gold (OP)•3y ago
Exactly! This is what I came up with in the meantime:
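// collected { url, parentUrl } records for pages that rendered the 404 message
const brokenLinks = new Set();

async requestHandler({ request, page, enqueueLinks, log }) {
    // check if Docusaurus handled 404
    const isDocusaurus404 = await page
        .locator(".terminal-body")
        .getByText("404")
        .count();

    if (isDocusaurus404) {
        // parentUrl is undefined for start URLs, which have no parent
        const { parentUrl } = request.userData;
        brokenLinks.add({ url: request.url, parentUrl });
    }

    // tag every link enqueued from this page with the page's own URL
    await enqueueLinks({ userData: { parentUrl: page.url() } });
},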
unwilling-turquoise•3y ago
Yeah, this is how I would do it. Just put the URL in the userData.
stormy-gold (OP)•3y ago
Thanks for validating the idea! I just wanted to check if there was an existing method/property I could use out of the box, but this works too.
xenial-black•3y ago
@kobeljic We've been doing some similar things. You might consider maintaining some kind of master list of which pages point to which other pages, even if it's lightweight, so that if multiple pages link to the broken URL you catch them all.
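A minimal sketch of such a master list, in plain JavaScript with no Crawlee dependency. The `LinkGraph` class, its method names, and the example URLs are all hypothetical and only illustrate the idea of mapping each target URL to every page that links to it:

```javascript
// Hypothetical helper (not part of Crawlee): records which pages link to which URLs.
class LinkGraph {
    constructor() {
        // target URL -> Set of page URLs that contain a link to it
        this.referrers = new Map();
    }

    // Record every link found on the page at `pageUrl`.
    addLinks(pageUrl, linkUrls) {
        for (const target of linkUrls) {
            if (!this.referrers.has(target)) {
                this.referrers.set(target, new Set());
            }
            this.referrers.get(target).add(pageUrl);
        }
    }

    // All pages known to link to `targetUrl`, sorted for stable output.
    referrersOf(targetUrl) {
        return [...(this.referrers.get(targetUrl) ?? [])].sort();
    }

    // Pair each broken URL with every page that links to it.
    report(brokenUrls) {
        return [...brokenUrls].map((url) => ({ url, linkedFrom: this.referrersOf(url) }));
    }
}

// Example with made-up URLs: two pages link to the same missing page.
const graph = new LinkGraph();
graph.addLinks('https://docs.example.com/a', ['https://docs.example.com/missing']);
graph.addLinks('https://docs.example.com/b', [
    'https://docs.example.com/missing',
    'https://docs.example.com/a',
]);
console.log(graph.report(['https://docs.example.com/missing']));
```

One way to wire this into the handler, assuming the links are read from the page itself, would be to collect the hrefs with something like `page.locator('a[href]').evaluateAll((els) => els.map((el) => el.href))` and pass them to `graph.addLinks(page.url(), hrefs)` before calling `enqueueLinks()`. Unlike a single `parentUrl` in `userData`, which only captures the page that enqueued the request first, this keeps every referrer.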
