Crawlee & Apify•3y ago

Keeping track of the parent page with PlaywrightCrawler

Hi! I'm using Crawlee as an e2e test for broken links and generated diagrams in our documentation website. So far it's been successful and the only thing I'm missing is figuring out what page actually contained the broken link. For example, this is the snippet I use to find pages that display the 404 message:

async requestHandler({ request, page, enqueueLinks, log }) {
    // check if Docusaurus handled 404
    const isDocusaurus404 = await page
      .locator(".terminal-body")
      .getByText("404")
      .count();

    if (isDocusaurus404) {
      console.log({ url: page.url() });
    }

    await enqueueLinks();
  },

This will log the actual URL that does not exist, but I can't tell which page contained that URL. What's the easiest way to find this information? Sort of like a History API? Thanks!

6 Replies

unwilling-turquoise•3y ago

so you want to know from which url this url was enqueued?

stormy-goldOP•3y ago

Exactly! This is what I came up with in the meantime:

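  // Plain array: a Set would not deduplicate these entries,
  // because every object literal is a distinct reference.
  const brokenLinks = [];

  async requestHandler({ request, page, enqueueLinks, log }) {
    // check if Docusaurus handled 404
    const isDocusaurus404 = await page
      .locator(".terminal-body")
      .getByText("404")
      .count();

    if (isDocusaurus404) {
      // parentUrl is undefined for start URLs, which no page enqueued
      const { parentUrl } = request.userData;
      brokenLinks.push({ url: request.url, parentUrl });
    }

    // tag every link found on this page with the page that contained it
    await enqueueLinks({ userData: { parentUrl: page.url() } });
  },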

unwilling-turquoise•3y ago

Yeah, this is how I would do it. Just put the URL into the userData.
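
Once the parent URL travels in userData, the collected records can be grouped by parent page, so the test output lists each page together with the broken links it contains. The following is only a sketch: `groupByParent`, the `"(start URL)"` label, and the sample URLs are hypothetical, and the records are assumed to have the `{ url, parentUrl }` shape collected by the handler above.

```javascript
// Hypothetical helper: groups { url, parentUrl } records by the page that linked to them.
function groupByParent(records) {
  const byParent = new Map();
  for (const { url, parentUrl } of records) {
    // Start URLs carry no userData.parentUrl, so give them an explicit label.
    const key = parentUrl ?? "(start URL)";
    if (!byParent.has(key)) byParent.set(key, []);
    byParent.get(key).push(url);
  }
  return byParent;
}

// Made-up sample records in the shape the handler collects.
const sample = [
  { url: "https://docs.example.com/old-page", parentUrl: "https://docs.example.com/intro" },
  { url: "https://docs.example.com/missing", parentUrl: "https://docs.example.com/intro" },
  { url: "https://docs.example.com/typo", parentUrl: undefined },
];

const grouped = groupByParent(sample);
for (const [parent, urls] of grouped) {
  console.log(`${parent} links to ${urls.length} broken URL(s): ${urls.join(", ")}`);
}
```

In a real run this would presumably be called after the crawl finishes, with the array filled by the request handler instead of `sample`.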

stormy-goldOP•3y ago

Thanks for validating the idea! I just wanted to check if there was an existing method/property I could use out of the box, but this works too.

xenial-black•3y ago

@kobeljic We've been doing some similar things. You might consider maintaining some kind of master list of which pages point to which other pages, even if it's lightweight, so that if multiple pages link to the broken URL you catch them all.
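
A lightweight version of such a master list could be a reverse index from each link target to every page that references it. This matters because, as far as I understand, the request queue deduplicates URLs, so a URL that was already enqueued keeps only the userData of the first page that found it. The sketch below is plain JavaScript with no Crawlee dependency; `referrers`, `recordLinks`, `pagesLinkingTo`, and the example URLs are all hypothetical names chosen for illustration.

```javascript
// Reverse index: target URL -> set of pages that link to it.
const referrers = new Map();

// Hypothetical helper: call once per crawled page with the hrefs found on it.
function recordLinks(pageUrl, hrefs) {
  for (const href of hrefs) {
    let target;
    try {
      // Resolve relative links against the page and drop the #fragment.
      const resolved = new URL(href, pageUrl);
      resolved.hash = "";
      target = resolved.href;
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (!referrers.has(target)) referrers.set(target, new Set());
    referrers.get(target).add(pageUrl);
  }
}

// All pages known to link to a given (broken) URL.
function pagesLinkingTo(url) {
  return [...(referrers.get(url) ?? [])];
}

// Two made-up pages that both link to the same missing page.
recordLinks("https://docs.example.com/intro", ["/missing", "guide#setup"]);
recordLinks("https://docs.example.com/guide", ["/missing"]);
console.log(pagesLinkingTo("https://docs.example.com/missing"));
```

Inside the request handler, the hrefs could be gathered with something like `await page.locator("a[href]").evaluateAll((els) => els.map((a) => a.getAttribute("href")))` before calling `enqueueLinks()`. The keys in the index only match if they are normalized the same way as the `request.url` values recorded for broken pages, so treat that part as an assumption to verify.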
