Crawlee & Apify•6mo ago

Downloading JSON and YAML files while crawling with Playwright

Hi there. Is it possible to detect the Content-Type header of responses and download JSON or YAML files? I'm using Playwright to crawl my sites and have some JSON and YAML content I would like to capture, as well.

11 Replies

foreign-sapphire•6mo ago

Yes of course

xenial-blackOP•6mo ago

Thanks, @Exp. What's the best way of accomplishing that? I'm having trouble finding it in the API docs.

foreign-sapphire•6mo ago

When you use the Playwright crawler in Crawlee, you can capture all HTTPS requests. Did you try it?

xenial-blackOP•6mo ago

Before the request handler fires, Playwright waits for the load event. When the crawler encounters JSON or YAML content types, the load event never fires, of course, and the request is aborted. I’m trying to figure out how I can intercept the crawler before that happens and either capture the content or queue it up for a different crawler.
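
For the "queue it up for a different crawler" idea, one possible pre-filter is to guess from the URL alone, before Playwright ever navigates. This is only a rough sketch: `guessDataTypeFromUrl` is a made-up helper, not a Crawlee or Playwright API, and it assumes the JSON/YAML links actually end in `.json`, `.yaml`, or `.yml`. The Content-Type header remains the reliable signal when URLs carry no extension.

```javascript
// Hypothetical helper (not part of Crawlee or Playwright): guess from the
// URL path whether a link points at a JSON or YAML file.
const DATA_EXTENSIONS = new Map([
  [".json", "json"],
  [".yaml", "yaml"],
  [".yml", "yaml"],
]);

function guessDataTypeFromUrl(url) {
  let pathname;
  try {
    // Query strings and fragments are ignored; only the path is inspected.
    pathname = new URL(url).pathname.toLowerCase();
  } catch {
    return null; // not an absolute URL
  }
  const dot = pathname.lastIndexOf(".");
  // A dot that appears before the last "/" belongs to a directory name.
  if (dot === -1 || dot < pathname.lastIndexOf("/")) return null;
  return DATA_EXTENSIONS.get(pathname.slice(dot)) ?? null;
}
```

URLs for which this returns "json" or "yaml" could be sent to a separate HTTP-based crawler's queue instead of the Playwright one; everything else stays where it is.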

azzouzana•6mo ago

@kevinswiber You can use https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks for this case. Something like this in js:

preNavigationHooks: [
    async (crawlingContext) => {
        crawlingContext.page.on('response', async (response) => {
            if (response.url().includes('......')) {
                // Check the content type here and decide what to do
                const headers = response.headers();
                const body = await response.json(); // or response.text()
            }
        });
    },
]

xenial-blackOP•6mo ago

@azzouzana Thank you! That was helpful. I feel the best solution would probably be creating a new crawler that extends the Playwright crawler. But in the meantime, I'm using this:

  // Assumes `import { readFile } from "node:fs/promises"` and a `dataset`
  // opened elsewhere (e.g. `await Dataset.open()`).
  preNavigationHooks: [
    async ({ request, page, log }) => {
      page.once("response", async (response) => {
        const contentType = response.headers()["content-type"];
        if (
          contentType?.includes("application/json") ||
          contentType?.includes("application/x-yaml") ||
          contentType?.includes("text/yaml") ||
          contentType?.includes("application/yaml")
        ) {
          const type = contentType.includes("json") ? "json" : "yaml";

          // Don't let the crawler wait for a load event or retry the request.
          request.skipNavigation = true;
          request.noRetry = true;

          // The browser treats these responses as downloads; wait for the
          // download, then read the saved file and store its content.
          const download = await page.waitForEvent("download");
          const path = await download.path();

          const content = await readFile(path, "utf8");
          await dataset.pushData({
            url: download.url(),
            contentType,
            filename: download.suggestedFilename(),
            type,
            content,
          });
          log.info(`Downloaded ${type}: ${download.url()}`);
        }
      });
    },
  ],
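
The content-type check in the hook above could also be pulled out into a small helper. This is a minimal sketch, not anything from Crawlee or Playwright: `classifyContentType` is a hypothetical name, and matching the `+json` / `+yaml` suffixes (e.g. `application/problem+json`) is an added assumption beyond the four media types listed in the hook.

```javascript
// Hypothetical helper (not part of Crawlee or Playwright): map a raw
// Content-Type header value to "json", "yaml", or null.
const YAML_TYPES = new Set([
  "application/yaml",
  "application/x-yaml",
  "text/yaml",
]);

function classifyContentType(headerValue) {
  // The header may be missing entirely.
  if (typeof headerValue !== "string") return null;
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mediaType = headerValue.split(";")[0].trim().toLowerCase();
  // Assumption: structured-syntax suffixes count as JSON/YAML too.
  if (mediaType === "application/json" || mediaType.endsWith("+json")) return "json";
  if (YAML_TYPES.has(mediaType) || mediaType.endsWith("+yaml")) return "yaml";
  return null;
}
```

Inside the response listener this would be used as `const type = classifyContentType(response.headers()["content-type"]);`, skipping the download logic when `type` is `null`.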

azzouzana•6mo ago

Glad it helped, @kevinswiber. I recently faced a similar case where some of the requests might return raw JSON. I tried a much simpler approach and it's working (you can inspect the response and check its headers).

xenial-blackOP•6mo ago

@azzouzana Is this using the Playwright crawler?

azzouzana•6mo ago

@kevinswiber You're right. It's using the Cheerio crawler.
