Anyone have any example scraping multiple different websites?

The structure I am using does not look like the best. I am basically creating several routers and then doing something like:
const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: async (ctx) => {
        if (ctx.request.url.includes("url1")) {
            await url1Router(ctx);
        }

        if (ctx.request.url.includes("url2")) {
            await url2Router(ctx);
        }

        if (ctx.request.url.includes("url3")) {
            await url3Router(ctx);
        }

        await Dataset.exportToJSON("data.json");
    },

    // Comment this option to scrape the full website.
    // maxRequestsPerCrawl: 20,
});
This does not seem correct. Anyone with a better way?
6 Replies
Hall
Hall7mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
xenial-black
xenial-black7mo ago
You can use Crawlee's Router: https://crawlee.dev/api/playwright-crawler/function/createPlaywrightRouter. Create a route for each URL, then use labels to identify them.
equal-aqua
equal-aquaOP7mo ago
@Marco, how far is that from what I am doing there? It seems like I will have to do it somewhere anyway. In the example above I did a router per URL: url1Router, url2Router are defined on a per-URL basis. Am I wrong?
xenial-black
xenial-black7mo ago
It's actually very similar. Routes should be defined depending on your needs, so if you need a route per URL, just do that.
equal-aqua
equal-aquaOP7mo ago
My concern is that I have multiple websites, not just different URLs. Each website might have two URLs that I have to scrape independently. Is that how you would do it, @Marco? Would you have multiple routers?
xenial-black
xenial-black7mo ago
Oh, I see. I think I would still use one router, with labels such as "website1-page2", to keep things simple; a function called at the beginning would assign the correct label to each request based on the URL.