Massive Scraper

Hi, I have a (noob) question. I want to crawl many different URLs from different pages, so they need their own crawler implementations (some can share the same one). How can I achieve this in Crawlee so that they run in parallel and can all be executed with a single command, or also in isolation? Input, example repos, etc. would be highly appreciated.
3 Replies
Hall · 6mo ago
Someone will reply to you shortly. In the meantime, this might help:
continuing-cyan · 6mo ago
You gave very little information; developers would have more questions than answers from your message, which I assume is why you didn't get a reply.
Here is a good example of a scraping system implementation with high scalability options and good monitoring tools. If you need a so-called one-time scraper it would be a bad example, but for a long-term project it is one of the best: https://github.com/68publishers/crawler
For future questions I would recommend sticking to these rules: https://stackoverflow.com/help/how-to-ask
eager-peach · 6mo ago
You can have multiple page handlers by using a router. This lets you control which handler processes a page by setting the label property when enqueuing a link. Here's an example:
import {
    Configuration,
    createPlaywrightRouter,
    PlaywrightCrawler,
    type PlaywrightCrawlerOptions,
} from 'crawlee';

const crawlerConfig = new Configuration({
    // config options
});

// Register one handler per label; each request is routed to the handler
// whose name matches the request's `label`.
const router = createPlaywrightRouter();
router.addHandler('label1', label1Handler);
router.addHandler('label2', label2Handler);
router.addHandler('label3', label3Handler);

const crawlerOptions: PlaywrightCrawlerOptions = {
    requestHandler: router,
    // crawler options
};

const crawler = new PlaywrightCrawler(crawlerOptions, crawlerConfig);

// Seed requests carry a label so they are dispatched to the matching handler.
await crawler.run([
    { url: 'url1', label: 'label1' },
]);
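To send newly discovered links to a different handler, set the label when enqueuing from inside a handler. A minimal sketch of one such handler (the selector and label names here are just placeholders):

router.addHandler('label1', async ({ request, enqueueLinks, log }) => {
    log.info(`Processing ${request.url}`);
    // Links matching the selector are added to the queue with the
    // 'label2' label, so the 'label2' handler will process them.
    await enqueueLinks({
        selector: 'a.detail-link', // placeholder selector
        label: 'label2',
    });
});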