Massive Scraper

Hi, I have a (noob) question. I want to crawl many different URLs from different pages, so they need their own crawler implementations (some can share the same one). How can I achieve this in Crawlee so that they run in parallel and can all be executed with a single command, or also in isolation? Input, example repos, etc. would be highly appreciated.
3 Replies
Hall · 6mo ago
Someone will reply to you shortly. In the meantime, this might help:
continuing-cyan · 6mo ago
You gave very little information; developers would have more questions than answers from your message, which I assume is why you didn't get a reply.
Here is a good example of a scraping system implementation with high scalability options and good monitoring tools. If you need a so-called one-time scraper it would be a bad example, but for a long-term project it is one of the best: https://github.com/68publishers/crawler
For future questions I would recommend sticking to these rules: https://stackoverflow.com/help/how-to-ask
eager-peach · 6mo ago
You can have multiple page handlers by using a router. This lets you control which handler processes a page by setting the label property when enqueuing a link. Here's an example:
import {
    Configuration,
    createPlaywrightRouter,
    PlaywrightCrawler,
    type PlaywrightCrawlerOptions,
} from 'crawlee';

const crawlerConfig = new Configuration({
    // config options
});

// Register one handler per label; each request is routed to the handler
// whose name matches the request's `label`.
const router = createPlaywrightRouter();
router.addHandler('label1', label1Handler);
router.addHandler('label2', label2Handler);
router.addHandler('label3', label3Handler);

const crawlerOptions: PlaywrightCrawlerOptions = {
    requestHandler: router,
    // crawler options
};

const crawler = new PlaywrightCrawler(crawlerOptions, crawlerConfig);

// Seed requests carry a label so they are dispatched to the matching handler.
await crawler.run([
    { url: 'url1', label: 'label1' },
]);
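To send newly discovered links to a different handler, set the label when enqueuing from inside a handler. A minimal sketch of one such handler (the selector and label names here are just placeholders):

router.addHandler('label1', async ({ request, enqueueLinks, log }) => {
    log.info(`Processing ${request.url}`);
    // Links matching the selector are added to the queue with the
    // 'label2' label, so the 'label2' handler will process them.
    await enqueueLinks({
        selector: 'a.detail-link', // placeholder selector
        label: 'label2',
    });
});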