Puppeteer crawler: loop over list elements

router.addDefaultHandler(async ({ page, request, enqueueLinks, log }) => {
log.info('enqueueing new URLs');
await enqueueLinks({
selector: "div[role='article'] > a",
label: 'detail', // corresponding to the handler for processing
});
});
How can I crawl the list item data inside div[role='article'], instead of getting the list URLs to add to the queue?
5 Replies
sensitive-blue
sensitive-blue3y ago
Try this:
router.addDefaultHandler(async ({ parseWithCheerio, log }) => {
log.info('scraping data');
const $ = await parseWithCheerio();

const aTags = $('div[role="article"] > a');
const finalData = [];

for (const tag of aTags) {
const elem = $(tag);

const data = {
url: elem.attr('href'),
}

finalData.push(data);
}

console.log(finalData);
});
This will scrape the href attribute from each anchor element. But within the for...of loop, you can do anything.
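If you'd rather stay with Puppeteer's own API instead of parseWithCheerio(), the same extraction can be sketched with page.$$eval. This is a sketch, not the thread's code: toListData is a hypothetical helper, and the selector is the one from the question.

```javascript
// Hypothetical helper: maps raw anchor data ({ href, text }) to result rows.
// Kept as a pure function so it also runs outside the browser context.
function toListData(anchors) {
  return anchors.map(({ href, text }) => ({
    url: href,
    title: text.trim(),
  }));
}

// Inside a PuppeteerCrawler request handler, the browser-side half would
// look roughly like this (`page` comes from the handler context):
//
// const raw = await page.$$eval("div[role='article'] > a", (els) =>
//   els.map((el) => ({ href: el.href, text: el.textContent ?? '' }))
// );
// console.log(toListData(raw));
```

The callback passed to page.$$eval runs inside the browser, so it must only use DOM APIs; keeping the post-processing in a plain Node-side function like toListData makes it easy to test without launching a browser.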
correct-apricot
correct-apricotOP3y ago
But I want to use Puppeteer, because the page is rendered by client-side JS, so I can't use Cheerio.
fascinating-indigo
fascinating-indigo3y ago
You can use the approach above. parseWithCheerio() is just a utility method that allows you to work with the data the same way as with CheerioCrawler: https://crawlee.dev/api/next/puppeteer-crawler/interface/PuppeteerCrawlingContext#parseWithCheerio
sensitive-blue
sensitive-blue3y ago
The code above works with PuppeteerCrawler.
correct-apricot
correct-apricotOP3y ago
thank you
