Puppeteer crawler: loop over list elements

router.addDefaultHandler(async ({ page, request, enqueueLinks, log }) => {
log.info('enqueueing new URLs');
await enqueueLinks({
selector: "div[role='article'] > a",
label: 'detail', // corresponding to the handler for processing
});
});
How can I crawl the list item data inside div[role='article'], instead of getting the list URLs to add to the queue?
5 Replies
sensitive-blue
sensitive-blue3y ago
Try this:
router.addDefaultHandler(async ({ parseWithCheerio, log }) => {
log.info('scraping data');
const $ = await parseWithCheerio();

const aTags = $('div[role="article"] > a');
const finalData = [];

for (const tag of aTags) {
const elem = $(tag);

const data = {
url: elem.attr('href'),
}

finalData.push(data);
}

console.log(finalData);
});
This will scrape the href attribute from each anchor element. But within the for...of loop, you can do anything.
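If you'd rather stay with Puppeteer's own API instead of parseWithCheerio(), the same extraction can be sketched with page.$$eval. This is a sketch, not the thread's code: toListData is a hypothetical helper, and the selector is the one from the question.

```javascript
// Hypothetical helper: maps raw anchor data ({ href, text }) to result rows.
// Kept as a pure function so it also runs outside the browser context.
function toListData(anchors) {
  return anchors.map(({ href, text }) => ({
    url: href,
    title: text.trim(),
  }));
}

// Inside a PuppeteerCrawler request handler, the browser-side half would
// look roughly like this (`page` comes from the handler context):
//
// const raw = await page.$$eval("div[role='article'] > a", (els) =>
//   els.map((el) => ({ href: el.href, text: el.textContent ?? '' }))
// );
// console.log(toListData(raw));
```

The callback passed to page.$$eval runs inside the browser, so it must only use DOM APIs; keeping the post-processing in a plain Node-side function like toListData makes it easy to test without launching a browser.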
correct-apricot
correct-apricotOP3y ago
But I want to use Puppeteer, because the page is rendered by client-side JS, so I can't use Cheerio.
fascinating-indigo
fascinating-indigo3y ago
You can use the approach above. parseWithCheerio() is just a utility method that allows you to work with the data the same way as with CheerioCrawler: https://crawlee.dev/api/next/puppeteer-crawler/interface/PuppeteerCrawlingContext#parseWithCheerio
sensitive-blue
sensitive-blue3y ago
The code above works with PuppeteerCrawler.
correct-apricot
correct-apricotOP3y ago
thank you
