Crawlee & Apify•3y ago

Scraping multiple items on a page

Hello, I haven't used the Apify SDK for many months and I see some things have changed. Please help me by providing a snippet based on this: https://sdk.apify.com/docs/examples/basic-crawler that will visit a URL, create an array from all elements with class .branch, extract the text under class .branch-name, and create a list of JSON files, one for each branch, with its branch name. In the past I made things like that and much more complicated, but I totally forgot. And I can't find a few articles with examples, such as the one that scraped Alexa sites with their ranking. Thank you

9 Replies

tame-yellowOP•3y ago

I found this code which does pretty much what I need, but changing console.log to await Dataset.pushData isn't working. What is the approach here? https://crawlee.dev/docs/introduction/real-world-project

    import { PlaywrightCrawler, Dataset } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ page }) => {
            await page.waitForSelector('.ActorStoreItem');
            const actorTexts = await page.$$eval('.ActorStoreItem', (els) => {
                return els.map((el) => el.textContent);
            });
            actorTexts.forEach((text, i) => {
                console.log(`ACTOR_${i + 1}: ${text}\n`);
            });
        },
    });

    await crawler.run(['https://apify.com/store']);

ratty-blush•3y ago

I don't see where you're using Dataset.pushData, but if you're trying to use it within a forEach, it won't work as expected.
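The reason forEach misbehaves here is that it ignores the promise returned by an async callback, so the handler can finish before the items are saved. A minimal sketch of the for...of alternative follows. The helper name saveSequentially and the pushData parameter are hypothetical; in Crawlee you would presumably pass something like (item) => Dataset.pushData(item).

```javascript
// Hypothetical helper: `pushData` stands in for any async saver,
// e.g. (item) => Dataset.pushData(item) in a Crawlee request handler.
async function saveSequentially(texts, pushData) {
    // for...of waits for each await before moving on, unlike forEach,
    // which discards the promises returned by its callback.
    for (const [i, text] of texts.entries()) {
        await pushData({ name: `ACTOR_${i + 1}`, text });
    }
}
```

Since Dataset.pushData also accepts an array of objects, mapping the texts to objects and pushing them in a single awaited call should work as well.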

tame-yellowOP•3y ago

Thanks for the reply, I managed to solve this, but I am facing another issue. I do this and it works when the element exists:

    const data = await page.$$eval('.single-branch', $posts => {
        const scrapedData = [];
        $posts.forEach($post => {
            const branch_name = $post.querySelector('.branch-name').innerText;
            scrapedData.push({
                branch_name: branch_name,
            });
        });
        return scrapedData;
    });

But if it doesn't find this:

    $post.querySelector('.branch-name')

it returns the error: "Cannot read properties of null (reading 'innerText')"

How can I change it so it will return null instead of the error if the element does not exist? In some .single-branch elements it finds it and in some it doesn't. If it finds it, I need to get the innerText; if it doesn't, it needs to be null.
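One way to get null instead of the error, without jQuery, is optional chaining combined with the nullish coalescing operator. This is a minimal sketch, assuming the page runs in a reasonably recent browser; the name extractBranches is hypothetical, and the function is meant to be passed as the callback to page.$$eval('.single-branch', extractBranches).

```javascript
// Callback for page.$$eval('.single-branch', extractBranches).
// It is kept self-contained because Playwright serializes it into the page.
const extractBranches = ($posts) => $posts.map(($post) => ({
    // ?. short-circuits to undefined when querySelector returns null,
    // and ?? converts that undefined into an explicit null.
    branch_name: $post.querySelector('.branch-name')?.innerText ?? null,
}));
```

Using map also removes the need for the intermediate scrapedData array.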

unwilling-turquoise•3y ago

Use jQuery syntax, e.g. $(selector).text(). That doesn't crash if the element is not found.

tame-yellowOP•3y ago

Lukas, thank you very much! You helped me a lot. I am able to achieve what I need using 2 lines:

    const day2_1_return = $post.querySelector('.working_hours > p.rows:nth-of-type(2) > span.days');
    const day2_1 = $(day2_1_return).text();

Can this be done in 1 line? I can't use the line below because it returns the result from the first post for all the other posts:

    $('.working_hours > p.rows:nth-of-type(2) > span.days').text(),

And this one shows an error:

    $post.querySelector('.working_hours > p.rows:nth-of-type(2) > span.days').text();
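The last attempt fails because querySelector returns a plain DOM element, which has no .text() method. The global $(selector) form searches the whole document, which is why every post gets the first match. A possible one-liner is to wrap the post in jQuery and search inside it with .find(). The sketch below assumes jQuery is already available in the page; the function name daysText is hypothetical, and $ is taken as a parameter only to keep the example self-contained.

```javascript
// Inside page.$$eval, `$` would be the page's global jQuery; it is a
// parameter here only so the sketch can run on its own.
const daysText = ($, $post) =>
    // Scope the lookup to this post; .text() on an empty match returns ''.
    $($post).find('.working_hours > p.rows:nth-of-type(2) > span.days').text();
```

Inline, the body of the function is the one line being asked for: $($post).find('...').text().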

ratty-blush•3y ago

Perhaps .first() could help you out? https://api.jquery.com/first/

tame-yellowOP•3y ago

Thanks, I tried that also but couldn't make it work in 1 line

ratty-blush•3y ago

Why does it need to be 1 line?

tame-yellowOP•3y ago

Doesn't have to, it's just that for very big files it can look nicer. I have another question, please: if there is a tag with in it, can I split each to a different name? For example Sunday:9:00-17:00 Monday:8:00-17:00 Tuesday:9:00-17:00 and so on...
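A possible approach to the splitting question is to read the element's text once and split it in plain JavaScript. This is a minimal sketch, assuming each entry looks like Day:hours, entries are separated by whitespace or line breaks, and there are no spaces inside an entry. The function name parseOpeningHours is hypothetical.

```javascript
// Turns "Sunday:9:00-17:00 Monday:8:00-17:00" into
// { Sunday: '9:00-17:00', Monday: '8:00-17:00' }.
function parseOpeningHours(text) {
    const hours = {};
    for (const entry of text.split(/\s+/)) {
        // Split on the first colon only, since the hours contain colons too.
        const idx = entry.indexOf(':');
        if (idx === -1) continue; // skip empty or malformed pieces
        hours[entry.slice(0, idx)] = entry.slice(idx + 1);
    }
    return hours;
}
```

Each day then becomes its own key, which can be spread into the object that gets pushed to the dataset.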
