Scraping multiple items on a page
Hello,
I haven't used Apify SDK for many months and I see some things changed, please help me by providing a snippet based on this:
https://sdk.apify.com/docs/examples/basic-crawler
that will visit a URL, create an array from all elements with class .branch, extract the text under class .branch-name, and create a list of JSON files, one per branch, each with its branch name.
In the past I built things like that, and much more complicated ones, but I have totally forgotten how. I also can't find a few articles with examples, such as the one that scraped Alexa sites with their ranking.
Thank you
9 Replies
tame-yellowOP•3y ago
I found this code, which does pretty much what I need, but changing console.log to await Dataset.pushData isn't working. What is the right approach here? https://crawlee.dev/docs/introduction/real-world-project
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        await page.waitForSelector('.ActorStoreItem');
        const actorTexts = await page.$$eval('.ActorStoreItem', (els) => {
            return els.map((el) => el.textContent);
        });
        actorTexts.forEach((text, i) => {
            console.log(`ACTOR_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://apify.com/store']);
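A possible reason the swap from console.log to await Dataset.pushData misbehaves is the forEach loop: forEach does not wait for promises returned by an async callback. The sketch below illustrates this under assumptions. makeStore and its pushData are hypothetical stand-ins for Dataset.pushData (not part of Crawlee), used only so the example runs on its own.

```javascript
// Hypothetical stand-in for Dataset.pushData so the sketch runs without Crawlee:
// it stores the item after a short delay and returns a promise, like an async write.
function makeStore() {
    const items = [];
    const pushData = (item) =>
        new Promise((resolve) => {
            setTimeout(() => {
                items.push(item);
                resolve();
            }, 5);
        });
    return { items, pushData };
}

// forEach ignores the promises returned by an async callback,
// so this function returns before any item has been stored.
async function saveWithForEach(texts, store) {
    texts.forEach(async (text, i) => {
        await store.pushData({ index: i + 1, text });
    });
    return store.items.length;
}

// for...of waits for each pushData call before starting the next iteration.
async function saveWithForOf(texts, store) {
    for (const [i, text] of texts.entries()) {
        await store.pushData({ index: i + 1, text });
    }
    return store.items.length;
}
```

In the request handler, the same idea would be a for...of loop over actorTexts with await Dataset.pushData(...) inside it. Dataset.pushData also accepts an array of objects, so mapping actorTexts to objects and pushing them in one call avoids the loop entirely.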
ratty-blush•3y ago
I don't see where you're using Dataset.pushData, but if you're trying to use it within a forEach, it won't work as expected.
tame-yellowOP•3y ago
Thanks for the reply, I managed to solve this, but I am facing another issue.
I do this, and it works when the element exists:
const data = await page.$$eval('.single-branch', $posts => {
    const scrapedData = [];
    $posts.forEach($post => {
        const branch_name = $post.querySelector('.branch-name').innerText;
        scrapedData.push({
            branch_name: branch_name,
        });
    });
    return scrapedData;
});
But if it doesn't find this:
$post.querySelector('.branch-name')
It returns the error:
"Cannot read properties of null (reading 'innerText')"
How can I change it so it returns null instead of throwing the error when the element does not exist?
In some .single-branch elements it finds the element and in some it doesn't.
If it finds it, I need to get the innerText; if it doesn't, the value needs to be null.
unwilling-turquoise•3y ago
Use jQuery syntax, e.g. $(selector).text(). That doesn't crash if the element is not found.
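A jQuery-free alternative, sketched under the assumption that the code runs where standard DOM APIs are available (for example inside the page.$$eval callback): optional chaining (?.) stops at a null result from querySelector instead of throwing, and ?? null turns the resulting undefined into null. The helper name textOrNull and the two stand-in objects are made up for illustration.

```javascript
// Returns the element's innerText, or null when the selector matches nothing.
// `root` is anything with a querySelector method (a DOM element in the browser).
function textOrNull(root, selector) {
    return root.querySelector(selector)?.innerText ?? null;
}

// Made-up stand-ins for two .single-branch elements so the sketch runs outside
// a browser; inside page.$$eval these would be real elements.
const withName = { querySelector: () => ({ innerText: 'Example branch' }) };
const withoutName = { querySelector: () => null };

const scrapedData = [withName, withoutName].map(($post) => ({
    branch_name: textOrNull($post, '.branch-name'),
}));
console.log(scrapedData);
```

Because the $$eval callback runs in the browser, a helper like this would have to be defined inside the callback, or the expression can simply be written inline for each field.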
tame-yellowOP•3y ago
Lukas, thank you very much! You helped me a lot. I am able to achieve what I need using 2 lines:
const day2_1_return = $post.querySelector('.working_hours > p.rows:nth-of-type(2) > span.days');
const day2_1 = $(day2_1_return).text();
Can this be done in 1 line?
I can't use the line below because it returns the result from the first post for all the other posts:
$('.working_hours > p.rows:nth-of-type(2) > span.days').text(),
And this one shows an error:
$post.querySelector('.working_hours > p.rows:nth-of-type(2) > span.days').text();
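Some background on the two failing lines: $(selector) without a context searches the whole document, which is why every post gets the first match, and querySelector returns a plain DOM element, which has no .text() method. With jQuery, a scoped one-liner is $($post).find(selector).text(). For many fields, another option is a map from field name to selector, so each field takes one line. The sketch below assumes plain DOM APIs; FIELDS and extractFields are hypothetical names, and the selectors are the ones quoted earlier in this thread.

```javascript
// Field name -> selector, each resolved relative to a single post element.
const FIELDS = {
    branch_name: '.branch-name',
    day2_1: '.working_hours > p.rows:nth-of-type(2) > span.days',
};

// Builds one record for a post; a missing element yields null instead of throwing.
function extractFields($post, fields = FIELDS) {
    return Object.fromEntries(
        Object.entries(fields).map(([name, selector]) => [
            name,
            $post.querySelector(selector)?.textContent.trim() ?? null,
        ]),
    );
}
```

Inside the page.$$eval callback this would be used as $posts.map(($post) => extractFields($post)), with FIELDS and extractFields defined inside that callback, since code outside it is not visible in the browser context.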
ratty-blush•3y ago
tame-yellowOP•3y ago
Thanks, I tried that also but couldn't make it work in 1 line
ratty-blush•3y ago
Why does it need to be 1 line?
tame-yellowOP•3y ago
It doesn't have to, it's just that for very big files it can look nicer.
I have another question, please: if there is a <p> tag with <br> in it, can I split the text at each <br> into a separately named field?
For example <p>Sunday:9:00-17:00<br>Monday:8:00-17:00<br>Tuesday:9:00-17:00</p> and so on...
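One way to approach this, as a minimal sketch: read the paragraph's innerHTML, split it on <br>, and split each piece at its first colon into a name and a value. The function name parseHours is hypothetical, and the sketch assumes every line has the form Name:value as in the example above.

```javascript
// Splits the inner HTML of a <p> on <br> tags and turns each "Name:value"
// line into a property. Only the first colon separates name from value,
// so times such as 9:00-17:00 stay intact.
function parseHours(innerHtml) {
    const result = {};
    for (const part of innerHtml.split(/<br\s*\/?>/i)) {
        const line = part.trim();
        if (!line) continue; // skip empty pieces, e.g. after a trailing <br>
        const idx = line.indexOf(':');
        if (idx === -1) continue; // skip lines without a name
        result[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
    return result;
}

const hours = parseHours('Sunday:9:00-17:00<br>Monday:8:00-17:00<br>Tuesday:9:00-17:00');
console.log(hours);
// { Sunday: '9:00-17:00', Monday: '8:00-17:00', Tuesday: '9:00-17:00' }
```

In the scraper, the input would come from something like $post.querySelector('p')?.innerHTML ?? '', where the 'p' selector is a placeholder for whatever matches the paragraph on the real page.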