Taking a list of scraped URLs and conducting multiple new scrapes

I have code that scrapes product URLs from an Amazon results page. I can successfully scrape the product URLs, but I'm unable to take each link and scrape the needed info in another crawler. Do I need another Cheerio router? Also, how can I take each scraped link, add it to a RequestList or RequestQueue, and then take the URLs in that queue and scrape their information?
5 Replies
harsh-harlequin (OP) · 3y ago
here are the codes:

main.js:
```js
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const searchKeywords = 'computers'; // Replace with desired search keywords
const searchUrl = `https://www.amazon.com/s?k=${searchKeywords}`;
const startUrls = [searchUrl];

const crawler = new CheerioCrawler({
    // Start the crawler right away and ensure there will always be 5 concurrent requests ran at any time
    minConcurrency: 5,
    // Ensure the crawler doesn't exceed 15 concurrent requests ran at any time
    maxConcurrency: 15,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,
    // Define router to run crawl
    requestHandler: router,
});

await crawler.run(startUrls);
```

routes.js:
```js
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
import fs from 'fs';

export const router = createCheerioRouter();
const linkArray = [];

router.addHandler(async ({ $ }) => {
    // Scrape product links from search results page
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    console.log(`Found ${productLinks.length} product links`);

    // Add each product link to array (this is inside router[01])
    for (const link of productLinks) {
        const router02 = createCheerioRouter();
        router02.addDefaultHandler(async ({ $ }) => {
            const productInfo = {};
            productInfo.storeName = 'Amazon';
            productInfo.productTitle = $('span.a-size-large.product-title-word-break').text().trim();
            productInfo.productDescription = $('div.a-row.a-size-base.a-color-secondary').text().trim();
            productInfo.salePrice = $('span.a-offscreen').text().trim();
            productInfo.originalPrice = $('span.a-price.a-text-price').text().trim();
            productInfo.reviewScore = $('span.a-icon-alt').text().trim();
            productInfo.shippingInfo = $('div.a-row.a-size-base.a-color-secondary.s-align-children-center').text().trim();

            // Write product info to JSON file
            if (productInfoList.length > 0) {
                const rawData = JSON.stringify(productInfo, null, 2);
                fs.appendFile('rawData.json', rawData, (err) => {
                    if (err) throw err;
                    console.log(`Product info written to rawData.json for ${link}`);
                });
            }
        });

        // router02.queue.addRequest({ url: link });
        const amazon = new CheerioCrawler({
            // Start the crawler right away and ensure there will always be 5 concurrent requests ran at any time
            minConcurrency: 1,
            // Ensure the crawler doesn't exceed 15 concurrent requests ran at any time
            maxConcurrency: 10,
            // ...but also ensure the crawler never exceeds 400 requests per minute
            maxRequestsPerMinute: 400,
            // Define route for crawler to run on
            requestHandler: router02,
        });
        await amazon.run(link);
        console.log('running link');
    }
});
```
harsh-harlequin (OP) · 3y ago
here is the console output i receive:
```
INFO  CheerioCrawler: Starting the crawl
Found 36 product links
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Expected requests to be of type array but received type string {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":1}
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":1880,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":3,"requestTotalDurationMillis":1880,"requestsTotal":1,"crawlerRuntimeMillis":18054}
```
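The "Expected requests to be of type array but received type string" warning points at `await amazon.run(link)` inside routes.js: `crawler.run()` takes an array of URLs or request objects, not a single string, so the handler throws and the search-page request gets reclaimed and retried. A minimal illustration of the difference (the product URL here is just a placeholder):
```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => console.log(request.url),
});

// Throws: "Expected requests to be of type array but received type string"
// await crawler.run('https://www.amazon.com/dp/EXAMPLE');

// Works: run() expects an array of URLs or request objects
await crawler.run(['https://www.amazon.com/dp/EXAMPLE']);
```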
harsh-harlequin (OP) · 3y ago
here is the pdf as well with the codes, especially if you are confused by the different indents and which function each block goes under
eager-peach · 3y ago
That's a lot of code, but straight away I see that you're creating a second router. Why? You should use one router per crawler and use different routes within it. You can differentiate them with request.label.

router.addHandler is not correct syntax here - you're not providing a label. It should be either the default handler or router.addHandler('SEARCH_PAGE', async ...), while the first request, instead of just a URL, would be { url: searchUrl, label: 'SEARCH_PAGE' }.

router02.queue.addRequest is also not correct - it should be crawler.addRequests([]), while the crawler is part of the handler context.

Some relevant links:
https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#router
https://crawlee.dev/api/cheerio-crawler/function/createCheerioRouter
https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext
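To make that concrete, here is a minimal sketch of a single-router routes.js along those lines. The 'DETAIL' label name, the reduced set of fields, and the use of Dataset.pushData instead of fs.appendFile are my own choices, not the only way to wire it up; the selectors are taken from the original code and may or may not still match Amazon's markup.
```js
// routes.js - one router, two handlers (sketch; 'DETAIL' is an arbitrary label)
import { createCheerioRouter, Dataset } from 'crawlee';

export const router = createCheerioRouter();

// Default handler: the search results page (the start URL carries no label)
router.addDefaultHandler(async ({ $, crawler, log }) => {
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    log.info(`Found ${productLinks.length} product links`);

    // Enqueue every product page into the same crawler's request queue,
    // tagged with a label so the router knows which handler to run.
    await crawler.addRequests(productLinks.map((url) => ({ url, label: 'DETAIL' })));
});

// Handler for the enqueued product detail pages
router.addHandler('DETAIL', async ({ $, request }) => {
    await Dataset.pushData({
        url: request.url,
        storeName: 'Amazon',
        productTitle: $('span.a-size-large.product-title-word-break').text().trim(),
        salePrice: $('span.a-offscreen').first().text().trim(),
        reviewScore: $('span.a-icon-alt').first().text().trim(),
    });
});
```
main.js then stays a single CheerioCrawler with requestHandler: router and await crawler.run(startUrls). If you prefer labelling the search page too, start the crawl with await crawler.run([{ url: searchUrl, label: 'SEARCH_PAGE' }]) and register router.addHandler('SEARCH_PAGE', ...) instead of the default handler.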
