Taking a list of scraped URLs and conducting multiple new scrapes

I have code that scrapes product URLs from an Amazon results page. I can successfully scrape the product URLs, but I'm unable to take each link and scrape the needed info in another crawler. Do I need another Cheerio router? Also, how can I take each scraped link, add it to a RequestList or RequestQueue, and then take the URLs in that queue and scrape their information?
5 Replies
harsh-harlequin (OP) · 3y ago
here are the codes:

main.js:
```js
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const searchKeywords = 'computers'; // Replace with desired search keywords
const searchUrl = `https://www.amazon.com/s?k=${searchKeywords}`;
const startUrls = [searchUrl];

const crawler = new CheerioCrawler({
    // Start the crawler right away and ensure there will always be 5 concurrent requests ran at any time
    minConcurrency: 5,
    // Ensure the crawler doesn't exceed 15 concurrent requests ran at any time
    maxConcurrency: 15,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,
    // Define router to run crawl
    requestHandler: router,
});

await crawler.run(startUrls);
```

routes.js:
```js
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
import fs from 'fs';

export const router = createCheerioRouter();
const linkArray = [];

router.addHandler(async ({ $ }) => {
    // Scrape product links from search results page
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    console.log(`Found ${productLinks.length} product links`);

    // Add each product link to array (this is inside router[01])
    for (const link of productLinks) {
        const router02 = createCheerioRouter();
        router02.addDefaultHandler(async ({ $ }) => {
            const productInfo = {};
            productInfo.storeName = 'Amazon';
            productInfo.productTitle = $('span.a-size-large.product-title-word-break').text().trim();
            productInfo.productDescription = $('div.a-row.a-size-base.a-color-secondary').text().trim();
            productInfo.salePrice = $('span.a-offscreen').text().trim();
            productInfo.originalPrice = $('span.a-price.a-text-price').text().trim();
            productInfo.reviewScore = $('span.a-icon-alt').text().trim();
            productInfo.shippingInfo = $('div.a-row.a-size-base.a-color-secondary.s-align-children-center').text().trim();

            // Write product info to JSON file
            if (productInfoList.length > 0) {
                const rawData = JSON.stringify(productInfo, null, 2);
                fs.appendFile('rawData.json', rawData, (err) => {
                    if (err) throw err;
                    console.log(`Product info written to rawData.json for ${link}`);
                });
            }
        });

        // router02.queue.addRequest({ url: link });
        const amazon = new CheerioCrawler({
            // Start the crawler right away and ensure there will always be 5 concurrent requests ran at any time
            minConcurrency: 1,
            // Ensure the crawler doesn't exceed 15 concurrent requests ran at any time
            maxConcurrency: 10,
            // ...but also ensure the crawler never exceeds 400 requests per minute
            maxRequestsPerMinute: 400,
            // Define route for crawler to run on
            requestHandler: router02,
        });
        await amazon.run(link);
        console.log('running link');
    }
});
```
harsh-harlequin (OP) · 3y ago
here is the console output i receive:
```
INFO  CheerioCrawler: Starting the crawl
Found 36 product links
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Expected requests to be of type array but received type string {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":1}
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":1880,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":3,"requestTotalDurationMillis":1880,"requestsTotal":1,"crawlerRuntimeMillis":18054}
```
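The "Expected requests to be of type array but received type string" warning points at `await amazon.run(link)` inside routes.js: `crawler.run()` takes an array of URLs or request objects, not a single string, so the handler throws and the search-page request gets reclaimed and retried. A minimal illustration of the difference (the product URL here is just a placeholder):
```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => console.log(request.url),
});

// Throws: "Expected requests to be of type array but received type string"
// await crawler.run('https://www.amazon.com/dp/EXAMPLE');

// Works: run() expects an array of URLs or request objects
await crawler.run(['https://www.amazon.com/dp/EXAMPLE']);
```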
harsh-harlequin (OP) · 3y ago
here is the pdf as well with the codes, especially if you are confused by the different indents and which function each block goes under
eager-peach · 3y ago
That's a lot of code, but straight away I see that you're creating a second router. Why? You should use one router per crawler and use different routes within it. You can differentiate them with request.label.

router.addHandler is not correct syntax here - you're not providing a label. It should be either the default handler or router.addHandler('SEARCH_PAGE', async ...), while the first request, instead of just a URL, would be { url: searchUrl, label: 'SEARCH_PAGE' }.

router02.queue.addRequest is also not correct - it should be crawler.addRequests([]), while the crawler is part of the handler context.

Some relevant links:
https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#router
https://crawlee.dev/api/cheerio-crawler/function/createCheerioRouter
https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext
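To make that concrete, here is a minimal sketch of a single-router routes.js along those lines. The 'DETAIL' label name, the reduced set of fields, and the use of Dataset.pushData instead of fs.appendFile are my own choices, not the only way to wire it up; the selectors are taken from the original code and may or may not still match Amazon's markup.
```js
// routes.js - one router, two handlers (sketch; 'DETAIL' is an arbitrary label)
import { createCheerioRouter, Dataset } from 'crawlee';

export const router = createCheerioRouter();

// Default handler: the search results page (the start URL carries no label)
router.addDefaultHandler(async ({ $, crawler, log }) => {
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    log.info(`Found ${productLinks.length} product links`);

    // Enqueue every product page into the same crawler's request queue,
    // tagged with a label so the router knows which handler to run.
    await crawler.addRequests(productLinks.map((url) => ({ url, label: 'DETAIL' })));
});

// Handler for the enqueued product detail pages
router.addHandler('DETAIL', async ({ $, request }) => {
    await Dataset.pushData({
        url: request.url,
        storeName: 'Amazon',
        productTitle: $('span.a-size-large.product-title-word-break').text().trim(),
        salePrice: $('span.a-offscreen').first().text().trim(),
        reviewScore: $('span.a-icon-alt').first().text().trim(),
    });
});
```
main.js then stays a single CheerioCrawler with requestHandler: router and await crawler.run(startUrls). If you prefer labelling the search page too, start the crawl with await crawler.run([{ url: searchUrl, label: 'SEARCH_PAGE' }]) and register router.addHandler('SEARCH_PAGE', ...) instead of the default handler.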
