infinite scrolling of pages

i have a crawler that goes through stores' collection pages, scrapes their product links, and then visits each product page link to get the product data. When collecting the product links on the collection pages, many sites use infinite scrolling to render in all the products. How do I implement infinite scrolling in this specific crawler route handler below, so that all the products get rendered and I'm sure I scraped every product URL on the page?
kotnRouter.addHandler('KOTN_DETAIL', async ({ page, log }) => {
  log.info('Scraping product URLs');

  await page.goto(page.url(), { waitUntil: 'domcontentloaded' });

  const productUrls: string[] = [];

  const links = await page.$$eval('a', (elements) =>
    elements.map((el) => el.getAttribute('href'))
  );

  for (const link of links) {
    if (link && !link.startsWith('https://')) {
      const productUrl = 'https://www.kotn.com' + link;
      if (productUrl.includes('/products')) {
        productUrls.push(productUrl);
      }
    }
  }

  // Push unique URLs to the dataset
  const uniqueProductUrls = Array.from(new Set(productUrls));
  console.log(uniqueProductUrls);
  await Dataset.pushData({
    urls: uniqueProductUrls,
  });

  await Promise.all(
    uniqueProductUrls.map((link) =>
      kotnCrawler.addRequests([{ url: link, label: 'KOTN_PRODUCT' }])
    )
  );

  linksCount += uniqueProductUrls.length;

  console.log(uniqueProductUrls);
  console.log(`Total product links scraped so far: ${linksCount}`);
});
6 Replies
xenophobic-harlequin (OP) · 2y ago
(PLAYWRIGHT crawler btw)
eastern-cyan · 2y ago
playwrightUtils | API | Crawlee
A namespace that contains various utilities for Playwright - the headless Chrome Node API.
xenophobic-harlequin (OP) · 2y ago
I'm not sure how to implement those playwrightUtils helpers properly to keep scrolling incrementally and use them in my router. Sorry, I'm not as experienced with the utils.
optimistic-gold · 2y ago
Here is an example of how to use it. It's using Puppeteer, but it works exactly the same with Playwright; scroll down to the infiniteScroll example: https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages#scraping-dynamic-content
How to scrape from dynamic pages | Academy | Apify Documentation
Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
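
For reference, a minimal sketch of how the infiniteScroll helper from that example could be wired into the handler from the question. It assumes Crawlee's playwrightUtils.infiniteScroll; the timeoutSecs/waitForSecs values are illustrative, and the explicit page.goto is dropped because a Crawlee handler only runs after the crawler has already navigated:

```ts
import { Dataset, playwrightUtils } from 'crawlee';

kotnRouter.addHandler('KOTN_DETAIL', async ({ page, log }) => {
  log.info('Scraping product URLs');

  // Keep scrolling until no new content loads, so lazily rendered
  // products are in the DOM before we read the links.
  await playwrightUtils.infiniteScroll(page, {
    timeoutSecs: 0, // 0 = no hard time limit on scrolling
    waitForSecs: 4, // stop once no new content has loaded for ~4s
  });

  // Only collect the links after scrolling has finished.
  const links = await page.$$eval('a', (elements) =>
    elements.map((el) => el.getAttribute('href'))
  );

  const productUrls = links
    .filter((link): link is string => !!link && !link.startsWith('https://'))
    .map((link) => 'https://www.kotn.com' + link)
    .filter((url) => url.includes('/products'));

  const uniqueProductUrls = [...new Set(productUrls)];
  await Dataset.pushData({ urls: uniqueProductUrls });

  // One addRequests call with the whole batch instead of one per URL.
  await kotnCrawler.addRequests(
    uniqueProductUrls.map((url) => ({ url, label: 'KOTN_PRODUCT' }))
  );
});
```

Crawlee also exposes the same helper on the crawling context, so destructuring `async ({ page, log, infiniteScroll })` and calling `await infiniteScroll()` may work as well, depending on the Crawlee version.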
xenophobic-harlequin (OP) · 2y ago
Thanks! Does this implement the scroll and pause too?
optimistic-gold · 2y ago

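On the scroll-and-pause question: as far as I can tell from the docs, infiniteScroll handles the pausing itself, repeatedly scrolling and waiting for new content before continuing. A sketch of the options that control this behaviour; the values, the .product-card selector, and the stop condition are illustrative assumptions, not something from this thread:

```ts
import { playwrightUtils } from 'crawlee';
import type { Page } from 'playwright';

// Runs inside a handler, where `page` comes from the crawling context.
async function scrollAllProducts(page: Page) {
  await playwrightUtils.infiniteScroll(page, {
    waitForSecs: 5,        // pause until no new content has loaded for ~5s
    scrollDownAndUp: true, // scroll up a bit after each scroll down; some
                           // sites only trigger lazy loading on that movement
    stopScrollCallback: async () => {
      // Called after every scroll; returning true stops the scrolling.
      // Hypothetical early exit once "enough" products are rendered.
      const count = await page.locator('.product-card').count();
      return count >= 500;
    },
  });
}
```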