enqueueLinks with pagination

How can I use pagination with a route? I have a route that I call to get a list of cards whose links I add to the request queue, and then I need to paginate to the next page using the same route. My guess is to use router.call(), but I am not sure what to pass. I also tried a pattern like https://dk.trustpilot.com/categories/*?page=*, but that does not work either. page=0 is a 404, so I need to start from 1 and go up.
5 Replies
environmental-rose · 3y ago
The better option is, instead of using enqueueLinks, to grab the final page number (in a pagination list this is usually available), create a range from 2 to lastPageNumber, and generate a RequestOptions object for each number. Then simply add all the requests with crawler.addRequests(). The range should start at 2 so that your queueing logic runs only once and your scraping logic runs every other time. Here is a full example I built that scrapes all the pages of https://dk.trustpilot.com/categories/craftsman:
import { CheerioCrawler, Dataset } from 'crawlee';
import type { RequestOptions } from 'crawlee';

enum Selectors {
    CARD = 'div[class*="BusinessListWrapper"] > div',
    TITLE = 'p[class*="typography_heading"]',
    LAST_PAGE = 'nav[class*="pagination"] > a:nth-last-child(2) > span',
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, crawler, request: { url, userData }, log }) => {
        const { page } = userData as { page: number };

        log.info(url);

        // Only the first request has no `page` in its userData,
        // so the queueing logic below runs exactly once
        if (!page) {
            const lastPage = +$(Selectors.LAST_PAGE).text().trim();
            // Range from 2 to lastPage
            const range = [...Array(lastPage + 1).keys()].slice(2);

            // Creates a pagination request for each page number to scrape
            const requests: RequestOptions[] = range.map((num) => ({
                url: `${url}?page=${num}`,
                userData: {
                    page: num,
                },
            }));

            // Queues all the requests at once
            await crawler.addRequests(requests);
        }

        // Scrapes the title from each card
        const items = Array.from($(Selectors.CARD)).map((item) => ({
            title: $(item).find(Selectors.TITLE).text().trim(),
            page: page || 1,
        }));

        await Dataset.pushData(items);
    },
});

await crawler.run(['https://dk.trustpilot.com/categories/craftsman']);
environmental-rose · 3y ago
But if you don't want to go with that method, you can just use the regexps option in enqueueLinks to get the same result:
import { CheerioCrawler, Dataset } from 'crawlee';

enum Selectors {
    CARD = 'div[class*="BusinessListWrapper"] > div',
    TITLE = 'p[class*="typography_heading"]',
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request: { url }, log, enqueueLinks }) => {
        // enqueueLinks does not set userData, so derive the page number from the URL
        const page = Number(new URL(url).searchParams.get('page')) || 1;

        log.info(url);

        if (page === 1) {
            // Enqueue every pagination link except page 1, which is already being scraped
            await enqueueLinks({
                regexps: [/https:\/\/dk\.trustpilot\.com\/categories\/craftsman\?page=(?!1$)\d+$/],
            });
        }

        // Scrapes the title from each card
        const items = Array.from($(Selectors.CARD)).map((item) => ({
            title: $(item).find(Selectors.TITLE).text().trim(),
            page,
        }));

        await Dataset.pushData(items);
    },
});

await crawler.run(['https://dk.trustpilot.com/categories/craftsman']);
grumpy-cyan (OP) · 3y ago
These are nice suggestions, but would it be possible to do https://dk.trustpilot.com/categories/*?page=1 => infinite (until the next-page button no longer exists)? Also, I have implemented your code, but since I am running it with a router, the crawler is not available (it lives in main.ts), so is it possible to use RequestQueue or something similar?
environmental-rose · 3y ago
Why do you want to do the implementation above? It is less practical to enqueue the next page on every single request. Also, the “crawler” object is available in the context object passed to a router handler.
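For reference, a minimal sketch of what that looks like: a router handler receives the same crawling context as a plain requestHandler, so crawler (and enqueueLinks) can be destructured right inside it. The next-button selector and the follow-up request below are illustrative assumptions, not taken from the examples above.

import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// The handler gets the same context object as requestHandler,
// so `crawler` is available here without importing it from main.ts
router.addDefaultHandler(async ({ $, crawler, request, log }) => {
    log.info(`Scraping ${request.url}`);

    // Hypothetical "follow the next-page button while it exists" pattern;
    // the selector is an assumption, not a real Trustpilot selector
    const nextHref = $('a[rel="next"]').attr('href');
    if (nextHref) {
        await crawler.addRequests([new URL(nextHref, request.url).href]);
    }
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://dk.trustpilot.com/categories/craftsman']);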
grumpy-cyan (OP) · 3y ago
I thought it was the easiest method of iterating the pages for each category. However, the first solution you provided seems to work. I need to let it run for some time to see if results from the later pages appear in the dataset. Ah yeah, I missed the crawler object in the router. Thanks!
