enqueueLinks with pagination

How can I use pagination with a route? I have a route that I call to get a list of cards whose links I add to the request queue, and then I need to paginate to the next page using the same route. My guess is to use router.call(), but I am not sure what to pass. I also tried a pattern like https://dk.trustpilot.com/categories/*?page=*, but that does not work either. page=0 is a 404, so I need to start from 1 and go up.
5 Replies
environmental-rose · 3y ago
The better option is, instead of using enqueueLinks, to grab the final page number (in a pagination list this is usually available), create a range from 2 to lastPageNumber, and generate a RequestOptions object for each number. Then simply add all the requests with crawler.addRequests(). The range should start at 2 so that your queueing logic runs only once and your scraping logic runs every other time. Here is a full example I built that scrapes all the pages of https://dk.trustpilot.com/categories/craftsman:
import { CheerioCrawler, Dataset } from 'crawlee';
import type { RequestOptions } from 'crawlee';

enum Selectors {
    CARD = 'div[class*="BusinessListWrapper"] > div',
    TITLE = 'p[class*="typography_heading"]',
    LAST_PAGE = 'nav[class*="pagination"] > a:nth-last-child(2) > span',
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, crawler, request: { url, userData }, log }) => {
        const { page } = userData as { page: number };

        log.info(url);

        // Only the first request has no `page` in its userData,
        // so the queueing logic below runs exactly once
        if (!page) {
            const lastPage = +$(Selectors.LAST_PAGE).text().trim();
            // Range from 2 to lastPage
            const range = [...Array(lastPage + 1).keys()].slice(2);

            // Creates a pagination request for each page number to scrape
            const requests: RequestOptions[] = range.map((num) => ({
                url: `${url}?page=${num}`,
                userData: {
                    page: num,
                },
            }));

            // Queues all the requests at once
            await crawler.addRequests(requests);
        }

        // Scrapes the title from each card
        const items = Array.from($(Selectors.CARD)).map((item) => ({
            title: $(item).find(Selectors.TITLE).text().trim(),
            page: page || 1,
        }));

        await Dataset.pushData(items);
    },
});

await crawler.run(['https://dk.trustpilot.com/categories/craftsman']);
environmental-rose · 3y ago
But if you don't want to go with that method, you can just use the regexps option in enqueueLinks to get the same result:
import { CheerioCrawler, Dataset } from 'crawlee';

enum Selectors {
    CARD = 'div[class*="BusinessListWrapper"] > div',
    TITLE = 'p[class*="typography_heading"]',
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request: { url }, log, enqueueLinks }) => {
        // enqueueLinks does not set userData, so derive the page number from the URL
        const page = Number(new URL(url).searchParams.get('page')) || 1;

        log.info(url);

        if (page === 1) {
            // Enqueue every pagination link except page 1, which is already being scraped
            await enqueueLinks({
                regexps: [/https:\/\/dk\.trustpilot\.com\/categories\/craftsman\?page=(?!1$)\d+$/],
            });
        }

        // Scrapes the title from each card
        const items = Array.from($(Selectors.CARD)).map((item) => ({
            title: $(item).find(Selectors.TITLE).text().trim(),
            page,
        }));

        await Dataset.pushData(items);
    },
});

await crawler.run(['https://dk.trustpilot.com/categories/craftsman']);
grumpy-cyan (OP) · 3y ago
These are nice suggestions, but would it be possible to do https://dk.trustpilot.com/categories/*?page=1 => infinite (until the next-page button no longer exists)? Also, I have implemented your code, but since I am running it with a router, the crawler is not available (it lives in main.ts), so is it possible to use RequestQueue or something similar?
environmental-rose · 3y ago
Why do you want to do the implementation above? It is less practical to enqueue the next page on every single request. Also, the “crawler” object is available in the context object passed to a router handler.
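For reference, a minimal sketch of what that looks like: a router handler receives the same crawling context as a plain requestHandler, so crawler (and enqueueLinks) can be destructured right inside it. The next-button selector and the follow-up request below are illustrative assumptions, not taken from the examples above.

import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// The handler gets the same context object as requestHandler,
// so `crawler` is available here without importing it from main.ts
router.addDefaultHandler(async ({ $, crawler, request, log }) => {
    log.info(`Scraping ${request.url}`);

    // Hypothetical "follow the next-page button while it exists" pattern;
    // the selector is an assumption, not a real Trustpilot selector
    const nextHref = $('a[rel="next"]').attr('href');
    if (nextHref) {
        await crawler.addRequests([new URL(nextHref, request.url).href]);
    }
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://dk.trustpilot.com/categories/craftsman']);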
grumpy-cyan (OP) · 3y ago
I thought it was the easiest method of iterating the pages for each category. However, the first solution you provided seems to work. I need to let it run for some time to see if results from the later pages appear in the dataset. Ah yeah, I missed the crawler object in the router. Thanks!
