enqueueLinks with pagination
How can I use pagination with a route? I have a route that I call which returns a list of cards; I add their links to the request queue, and then I need to paginate to the next page using the same route.
My guess is to use router.call(), but I am not sure what to pass.
I also tried a glob like // https://dk.trustpilot.com/categories/*?page=*, but that does not work either. page=0 is a 404, so I need to start from 1 and go up.
environmental-rose•3y ago
The better option is to, instead of using enqueueLinks, grab the final page number (in a pagination list, this is usually available), create a range between 2 and lastPageNumber, then generate a set of RequestOptions for each one. Then simply add all the requests with crawler.addRequests().
The range should start at 2 so that you run your queueing logic only once and run your scraping logic the rest of the time. Here is a full example I built that scrapes all pages on https://dk.trustpilot.com/categories/craftsman
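A minimal sketch of that approach (the helper name, the LIST label, and the ?page= URL shape are my assumptions, not from the thread): it builds plain RequestOptions-style objects for pages 2 through lastPage, which the page-1 handler would then pass to crawler.addRequests().

```typescript
// Hypothetical helper: build plain request objects for pages 2..lastPage of a
// category listing, so the page-1 handler can enqueue them all at once with
//   await crawler.addRequests(buildPageRequests(baseUrl, lastPage));
// The `label` value and `?page=` parameter shape are assumptions.
function buildPageRequests(baseUrl: string, lastPage: number) {
    const requests: { url: string; label: string }[] = [];
    for (let page = 2; page <= lastPage; page++) {
        requests.push({ url: `${baseUrl}?page=${page}`, label: 'LIST' });
    }
    return requests;
}

console.log(buildPageRequests('https://dk.trustpilot.com/categories/craftsman', 4));
```

Keeping the request-building separate from the handler also makes it easy to unit test without running the crawler.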
environmental-rose•3y ago
But if you don't want to go with that method, you can just use the regexps option in enqueueLinks to get the same result:
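For illustration, a sketch of the kind of pattern that option might take (the exact regex is my guess at the URL shape, not from the thread); inside a handler you would pass it as await enqueueLinks({ regexps: [PAGE_RE] }).

```typescript
// Hypothetical pattern for paginated category URLs; page numbers start at 1,
// since page=0 returns a 404. In a Crawlee handler you would pass it as:
//   await enqueueLinks({ regexps: [PAGE_RE] });
const PAGE_RE = /https:\/\/dk\.trustpilot\.com\/categories\/[\w-]+\?page=[1-9]\d*/;

console.log(PAGE_RE.test('https://dk.trustpilot.com/categories/craftsman?page=2')); // prints: true
```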
grumpy-cyanOP•3y ago
Those are some nice suggestions, but would it be possible to do:
https://dk.trustpilot.com/categories/*?page=1 => infinite (i.e. keep going until the next-page button no longer exists)?
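For what it's worth, a sketch of what that "follow the next button" idea could look like (the helper name, label, and selector in the comment are hypothetical): on each list page you check whether a next-page button exists and, while it does, enqueue page n+1.

```typescript
// Hypothetical helper for the "follow the next button" approach: given the
// current listing URL and whether a next-page button was found in the DOM,
// return the request for the following page, or null to stop paginating.
// A handler might call it as:
//   const next = nextPageRequest(request.url, $('.pagination-next').length > 0);
//   if (next) await crawler.addRequests([next]);
function nextPageRequest(currentUrl: string, hasNextButton: boolean) {
    if (!hasNextButton) return null; // next button gone: last page reached
    const url = new URL(currentUrl);
    const page = Number(url.searchParams.get('page') ?? '1');
    url.searchParams.set('page', String(page + 1));
    return { url: url.toString(), label: 'LIST' };
}
```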
Also, I have implemented your code, but since I am running it through a router, the crawler object (which lives in main.ts) is not available, so is it possible to use RequestQueue or something similar?
environmental-rose•3y ago
Why do you want the implementation above? It is less practical to enqueue the next page on every single request.
Also, the crawler object is available in the Context object passed to a router handler.
grumpy-cyanOP•3y ago
I thought it was the easiest method of iterating pages for each category. However, the first solution you provided seems to work. I need to let it run for some time to see whether results from the second pages appear in the dataset.
Ah yeah I missed the crawler object in the router
Thanks