Firecrawl · 11mo ago
Matt

Unable to crawl more than the base page of https://www.trustpilot.com/review/huel.com

The page has pagination at the bottom with direct links that tack "?page=2/3/4/etc" onto the end of the URL. Shouldn't Firecrawl pick up on that? My setup: I've tried 'includePaths': ['page='] and similar variations with no luck. What am I missing?
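For reference, a crawl request of the kind described here might look like the sketch below. This is only an illustration based on Firecrawl's public v1 crawl API and its includePaths option; the endpoint, field names, and API key are assumptions, not the poster's actual setup.

```python
# Illustrative sketch of a Firecrawl v1 crawl request with includePaths,
# roughly matching the setup described above (not the poster's actual code).
import requests

API_KEY = "fc-..."  # placeholder API key

resp = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://www.trustpilot.com/review/huel.com",
        "limit": 10,                # same limit mentioned later in the thread
        "includePaths": ["page="],  # the pattern the poster tried
    },
    timeout=30,
)
# The v1 crawl endpoint returns a job id that is then polled for results.
print(resp.json())
```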
6 Replies
Matt (OP) · 11mo ago
@Moderator edit: in the playground, if I put in the base URL https://www.trustpilot.com/ to crawl, it never returns more than one page (limit set to 10). How are they avoiding the crawler?
rafaelmiller · 11mo ago
Hey @Matt! I did a quick check, and it seems this page has all URLs blocked in its robots.txt file. That's why Firecrawl isn't able to crawl beyond the first page. Sending you the log for this crawl; as you can see, the "isRobotsAllowed" parameter is false for almost all of the found pages.
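For anyone who wants to verify this themselves, a minimal sketch of the same robots.txt check using only Python's standard library is below. The user agent passed to can_fetch is a placeholder; Firecrawl's actual user agent string isn't shown in this thread.

```python
# Minimal sketch: check whether Trustpilot's robots.txt allows the paginated URLs.
# This mirrors the "isRobotsAllowed" check mentioned above.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.trustpilot.com/robots.txt")
rp.read()

for page in range(1, 5):
    url = f"https://www.trustpilot.com/review/huel.com?page={page}"
    # "*" is a generic user agent; substitute the crawler's real UA if known
    print(url, "allowed:", rp.can_fetch("*", url))
```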
Matt (OP) · 11mo ago
That explains it. No way to bypass that, I assume? @rafaelmiller sorry to ping you, mate. Does Firecrawl have a switch to ignore robots? Thanks
MBLRJ · 10mo ago
Hey @Matt, I'm running into a similar problem here. Did you manage to solve it? It would be very helpful, thanks!
BloomFilter · 9mo ago
I wrote a helper script that finds all the pagination links, extracts them, and then scrapes each page individually. Let me know if you want that code. It works perfectly fine.
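BloomFilter didn't post the script itself, but a minimal sketch of that general approach is below, assuming the firecrawl-py SDK's FirecrawlApp.scrape_url method (names vary between SDK versions) and Trustpilot's ?page=N pagination pattern. The original script reportedly extracts the links from the page; for brevity this sketch constructs the paginated URLs directly.

```python
# Sketch of the "scrape each page individually" approach described above.
# Not the original helper script; SDK method names and the page count are assumptions.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")  # placeholder API key

base = "https://www.trustpilot.com/review/huel.com"
pages = [base] + [f"{base}?page={n}" for n in range(2, 6)]  # illustrative page count

results = []
for url in pages:
    # scrape_url fetches a single page; unlike a crawl, it does not follow links
    results.append(app.scrape_url(url))

print(f"Scraped {len(results)} pages")
```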
