Firecrawl · 11mo ago
Matt

Unable to crawl more than the base page of https://www.trustpilot.com/review/huel.com

The page has pagination at the bottom with direct links that tack "?page=2/3/4/etc" onto the end of the URL. Shouldn't Firecrawl pick up on that? My setup: I've tried 'includePaths': ['page='] and similar variations with no luck. What am I missing?
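For reference, a crawl request of the kind described here might look like the sketch below. This is only an illustration based on Firecrawl's public v1 crawl API and its includePaths option; the endpoint, field names, and API key are assumptions, not the poster's actual setup.

```python
# Illustrative sketch of a Firecrawl v1 crawl request with includePaths,
# roughly matching the setup described above (not the poster's actual code).
import requests

API_KEY = "fc-..."  # placeholder API key

resp = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://www.trustpilot.com/review/huel.com",
        "limit": 10,                # same limit mentioned later in the thread
        "includePaths": ["page="],  # the pattern the poster tried
    },
    timeout=30,
)
# The v1 crawl endpoint returns a job id that is then polled for results.
print(resp.json())
```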
6 Replies
Matt (OP) · 11mo ago
@Moderator edit: in the playground, if I put in the base URL https://www.trustpilot.com/ to crawl, it never returns more than one page (limit set to 10). How are they avoiding the crawler?
rafaelmiller · 11mo ago
Hey @Matt! I did a quick check, and it seems this page has all URLs blocked in its robots.txt file. That's why Firecrawl isn't able to crawl beyond the first page. Sending you the log for this crawl; as you can see, the "isRobotsAllowed" parameter is false for almost all of the found pages.
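For anyone who wants to verify this themselves, a minimal sketch of the same robots.txt check using only Python's standard library is below. The user agent passed to can_fetch is a placeholder; Firecrawl's actual user agent string isn't shown in this thread.

```python
# Minimal sketch: check whether Trustpilot's robots.txt allows the paginated URLs.
# This mirrors the "isRobotsAllowed" check mentioned above.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.trustpilot.com/robots.txt")
rp.read()

for page in range(1, 5):
    url = f"https://www.trustpilot.com/review/huel.com?page={page}"
    # "*" is a generic user agent; substitute the crawler's real UA if known
    print(url, "allowed:", rp.can_fetch("*", url))
```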
Matt (OP) · 11mo ago
That explains it. No way to bypass that, I assume? @rafaelmiller sorry to ping you, mate. Does Firecrawl have a switch to ignore robots? Thanks
MBLRJ · 10mo ago
Hey @Matt, I'm running into a similar problem here. Did you manage to solve it? It would be very helpful, thanks!
BloomFilter · 9mo ago
I wrote a helper script that finds all the pagination links, extracts them, and then scrapes each page individually. Let me know if you want that code. It works perfectly fine.
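BloomFilter didn't post the script itself, but a minimal sketch of that general approach is below, assuming the firecrawl-py SDK's FirecrawlApp.scrape_url method (names vary between SDK versions) and Trustpilot's ?page=N pagination pattern. The original script reportedly extracts the links from the page; for brevity this sketch constructs the paginated URLs directly.

```python
# Sketch of the "scrape each page individually" approach described above.
# Not the original helper script; SDK method names and the page count are assumptions.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")  # placeholder API key

base = "https://www.trustpilot.com/review/huel.com"
pages = [base] + [f"{base}?page={n}" for n in range(2, 6)]  # illustrative page count

results = []
for url in pages:
    # scrape_url fetches a single page; unlike a crawl, it does not follow links
    results.append(app.scrape_url(url))

print(f"Scraped {len(results)} pages")
```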
