Unable to crawl more than the base page of https://www.trustpilot.com/review/huel.com
The page has pagination at the bottom with direct links that tack "?page=2", "?page=3", etc. onto the end of the URL. Shouldn't Firecrawl pick up on those? My setup:
I've tried 'includePaths': ['page='] and similar variations with no luck. What am I missing?
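For what it's worth, since the pagination is a predictable query string, the page URLs can be built directly instead of relying on the crawler to discover them (the page count below is a placeholder, not taken from the site):

```python
# Because pagination is just "?page=N" tacked onto the base URL,
# the full URL list can be generated up front. The range end here
# is hypothetical -- the real review count determines the last page.
base = "https://www.trustpilot.com/review/huel.com"
urls = [base] + [f"{base}?page={n}" for n in range(2, 5)]
print(urls)
# ['https://www.trustpilot.com/review/huel.com',
#  'https://www.trustpilot.com/review/huel.com?page=2',
#  'https://www.trustpilot.com/review/huel.com?page=3',
#  'https://www.trustpilot.com/review/huel.com?page=4']
```

Each generated URL can then be passed to a single-page scrape rather than a crawl.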
@Moderator edit: in the playground, if I put in the base URL https://www.trustpilot.com/ to crawl, it never returns more than one page (limit set to 10). How are they blocking the crawler?
Hey @Matt! I did a quick check, and it looks like this site blocks most of its URLs in its robots.txt file. That's why Firecrawl isn't able to crawl beyond the first page.
Sending you the log for this crawl. As you can see, the "isRobotsAllowed" field is false for almost all of the found pages.
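You can verify this kind of blocking yourself with Python's stdlib urllib.robotparser. The rules below are illustrative only, not Trustpilot's actual robots.txt; fetch https://www.trustpilot.com/robots.txt to see the real ones:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only -- a Disallow on /review/ would block every
# review page, including all of its "?page=N" pagination URLs.
rules = [
    "User-agent: *",
    "Disallow: /review/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://www.trustpilot.com/review/huel.com"))         # False
print(rp.can_fetch("*", "https://www.trustpilot.com/review/huel.com?page=2"))  # False
print(rp.can_fetch("*", "https://www.trustpilot.com/"))                        # True
```

A crawler that honors robots.txt will report exactly the pattern in the log: the base page allowed, everything under the disallowed prefix skipped.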
That explains it. No way to bypass that I assume?
@rafaelmiller sorry to ping you mate. Does Firecrawl have a switch to ignore robots? Thanks
Hey @Matt, I'm running into a similar problem here. Did you manage to solve it?
It would be very helpful, thanks!
I wrote a helper script that finds the pagination links, extracts them, and then scrapes each page individually. Let me know if you want the code. Works perfectly fine.
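For anyone who wants the gist without waiting for the code: a minimal sketch of that approach using only the stdlib. It collects the "?page=N" links from the base page's HTML, resolves them against the base URL, and leaves the actual scraping of each URL to whatever scraper you use. The names here are mine, not the original script's:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PaginationLinkCollector(HTMLParser):
    """Collects hrefs from <a> tags that contain a 'page=' query parameter."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and "page=" in value:
                    # Resolve relative hrefs like "?page=2" against the base URL.
                    self.links.add(urljoin(self.base_url, value))

def pagination_links(base_url, html):
    collector = PaginationLinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)

# Stand-in snippet of pagination markup (not real Trustpilot HTML):
html = '<a href="?page=2">2</a><a href="?page=3">3</a>'
base = "https://www.trustpilot.com/review/huel.com"
print(pagination_links(base, html))
# ['https://www.trustpilot.com/review/huel.com?page=2',
#  'https://www.trustpilot.com/review/huel.com?page=3']
```

From there, loop over the returned URLs and call a single-page scrape on each one instead of a crawl.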