Firecrawl•14mo ago

How to see why a crawl path was chosen?

Sometimes /crawl doesn't include the base URL I set in the list of source URLs it returns. Ideally the crawler would start at the home page and then follow some series of URLs afterward. How best to see why it skipped the base URL, or force it to start there?

5 Replies

James PetersonOP•14mo ago

Related: I just got the same URL twice in a crawl list (with default settings), the only difference between the two being a trailing slash and the www. subdomain. How best to prevent it from scraping the same URL twice?

[p.get('metadata').get('sourceURL') for p in crawl_status.get('data')]

['https://www.shippit.com/post-purchase-experiences/', 'https://www.shippit.com/couriers/', 'https://www.shippit.com/ecommerce/', 'https://www.shippit.com/fulfilment-and-optimisation/', 'https://www.shippit.com/track-your-package/', 'https://www.shippit.com/enterprise/', 'https://www.shippit.com/', 'https://www.shippit.com/demo/', 'https://www.shippit.com/shipping-and-delivery/', 'http://shippit.com']

Sachin•13mo ago

Hi @James Peterson, I believe this is related to the website's sitemap and is not a Firecrawl thing. If you look closer, the URLs are different in the sense that one uses the HTTPS protocol while the other uses plain HTTP. This is probably because of how the developers of the website set things up initially. So these are, in a sense, 2 different URLs. I have also seen similar results in the past.

mogery•13mo ago

Hi @James Peterson, Sachin is correct, this is a sitemap/linkage issue. We've been thinking about a good way to deduplicate these, but the issue is that there is no guarantee that www.shippit.com and shippit.com would return the same page. Hell, there isn't even a guarantee that shippit.com and shippit.com/ would return the same page! (Aren't web standards wonderful?) Anyways, this is something that's been at the back of my head for a while, and I'll continue thinking about a way to avoid these duplicate scrapes. Until then, unfortunately, there's no way to do that, unless ignoreSitemap gets better results for you. Skipping the home page sounds odd -- could you grab me your start URL or crawl ID that produced that result?
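Until something like that exists in the crawler, one client-side workaround is to filter the results after the crawl finishes. The sketch below is only an illustration, not a Firecrawl feature: `normalize_url` and `dedupe_pages` are hypothetical helper names, and it assumes each page is a dict with `metadata.sourceURL`, as in the list comprehension earlier in the thread. It also assumes that the http/https, www/non-www, and trailing-slash variants serve the same page, which, as noted above, is not guaranteed.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Build a comparison key that ignores scheme, a leading 'www.', and a trailing slash.

    Assumption: these variants all point at the same page on the target site.
    The port and fragment are dropped as well.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def dedupe_pages(pages):
    """Keep the first page seen for each normalized sourceURL."""
    seen = set()
    unique = []
    for page in pages:
        source_url = (page.get("metadata") or {}).get("sourceURL")
        if source_url is None:
            # No URL to compare on, so keep the page rather than guess.
            unique.append(page)
            continue
        key = normalize_url(source_url)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

With the result from the earlier snippet, this would be called as `dedupe_pages(crawl_status.get('data'))`. It does not stop the duplicate scrape from happening; it only drops the extra entry afterward.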

James PetersonOP•13mo ago

Naively, I think it'd be kinda hard to find a counter-example? As a user, this feels very much like throwing out the baby with the bathwater.

mogery•13mo ago

True, I'll be bringing this up to the team.
