How to see why a crawl path was chosen?
Sometimes /crawl doesn't include the base URL I set in the list of source URLs it returns.
Ideally the crawler would start at the home page + then some series of urls afterward.
How best to see why it skipped the base url / force it to start there?
Related: I just got the same URL twice in a crawl list (with default settings); the only differences between the two are the trailing slash and the www. subdomain. How best to prevent it from scraping the same URL twice?
['https://www.shippit.com/post-purchase-experiences/', 'https://www.shippit.com/couriers/', 'https://www.shippit.com/ecommerce/', 'https://www.shippit.com/fulfilment-and-optimisation/', 'https://www.shippit.com/track-your-package/', 'https://www.shippit.com/enterprise/', 'https://www.shippit.com/', 'https://www.shippit.com/demo/', 'https://www.shippit.com/shipping-and-delivery/', 'http://shippit.com']
Hi @James Peterson, I believe this is related to the website's sitemap and is not a FireCrawl thing.
If you look closer, the URLs also differ in protocol: one uses HTTPS while the other uses plain HTTP.
This may be because of how the website's developers set things up initially. So these are, in a sense, two different URLs.
I have also seen similar results in the past.
Hi @James Peterson, Sachin is correct, this is a sitemap/linkage issue. We've been thinking about a good way to deduplicate these, but the issue is that there is no guarantee that www.shippit.com and shippit.com would return the same page. Hell, there isn't even a guarantee that shippit.com and shippit.com/ would return the same page! (Aren't web standards wonderful?)
Anyways, this is something that's been at the back of my head for a while, and I'll continue thinking about a way to avoid these duplicate scrapes. Until then, unfortunately, there's no way to do that, unless ignoreSitemap gets better results for you.
Skipping the home page sounds odd -- could you grab me your start URL or crawl ID that produced that result?
Naively, I think it'd be kinda hard to find a counter-example? As a user this feels very much like throwing out the baby with the bath water.
True, I'll be bringing this up to the team.
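In the meantime, one possible stopgap is to deduplicate the returned URL list on the client side. The snippet below is only a rough sketch and not part of FireCrawl: the helper names normalize_url and dedupe_urls are made up for illustration, and it deliberately assumes that http/https, www/non-www, and trailing-slash variants all point to the same page. As noted above, that is not guaranteed for every site, so check it for yours before relying on it.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Build a comparison key for a URL (hypothetical helper).

    ASSUMPTION: scheme, a leading "www.", and a trailing slash do not
    change which page is served. This is not true for every site.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port only when it is not a default HTTP(S) port.
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    # Treat "" and "/" as the same path, and "/demo/" like "/demo".
    path = parts.path.rstrip("/") or "/"
    # Force the scheme to https and drop the fragment; keep the query.
    return urlunsplit(("https", host, path, parts.query, ""))


def dedupe_urls(urls):
    """Return urls with later equivalents removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    crawled = [
        "https://www.shippit.com/couriers/",
        "https://www.shippit.com/",
        "http://shippit.com",
    ]
    for url in dedupe_urls(crawled):
        print(url)
```

The first occurrence of each page is kept as-is, so the original URL strings are preserved and only the later equivalents are dropped.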