How to see why a crawl path was chosen?

Sometimes /crawl doesn't include the base URL I set in the list of source URLs it returns. Ideally the crawler would start at the home page and then follow some series of URLs from there. What's the best way to see why it skipped the base URL, or to force it to start there?
5 Replies
James Peterson (OP), 14mo ago
Related: I just got the same URL twice in a crawl list (with default settings); the only differences between the two were a trailing slash and the www. subdomain. How best to prevent it from scraping the same URL twice?
[p.get('metadata').get('sourceURL') for p in crawl_status.get('data')]
['https://www.shippit.com/post-purchase-experiences/', 'https://www.shippit.com/couriers/', 'https://www.shippit.com/ecommerce/', 'https://www.shippit.com/fulfilment-and-optimisation/', 'https://www.shippit.com/track-your-package/', 'https://www.shippit.com/enterprise/', 'https://www.shippit.com/', 'https://www.shippit.com/demo/', 'https://www.shippit.com/shipping-and-delivery/', 'http://shippit.com']
Sachin, 13mo ago
Hi @James Peterson, I believe this is related to the website's sitemap rather than to FireCrawl itself. If you look closer, the URLs differ: one uses the HTTPS protocol while the other uses plain HTTP. This is probably down to how the website's developers set things up initially. So these are, in a sense, two different URLs. I have seen similar results in the past.
mogery, 13mo ago
Hi @James Peterson, Sachin is correct, this is a sitemap/linkage issue. We've been thinking about a good way to deduplicate these, but the issue is that there is no guarantee that www.shippit.com and shippit.com would return the same page. Hell, there isn't even a guarantee that shippit.com and shippit.com/ would return the same page! (Aren't web standards wonderful?) Anyways, this is something that's been at the back of my head for a while, and I'll continue thinking about a way to avoid these duplicate scrapes. Until then, unfortunately, there's no way to do that, unless ignoreSitemap gets better results for you. Skipping the home page sounds odd -- could you grab me your start URL or crawl ID that produced that result?
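Until something like this exists in the crawler, one possible workaround is to filter the results on the client side after the crawl finishes. The following is a minimal sketch, not a FireCrawl feature: `normalize_url` and `dedupe_urls` are hypothetical helper names, and the rule that scheme, a leading `www.`, and a trailing slash don't matter is an assumption that, as noted above, isn't guaranteed to hold for every site.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Build a comparison key that ignores scheme, a leading 'www.',
    and a trailing slash. Heuristic only: the port and fragment are
    dropped as well, and nothing guarantees that two URLs with the
    same key really serve the same page."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def dedupe_urls(urls):
    """Keep the first URL seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


# A few of the URLs from the crawl result posted earlier in the thread.
urls = [
    "https://www.shippit.com/couriers/",
    "https://www.shippit.com/",
    "https://www.shippit.com/demo/",
    "http://shippit.com",
]
unique_urls = dedupe_urls(urls)
```

The same key can also be used to check whether the start URL made it into the results, e.g. `normalize_url(start_url) in {normalize_url(u) for u in urls}`, where `start_url` is whatever was passed to the crawl. Note that this only hides duplicates after the fact; both pages still get scraped.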
James Peterson (OP), 13mo ago
Naively, I think it'd be kinda hard to find a counter-example? As a user, this feels very much like throwing the baby out with the bathwater.
mogery, 13mo ago
True, I'll be bringing this up to the team.
