Crawl endpoint retrieving wrong sourceURL
I’ve found a mismatch in the crawl endpoint. When I crawl a specific website, the sourceURL is returned as if it were the current (resolved) URL. However, when I run the same URL in scrape mode, the response looks correct and the sourceURL is fine.
The URL I’m testing with is: https://forum.tufin.com/support/kc/ext/tm/
On the crawl I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
On the scrape I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/
10 Replies
The sourceURL is the user-provided URL from the scrape request. If you're looking for the resolved URL, that's under the url key.
Yeah, I see what you mean. The issue is that the crawl endpoint isn’t returning the user-provided source URL.
For example, I crawled this website:
https://forum.tufin.com/support/kc/ext/tm/
But the sourceURL came back as:
https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
In this case, shouldn’t the sourceURL be the original one provided:
https://forum.tufin.com/support/kc/ext/?
The difference there is that when you do a crawl, you are only specifying a starting URL of the first page, not all of the other webpages.
But what should the sourceURL of the first page be? The one that the user specified, right?
Firecrawl Docs
Crawl | Firecrawl
Firecrawl can recursively search through a urls subdomains, and gather the content

@Gaurav Chadha, the link you sent doesn't specify anything about sourceURL.
Based on the docs, the sourceURL is normally used to identify the original source URL, because some sites redirect you, e.g. firecrawl.dev to www.firecrawl.dev.
This is exactly my case.
I crawled this URL: https://forum.tufin.com/support/kc/ext/tm/ and the sourceURL for this link became this one: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
This basically happens because https://forum.tufin.com/support/kc/ext/tm/ redirects to the other page, and in the process the sourceURL is being overridden.
If you try to crawl this website, you will understand
I’m pretty sure I tested this URL before, and it was returning the correct sourceURL
Can you please share the payload you're sending?
This is the body:
hitting the endpoint v2/crawl
If you try a crawl in the playground, the same thing happens.
Is there any workaround for this? It's breaking production.
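One possible client-side workaround, sketched under assumptions rather than taken from the Firecrawl docs: since scrape reportedly returns the user-provided URL in sourceURL and the resolved URL under the url key, a single scrape of the start URL can be used to learn the redirect, and the crawl results can then be patched locally. The sketch below works on plain dictionaries and makes no API calls. The `metadata` layout of each page, the helper names `build_redirect_map` and `restore_source_urls`, and the extra `resolvedSourceURL` key are all hypothetical and chosen for illustration only.

```python
from typing import Any


def build_redirect_map(scrape_metadata: dict[str, Any]) -> dict[str, str]:
    """Map the resolved URL back to the user-provided URL.

    Assumes scrape metadata carries the requested URL in "sourceURL"
    and the resolved URL in "url", as described in this thread.
    """
    requested = scrape_metadata.get("sourceURL")
    resolved = scrape_metadata.get("url")
    if requested and resolved and requested != resolved:
        return {resolved: requested}
    return {}


def restore_source_urls(
    pages: list[dict[str, Any]], redirect_map: dict[str, str]
) -> list[dict[str, Any]]:
    """Return copies of crawl pages with redirected sourceURLs restored."""
    fixed = []
    for page in pages:
        metadata = dict(page.get("metadata", {}))  # copy, do not mutate input
        source = metadata.get("sourceURL")
        if source in redirect_map:
            # Keep the redirect target under a hypothetical extra key.
            metadata["resolvedSourceURL"] = source
            metadata["sourceURL"] = redirect_map[source]
        fixed.append({**page, "metadata": metadata})
    return fixed


# Example with the URLs from this thread (hand-written data, not real output).
start_url = "https://forum.tufin.com/support/kc/ext/tm/"
landing_url = "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm"

scrape_metadata = {"sourceURL": start_url, "url": landing_url}
crawl_pages = [{"markdown": "intro page", "metadata": {"sourceURL": landing_url}}]

fixed_pages = restore_source_urls(crawl_pages, build_redirect_map(scrape_metadata))
print(fixed_pages[0]["metadata"]["sourceURL"])  # prints the original start URL
```

This only patches the pages whose sourceURL matches the known redirect target, and it relies on scrape actually reporting the redirect in the url key for this site, which is worth verifying before depending on it in production.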
It's because if you open this URL directly in the browser, https://forum.tufin.com/support/kc/ext/tm/, it redirects to https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm. This is done by the website, so the sourceURL is being redirected. It is not controlled by Firecrawl.
Hope this clears the question about sourceURL
Also, the payload you shared is wrong. Use this corrected payload:
But isn't the sourceURL originally meant to hold the URL from before the redirect?
But why does it work in the scrape endpoint?