Crawl endpoint retrieving wrong sourceURL
I’ve found a mismatch in the crawl endpoint. When I crawl a specific website, the sourceURL is returned as if it were the current (resolved) URL. However, when I run the same URL in scrape mode, the response looks correct and the sourceURL is fine.
The URL I’m testing with is: https://forum.tufin.com/support/kc/ext/tm/
On the crawl I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
On the scrape I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/
8 Replies
The sourceURL is the user-provided URL from the scrape request. If you're looking for the resolved URL, that's under the url key.
Yeah, I see what you mean. The issue is that the crawl endpoint isn’t returning the user-provided source URL.
For example, I crawled this website:
https://forum.tufin.com/support/kc/ext/tm/
But the sourceURL came back as:
https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
In this case, shouldn’t the sourceURL be the original one provided —
https://forum.tufin.com/support/kc/ext/?
The difference there is that when you do a crawl, you are only specifying a starting URL of the first page, not all of the other webpages.
But what should the sourceURL of the first page be? The one the user specified, shouldn't it?
Firecrawl Docs
Crawl | Firecrawl
Firecrawl can recursively search through a urls subdomains, and gather the content

@Gaurav Chadha, there is nothing about sourceURL in the link you sent.
Based on the docs, the sourceURL is normally used to identify the original source URL, because some sites redirect you, for example from firecrawl.dev to www.firecrawl.dev.
This is exactly my case.
I crawled this URL: https://forum.tufin.com/support/kc/ext/tm/ and the sourceURL for this link became this one: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
This basically happens because https://forum.tufin.com/support/kc/ext/tm/ redirects to the other page, and in the process the sourceURL gets overridden.
If you try to crawl this website, you will understand
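To make the mismatch easy to confirm on any crawl result, here is a minimal sketch that checks whether any returned page still reports the start URL as its sourceURL. It assumes each page is a dict with a "metadata" dict containing "sourceURL", which is how I read the responses; the helper names are made up for illustration and are not part of any SDK.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host and drop a trailing slash so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def start_url_reported(pages, start_url):
    """Return True if at least one page reports start_url as its sourceURL.

    `pages` is assumed to be a list of dicts shaped like
    {"metadata": {"sourceURL": "..."}} (an assumption about the response shape).
    """
    target = normalize(start_url)
    return any(
        normalize(page.get("metadata", {}).get("sourceURL", "")) == target
        for page in pages
    )


# Values taken from this thread: the crawl result only reports the redirect target.
crawl_pages = [
    {"metadata": {"sourceURL": "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm"}}
]
scrape_pages = [
    {"metadata": {"sourceURL": "https://forum.tufin.com/support/kc/ext/tm/"}}
]
print(start_url_reported(crawl_pages, "https://forum.tufin.com/support/kc/ext/tm/"))
print(start_url_reported(scrape_pages, "https://forum.tufin.com/support/kc/ext/tm/"))
```

With the values above, the crawl case comes back False and the scrape case True, which is the behavior I'm describing.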
I’m pretty sure I tested this URL before, and it was returning the correct sourceURL
Can you please share the payload you're sending?
This is the body:
hitting the endpoint v2/crawl
If you try a crawl in the playground, the same thing happens.
Is there any workaround for this? It's breaking production.
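One possible client-side stopgap, sketched under assumptions: since the caller already knows the start URL it submitted, it can be attached to every returned page under a separate key instead of relying on sourceURL. This does not change what the API returns. The key name "requestedStartURL" and the page shape (a dict with a "metadata" dict) are hypothetical choices for illustration.

```python
import copy


def tag_with_start_url(pages, start_url, key="requestedStartURL"):
    """Return copies of the crawl pages with the submitted start URL attached.

    `key` is a made-up field name, not something the API defines.
    The input list is left unmodified.
    """
    tagged = []
    for page in pages:
        page_copy = copy.deepcopy(page)
        # Create the metadata dict if a page happens to lack one.
        page_copy.setdefault("metadata", {})[key] = start_url
        tagged.append(page_copy)
    return tagged


pages = [
    {"metadata": {"sourceURL": "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm"}}
]
tagged = tag_with_start_url(pages, "https://forum.tufin.com/support/kc/ext/tm/")
print(tagged[0]["metadata"]["requestedStartURL"])
```

Downstream code can then key on the attached field for the original URL and keep sourceURL only as what the crawler reported.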