Crawl endpoint retrieving wrong sourceURL

I’ve found a mismatch in the crawl endpoint. When I crawl a specific website, the sourceURL is returned as if it were the current (resolved) URL. However, when I run the same URL in scrape mode, the response looks correct and the sourceURL is fine.
The URL I’m testing with is: https://forum.tufin.com/support/kc/ext/tm/
On the crawl I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
On the scrape I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/
10 Replies
amplitudes (3mo ago)
The sourceURL is the user-provided URL from the scrape request - if you're looking for the resolved url, that's under the url key.
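To illustrate the distinction described in this reply, here is a minimal sketch. It assumes each result carries a metadata object with the two keys named above (`sourceURL` and `url`); the `PageMetadata` and `UrlInfo` types and the `describeUrls` helper are hypothetical names made up for this example, not part of any SDK.

```typescript
// Hypothetical shape: only the two keys discussed in this thread.
interface PageMetadata {
  sourceURL?: string; // URL as submitted in the request
  url?: string;       // URL after any redirects were followed
}

interface UrlInfo {
  requested: string | undefined;
  resolved: string | undefined;
  redirected: boolean;
}

// Report the requested URL, the resolved URL, and whether they differ.
function describeUrls(metadata: PageMetadata): UrlInfo {
  const requested = metadata.sourceURL;
  // Fall back to sourceURL when no resolved URL is present.
  const resolved = metadata.url ?? metadata.sourceURL;
  return {
    requested,
    resolved,
    redirected:
      requested !== undefined &&
      resolved !== undefined &&
      requested !== resolved,
  };
}

const urlInfo = describeUrls({
  sourceURL: "https://forum.tufin.com/support/kc/ext/tm/",
  url: "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm",
});
console.log(urlInfo.redirected);
```

With the two URLs from the original post, `redirected` is true because the requested and resolved URLs differ.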
Lucas Mendonça (OP, 3mo ago)
Yeah, I see what you mean. The issue is that the crawl endpoint isn’t returning the user-provided source URL.
For example, I crawled this website: https://forum.tufin.com/support/kc/ext/tm/
But the sourceURL came back as: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
In this case, shouldn’t the sourceURL be the original one provided: https://forum.tufin.com/support/kc/ext/?
micah.stairs (3mo ago)
The difference there is that when you do a crawl, you are only specifying a starting URL of the first page, not all of the other webpages.
Lucas Mendonça (OP, 3mo ago)
But what should the sourceURL of the first page be? The one the user specified, right?
Gaurav Chadha (3mo ago)
Firecrawl Docs
Crawl | Firecrawl
Firecrawl can recursively search through a urls subdomains, and gather the content
Lucas Mendonça (OP, 3mo ago)
@Gaurav Chadha, there is no specification of sourceURL on the link you sent. Based on the docs, sourceURL is normally used to identify the original source URL, because some sites redirect you, for example firecrawl.dev to www.firecrawl.dev.
This is exactly my case. I crawled this URL: https://forum.tufin.com/support/kc/ext/tm/ and the sourceURL for this page became: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
This happens because https://forum.tufin.com/support/kc/ext/tm/ redirects to the other page, and in the process the sourceURL is being overridden. If you try to crawl this website, you will see what I mean.
I’m pretty sure I tested this URL before, and it was returning the correct sourceURL.
Gaurav Chadha (3mo ago)
Can you please share the payload you're sending?
Lucas Mendonça (OP, 3mo ago)
This is the body:
{
https://forum.tufin.com/support/kc/ext/tm/,
limit: LIMIT_PAGES,
maxDiscoveryDepth: 10,
allowExternalLinks: false,
scrapeOptions: {
formats: ["markdown", { type: "changeTracking" }],
maxAge: 0,
mobile: false,
onlyMainContent: true,
},
}
I'm hitting the v2/crawl endpoint. If you try the crawl playground, the same thing happens. Is there any workaround for this? It's breaking production.
Gaurav Chadha (3mo ago)
That's because if you open https://forum.tufin.com/support/kc/ext/tm/ directly in the browser, it redirects to https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm. The redirect is performed by the website, so the sourceURL reflects the redirected URL. It is not controlled by Firecrawl. Hope this clears up the question about sourceURL.
Also, the payload you shared is wrong. Use this corrected payload:
{
"url": "https://forum.tufin.com/support/kc/ext/tm/",
"limit": 10,
"maxDiscoveryDepth": 10,
"allowExternalLinks": false,
"scrapeOptions": {
"formats": ["markdown", { "type": "changeTracking" }],
"maxAge": 0,
"mobile": false,
"onlyMainContent": true
}
}
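The corrected payload is plain JSON with an explicit "url" key. As a rough sketch, it could be prepared for a POST to the v2/crawl endpoint mentioned in this thread as shown below. The base URL default, the Bearer authorization header, the placeholder API key, and the helper name `buildCrawlRequest` are assumptions made for this example and should be checked against the official API reference.

```typescript
interface CrawlRequestBody {
  url: string;
  limit: number;
  maxDiscoveryDepth: number;
  allowExternalLinks: boolean;
  scrapeOptions: {
    formats: Array<string | { type: string }>;
    maxAge: number;
    mobile: boolean;
    onlyMainContent: boolean;
  };
}

interface PreparedRequest {
  endpoint: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the request but does not send it.
// ASSUMPTION: the base URL and the Bearer auth scheme are not from the thread.
function buildCrawlRequest(
  body: CrawlRequestBody,
  apiKey: string,
  baseUrl: string = "https://api.firecrawl.dev"
): PreparedRequest {
  return {
    endpoint: `${baseUrl}/v2/crawl`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      // Serialize the body so every key, including "url", is quoted JSON.
      body: JSON.stringify(body),
    },
  };
}

const prepared = buildCrawlRequest(
  {
    url: "https://forum.tufin.com/support/kc/ext/tm/",
    limit: 10,
    maxDiscoveryDepth: 10,
    allowExternalLinks: false,
    scrapeOptions: {
      formats: ["markdown", { type: "changeTracking" }],
      maxAge: 0,
      mobile: false,
      onlyMainContent: true,
    },
  },
  "YOUR_API_KEY" // placeholder, not a real key
);
console.log(prepared.endpoint);
```

The resulting `endpoint` and `init` values can be passed to `fetch(prepared.endpoint, prepared.init)`; the sketch stops short of sending the request.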
Lucas Mendonça (OP, 3mo ago)
But isn't sourceURL meant to record the URL as it was before redirecting? And why does it work in the scrape endpoint?
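No workaround was confirmed in this thread. One possible client-side approach, sketched below, is to keep the start URL you submitted and relabel the crawled page whose sourceURL equals the known redirect target. This assumes you already know the redirect target (for instance from a scrape of the start URL, which in this thread returned the expected sourceURL) and that each crawled page exposes `metadata.sourceURL`. The `CrawledPage` type and `restoreRequestedUrl` helper are hypothetical and not a Firecrawl feature.

```typescript
// Hypothetical shape of one crawled page; only the fields used here.
interface CrawledPage {
  markdown?: string;
  metadata: { sourceURL?: string; url?: string };
}

// Returns new page objects. Any page whose sourceURL equals the redirect
// target gets sourceURL reset to the originally requested URL, and the
// redirect target is kept under `url` if `url` was not already set.
function restoreRequestedUrl(
  pages: CrawledPage[],
  requestedUrl: string,
  redirectTarget: string
): CrawledPage[] {
  return pages.map((page) =>
    page.metadata.sourceURL === redirectTarget
      ? {
          ...page,
          metadata: {
            ...page.metadata,
            sourceURL: requestedUrl,
            url: page.metadata.url ?? redirectTarget,
          },
        }
      : page
  );
}

const startUrl = "https://forum.tufin.com/support/kc/ext/tm/";
const redirectedTo =
  "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm";

const fixedPages = restoreRequestedUrl(
  [
    { metadata: { sourceURL: redirectedTo } },
    { metadata: { sourceURL: "https://example.com/other-page" } }, // made-up page
  ],
  startUrl,
  redirectedTo
);
console.log(fixedPages[0].metadata.sourceURL);
```

Only the page matching the redirect target is rewritten; every other page is returned unchanged, so the mapping stays safe when the crawl discovers many pages.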
