Crawl endpoint retrieving wrong sourceURL

I’ve found a mismatch in the crawl endpoint. When I crawl a specific website, the sourceURL is returned as if it were the current (resolved) URL. However, when I run the same URL in scrape mode, the response looks correct and the sourceURL is fine.
The URL I’m testing with is: https://forum.tufin.com/support/kc/ext/tm/
On the crawl I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
On the scrape I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/
8 Replies
amplitudes, 4d ago
The sourceURL is the user-provided URL from the scrape request - if you're looking for the resolved url, that's under the url key.
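To make the distinction in this reply concrete, here is a minimal sketch of reading both fields from one page's metadata and checking whether the page was redirected. The `PageMetadata` and `UrlPair` shapes and the helper names are hypothetical; only the `sourceURL` and `url` field names come from the thread.

```typescript
// Hypothetical shape: only the two fields discussed in this thread.
interface PageMetadata {
  sourceURL?: string; // the URL as requested, per the reply above
  url?: string; // the resolved URL, per the reply above
}

interface UrlPair {
  requested: string | undefined;
  resolved: string | undefined;
  redirected: boolean;
}

// Compare two URLs while ignoring a fragment and trailing slashes.
function sameUrl(a: string, b: string): boolean {
  const normalize = (u: string): string => {
    const hashAt = u.indexOf("#");
    const withoutHash = hashAt === -1 ? u : u.slice(0, hashAt);
    return withoutHash.replace(/\/+$/, "");
  };
  return normalize(a) === normalize(b);
}

// Report the requested and resolved URLs and whether they differ.
function describeUrls(meta: PageMetadata): UrlPair {
  const requested = meta.sourceURL;
  const resolved = meta.url ?? meta.sourceURL;
  const redirected =
    requested !== undefined &&
    resolved !== undefined &&
    !sameUrl(requested, resolved);
  return { requested, resolved, redirected };
}
```

If `sourceURL` behaves as described in this reply, `redirected` should be true for a starting URL that redirects elsewhere.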
Lucas Mendonça (OP), 4d ago
Yeah, I see what you mean. The issue is that the crawl endpoint isn’t returning the user-provided source URL. For example, I crawled this website: https://forum.tufin.com/support/kc/ext/tm/ But the sourceURL came back as: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm In this case, shouldn’t the sourceURL be the original one provided — https://forum.tufin.com/support/kc/ext/?
micah.stairs, 4d ago
The difference there is that when you do a crawl, you are only specifying a starting URL of the first page, not all of the other webpages.
Lucas Mendonça (OP), 4d ago
But what should the sourceURL of the first page be? Shouldn't it be the one the user specified?
Gaurav Chadha, 4d ago
Firecrawl Docs
Crawl | Firecrawl
Firecrawl can recursively search through a urls subdomains, and gather the content
Lucas Mendonça (OP), 3d ago
@Gaurav Chadha, the link you sent doesn't specify anything about sourceURL.
Based on the docs, sourceURL is normally used to identify the original source URL, because some sites redirect you, for example firecrawl.dev to www.firecrawl.dev.
This is exactly my case. I crawled this URL: https://forum.tufin.com/support/kc/ext/tm/ and the sourceURL for this page became: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm
This basically happens because https://forum.tufin.com/support/kc/ext/tm/ redirects to the other page, and in the process the sourceURL is overridden. If you try to crawl this website, you will see what I mean.
I’m pretty sure I tested this URL before, and it was returning the correct sourceURL.
Gaurav Chadha, 3d ago
Can you please share the payload you're sending?
Lucas Mendonça (OP), 2d ago
This is the body:
{
  url: "https://forum.tufin.com/support/kc/ext/tm/",
  limit: LIMIT_PAGES,
  maxDiscoveryDepth: 10,
  allowExternalLinks: false,
  scrapeOptions: {
    formats: ["markdown", { type: "changeTracking" }],
    maxAge: 0,
    mobile: false,
    onlyMainContent: true,
  },
}
I'm hitting the v2/crawl endpoint. If you try a crawl in the playground, the same thing happens. Is there any workaround for this? It's breaking production.
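One possible client-side workaround, sketched here and not confirmed by anyone in the thread: since the caller knows both the starting URL and the page it redirects to, the crawl results can be post-processed to put the starting URL back into sourceURL. The `CrawledPage` shape and the `restoreStartUrl` helper are hypothetical, and the sketch assumes each result carries its URLs under `metadata`.

```typescript
// Hypothetical result shape; only sourceURL and url are taken from the thread.
interface CrawledPage {
  markdown?: string;
  metadata: { sourceURL?: string; url?: string };
}

// Return a copy of the results in which the page that reports the redirect
// target as its sourceURL gets the starting URL back. The redirect target is
// kept under `url` so no information is lost.
function restoreStartUrl(
  pages: CrawledPage[],
  startUrl: string,
  redirectTarget: string,
): CrawledPage[] {
  return pages.map((page) =>
    page.metadata.sourceURL === redirectTarget
      ? {
          ...page,
          metadata: {
            ...page.metadata,
            sourceURL: startUrl,
            url: page.metadata.url ?? redirectTarget,
          },
        }
      : page,
  );
}

// Usage with the URLs from this thread.
const startUrl = "https://forum.tufin.com/support/kc/ext/tm/";
const redirectTarget =
  "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm";
const patched = restoreStartUrl(
  [{ metadata: { sourceURL: redirectTarget } }],
  startUrl,
  redirectTarget,
);
console.log(patched[0]?.metadata.sourceURL);
```

This only relabels the one page whose sourceURL matches the known redirect target. All other pages pass through unchanged, and the input array is not mutated.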