Firecrawl•3mo ago

Crawl endpoint retrieving wrong sourceURL

I’ve found a mismatch in the crawl endpoint. When I crawl a specific website, the sourceURL is returned as if it were the current (resolved) URL. However, when I run the same URL in scrape mode, the response looks correct and the sourceURL is fine.

The URL I’m testing with is: https://forum.tufin.com/support/kc/ext/tm/

On the crawl I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm

On the scrape I receive this sourceURL: https://forum.tufin.com/support/kc/ext/tm/

10 Replies

amplitudes•3mo ago

The sourceURL is the user-provided URL from the scrape request - if you're looking for the resolved url, that's under the url key.
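To make the distinction concrete, here is a minimal sketch of reading both fields from a single returned page. It assumes each page is a dict whose `metadata` object carries the `sourceURL` and `url` keys mentioned above; that nesting, the helper name `requested_and_resolved`, and the sample values are illustrative assumptions, not something confirmed in this thread.

```python
from typing import Any, Mapping, Optional, Tuple


def requested_and_resolved(page: Mapping[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (sourceURL, url) from a page's metadata; None where a key is absent."""
    metadata = page.get("metadata") or {}
    return metadata.get("sourceURL"), metadata.get("url")


# Hypothetical page, shaped the way a scrape of the redirecting URL is described here:
# sourceURL holds what the user asked for, url holds where the site ended up.
sample_page = {
    "markdown": "...",
    "metadata": {
        "sourceURL": "https://forum.tufin.com/support/kc/ext/tm/",
        "url": "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm",
    },
}

source_url, resolved_url = requested_and_resolved(sample_page)
print("sourceURL:", source_url)
print("url:      ", resolved_url)
```

If the two values differ, the site redirected somewhere between the request and the final page.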

Lucas MendonçaOP•3mo ago

Yeah, I see what you mean. The issue is that the crawl endpoint isn’t returning the user-provided source URL.

For example, I crawled this website: https://forum.tufin.com/support/kc/ext/tm/

But the sourceURL came back as: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm

In this case, shouldn’t the sourceURL be the original one provided (https://forum.tufin.com/support/kc/ext/)?

micah.stairs•3mo ago

The difference there is that when you do a crawl, you are only specifying a starting URL of the first page, not all of the other webpages.

Lucas MendonçaOP•3mo ago

But what should the sourceURL of the first page be? It should be the one the user specified, shouldn't it?

Gaurav Chadha•3mo ago

https://docs.firecrawl.dev/features/crawl

Lucas MendonçaOP•3mo ago

@Gaurav Chadha, the link you sent doesn't say anything about sourceURL.

Based on the docs, the sourceURL is normally used to identify the original source URL, because some sites redirect you, for example firecrawl.dev to www.firecrawl.dev.

This is exactly my case. I crawled this URL: https://forum.tufin.com/support/kc/ext/tm/ and the sourceURL for it became: https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm

This basically happens because https://forum.tufin.com/support/kc/ext/tm/ redirects to the other page, and in the process the sourceURL is being overridden. If you try to crawl this website, you will see what I mean.

I’m pretty sure I tested this URL before, and it was returning the correct sourceURL.

Gaurav Chadha•3mo ago

Can you please share the payload you're sending?

Lucas MendonçaOP•3mo ago

This is the body:

{
  url: "https://forum.tufin.com/support/kc/ext/tm/",
  limit: LIMIT_PAGES,
  maxDiscoveryDepth: 10,
  allowExternalLinks: false,
  scrapeOptions: {
    formats: ["markdown", { type: "changeTracking" }],
    maxAge: 0,
    mobile: false,
    onlyMainContent: true,
  },
}

I'm hitting the v2/crawl endpoint. If you try the playground in crawl mode, the same thing will happen. Is there any workaround for this? It's breaking production.
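On the workaround question, one client-side stopgap is to keep track of the starting URL yourself and attach it to the matching page after the crawl returns. The sketch below is only that, a sketch: it assumes each page is a dict with `metadata.sourceURL`, that you already know where the start URL redirects (for example from opening it in a browser), and it writes to a made-up `requestedURL` field that is not part of Firecrawl's response. The second sample path is invented for illustration.

```python
from typing import Any, Dict, Iterable, List


def normalize(url: str) -> str:
    """Compare URLs without regard to a trailing slash."""
    return url.rstrip("/")


def tag_start_page(
    pages: Iterable[Dict[str, Any]],
    start_url: str,
    resolved_start_url: str,
) -> List[Dict[str, Any]]:
    """Return copies of pages; the page whose sourceURL matches either form of
    the start URL gets a client-side 'requestedURL' set to start_url."""
    candidates = {normalize(start_url), normalize(resolved_start_url)}
    tagged = []
    for page in pages:
        page_copy = dict(page)
        metadata = dict(page_copy.get("metadata") or {})
        if normalize(metadata.get("sourceURL") or "") in candidates:
            metadata["requestedURL"] = start_url  # hypothetical field, added locally
        page_copy["metadata"] = metadata
        tagged.append(page_copy)
    return tagged


START = "https://forum.tufin.com/support/kc/ext/tm/"
RESOLVED = "https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm"

# Hypothetical crawl output: the first entry mirrors the behaviour reported in this thread.
pages = [
    {"markdown": "...", "metadata": {"sourceURL": RESOLVED}},
    {"markdown": "...", "metadata": {"sourceURL": START + "Content/ext/tm/other.htm"}},
]

tagged = tag_start_page(pages, START, RESOLVED)
for page in tagged:
    print(page["metadata"].get("requestedURL"), "<-", page["metadata"]["sourceURL"])
```

This leaves the original pages untouched and only helps for the first page, since that is the only URL the caller supplies in a crawl.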

Gaurav Chadha•3mo ago

That's because if you open https://forum.tufin.com/support/kc/ext/tm/ directly in the browser, it redirects to https://forum.tufin.com/support/kc/ext/tm/Content/ext/tm/intro.htm. The redirect is done by the website, so the sourceURL reflects the redirected page. It is not controlled by Firecrawl. I hope this clears up the question about sourceURL.

Also, the payload you shared is wrong. Use this corrected payload:

{
  "url": "https://forum.tufin.com/support/kc/ext/tm/",
  "limit": 10,
  "maxDiscoveryDepth": 10,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown", { "type": "changeTracking" }],
    "maxAge": 0,
    "mobile": false,
    "onlyMainContent": true
  }
}

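For anyone assembling this body in code rather than by hand, a minimal sketch follows. It builds the same payload as a Python dict and serializes it with `json.dumps`, which guarantees quoted keys and a proper `"url"` field. The base URL `https://api.firecrawl.dev`, the Bearer authorization header, and the helper name `build_crawl_request` are assumptions here; the thread itself only names the `v2/crawl` path. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Assumption: public API base URL; only the "v2/crawl" path appears in this thread.
API_BASE = "https://api.firecrawl.dev"


def build_crawl_request(start_url: str, api_key: str, limit: int = 10) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the crawl payload as JSON."""
    payload = {
        "url": start_url,  # the starting page; other pages are discovered by the crawl
        "limit": limit,
        "maxDiscoveryDepth": 10,
        "allowExternalLinks": False,
        "scrapeOptions": {
            "formats": ["markdown", {"type": "changeTracking"}],
            "maxAge": 0,
            "mobile": False,
            "onlyMainContent": True,
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/v2/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_crawl_request("https://forum.tufin.com/support/kc/ext/tm/", api_key="YOUR_API_KEY")
print(request.full_url)
print(request.data.decode("utf-8"))
# To actually send it, pass the request to urllib.request.urlopen(request).
```

Printing the serialized body before sending is a quick way to confirm that the `url` key is present and that the JSON is valid.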

Lucas MendonçaOP•3mo ago

But isn't the sourceURL originally meant to record the URL before redirecting? And why does it work in the scrape endpoint?
