Firecrawl•5d ago

Different to actual HTML despite scraping rawHtml

It looks like scraping with the rawHtml format returns a slightly different HTML body from the actual HTML (as given by curl or a web browser). The docs say that rawHtml is "with no modifications" (see https://docs.firecrawl.dev/features/scrape#scrape-formats), but that doesn't seem to be the case: the output looks modified. For example, scraping https://www.example.com with these options gives different HTML:

formats: ["rawHtml"],
onlyMainContent: false,
maxAge: 0,
parsers: [],
blockAds: true


Noticeable differences in HTML:
* extra line breaks
* @charset in <style> tag
* <p> tag around <a>

Other webpages result in similar and other differences. Is it possible to receive the exact original HTML?
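One way to pin down differences like these is to diff the directly fetched HTML against the scraped HTML line by line. The following is a minimal sketch using Python's standard difflib; the two HTML strings are hypothetical stand-ins, not real scrape output, and `html_diff` is an illustrative helper name.

```python
import difflib


def html_diff(original: str, scraped: str) -> list[str]:
    """Return unified-diff lines between two HTML strings."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            scraped.splitlines(),
            fromfile="direct",
            tofile="firecrawl",
            lineterm="",
        )
    )


# Hypothetical inputs for illustration only (not real scrape output).
original = '<div>\n<a href="/more">More</a>\n</div>'
scraped = '<div>\n\n<p><a href="/more">More</a></p>\n</div>'

for line in html_diff(original, scraped):
    print(line)
```

Lines prefixed with `-` appear only in the direct response and lines prefixed with `+` only in the scraped one, so an added wrapper tag or extra blank line shows up immediately.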

9 Replies

Gaurav Chadha•5d ago

Hi @ash4cord, rawHtml still goes through the HTML transformation pipeline, which introduces modifications like extra line breaks and @charset handling. rawHtml is the raw HTML from the page, while html takes the raw HTML and applies the following modifications:
- Makes all image and <a> links absolute
- Removes all <head>, <meta>, <noscript>, <style>, and <script> tags
- Removes/includes all tags listed in the excludeTags/includeTags options of the request
- Excludes a large list of non-main content tags (identified by class or ID) if onlyMainContent is enabled in the request (true by default)

ash4cordOP•5d ago

Thanks. The extra line breaks and the @charset marker are still OK, but adding tags (like the <p> around <a>) can break CSS selectors. Is that something that can be avoided?
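To illustrate why an added wrapper matters, here is a small sketch that mimics a child selector such as `div > a` using only Python's standard html.parser. The HTML snippets are hypothetical, and `count_direct_child_anchors` is an illustrative helper, not part of any scraping library.

```python
from html.parser import HTMLParser


class _AnchorParentCounter(HTMLParser):
    """Count <a> elements whose direct parent is a given tag."""

    # Void elements never get a closing tag, so they are not pushed.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self, parent: str) -> None:
        super().__init__()
        self.parent = parent
        self.stack: list[str] = []
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.stack and self.stack[-1] == self.parent:
            self.count += 1
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack.pop() != tag:
                pass


def count_direct_child_anchors(html: str, parent: str) -> int:
    """Roughly equivalent to counting matches of the selector `parent > a`."""
    parser = _AnchorParentCounter(parent)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets: the second has a <p> wrapped around the <a>.
original = '<div><a href="/more">More</a></div>'
wrapped = '<div><p><a href="/more">More</a></p></div>'

print(count_direct_child_anchors(original, "div"))
print(count_direct_child_anchors(wrapped, "div"))
```

The same `div > a` query that matches the original markup finds nothing once the `<p>` wrapper is present, which is the kind of breakage described above.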

Gaurav Chadha•5d ago

Unfortunately not. It always normalizes the document structure, which includes adding <html>, <body>, or <div> wrapper tags that can break CSS selectors. Could you share why you need the HTML without normalization? For that, you can copy the page source directly.

ash4cordOP•4d ago

Our workflow scrapes directly first, then through Firecrawl if blocked, and finally through a different service if Firecrawl is blocked too. The direct request and the other service return original, consistent raw HTML, but Firecrawl returns modified raw HTML, which breaks our parsing. Ideally, Firecrawl should return the original unmodified HTML if rawHtml is set.
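The fallback order described here can be sketched as a simple chain of fetchers. This is a minimal illustration under stated assumptions: each fetcher is a hypothetical callable that returns the HTML or `None` when blocked, and the stub functions stand in for the real direct request, the Firecrawl call, and the third service.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns HTML, or None if the request was blocked.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str, fetchers: list[tuple[str, Fetcher]]
) -> tuple[str, str]:
    """Try each fetcher in order; return (source_name, html) from the first success."""
    for name, fetch in fetchers:
        html = fetch(url)
        if html is not None:
            return name, html
    raise RuntimeError(f"all fetchers were blocked for {url}")


# Hypothetical stubs standing in for the real services.
def direct(url: str) -> Optional[str]:
    return None  # pretend the direct request was blocked


def firecrawl(url: str) -> Optional[str]:
    return "<html><body>via firecrawl</body></html>"


def other_service(url: str) -> Optional[str]:
    return "<html><body>via other service</body></html>"


source, html = fetch_with_fallback(
    "https://www.example.com",
    [("direct", direct), ("firecrawl", firecrawl), ("other", other_service)],
)
print(source)
```

Returning the source name alongside the HTML lets downstream parsing know which service produced the markup, which matters when one source normalizes the document and the others do not.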

Gaurav Chadha•4d ago

Hi there, could you please open a GitHub issue for this at https://github.com/firecrawl/firecrawl/issues? We can track this enhancement and get it updated if more users are running into issues with normalized rawHtml.

ash4cordOP•4d ago

As a workaround, I was able to scrape less-modified, near-original HTML using actions instead of the rawHtml format:

formats: [],
onlyMainContent: false,
maxAge: 0,
parsers: [],
blockAds: true,
actions: [ { "type": "executeJavascript", "script": "document.documentElement.outerHTML" }]

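For reference, the options above can be assembled into a JSON request body as sketched below. The option names and values come from this post; the `url` field and the helper name `build_scrape_body` are assumptions added for illustration, and the sketch deliberately does not show an endpoint or response shape. Note that `document.documentElement.outerHTML` serializes the parsed DOM, so it omits the doctype and is not guaranteed to match the server's original bytes.

```python
import json


def build_scrape_body(url: str, max_age_ms: int = 0) -> dict:
    """Build a scrape request body that returns the DOM via an executeJavascript action."""
    return {
        "url": url,  # assumed field name; not shown in the post above
        "formats": [],
        "onlyMainContent": False,
        "maxAge": max_age_ms,
        "parsers": [],
        "blockAds": True,
        "actions": [
            {
                "type": "executeJavascript",
                "script": "document.documentElement.outerHTML",
            }
        ],
    }


body = build_scrape_body("https://www.example.com")
print(json.dumps(body, indent=2))
```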

Just wondering: does running an executeJavascript action still respect maxAge and fetch from cache (if a time in ms is set), or does it always make a fresh request to be able to execute the given script?

Gaurav Chadha•4d ago

If you set maxAge to 0, it will always make a fresh request: https://docs.firecrawl.dev/advanced-scraping-guide#freshness-and-cache-maxage

ash4cordOP•4d ago

Thanks. I meant: if I set maxAge to a non-zero number and include an executeJavascript action, does it still make a fresh request, or does it use a cached copy (if available)?

Gaurav Chadha•3d ago

Yeah, when you set maxAge to a non-zero number and include an executeJavascript action, Firecrawl will always make a fresh request and not use a cached copy.
