Different from actual HTML despite scraping rawHtml

It looks like scraping with the rawHtml format returns a slightly different HTML body than the actual HTML (as returned by curl or a web browser). The docs say that rawHtml comes "with no modifications" (see https://docs.firecrawl.dev/features/scrape#scrape-formats), but that doesn't seem to be the case; it looks modified. For example, scraping https://www.example.com with these options gives different HTML:
formats: ["rawHtml"],
onlyMainContent: false,
maxAge: 0,
parsers: [],
blockAds: true
Noticeable differences in the HTML:
- extra line breaks
- @charset in the <style> tag
- a <p> tag around <a>
Other webpages show similar and additional differences. Is it possible to receive the exact original HTML?
Gaurav Chadha (5d ago)
Hi @ash4cord, rawHtml still goes through the HTML transformation pipeline, which introduces modifications like the extra line breaks and the @charset handling. rawHtml is the raw HTML from the page; html takes the raw HTML and applies the following modifications:
- Makes all image and <a> links absolute
- Removes all <head>, <meta>, <noscript>, <style>, and <script> tags
- Removes/includes all tags listed in the excludeTags/includeTags options of the request
- Excludes a large list of non-main-content tags (identified by class or ID) if onlyMainContent is enabled in the request (true by default)
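The selector-breaking effect of this kind of normalization can be illustrated with a small stdlib-only sketch (this is not Firecrawl's actual pipeline; the parser class and the sample snippets below are hypothetical). Wrapping a bare <a> in a <p> changes the tag path that a strict child selector such as body > a would match against:

```python
from html.parser import HTMLParser

class PathRecorder(HTMLParser):
    """Records the tag path from the root down to each <a> element."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.paths.append(" > ".join(self.stack + ["a"]))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def a_paths(html):
    p = PathRecorder()
    p.feed(html)
    return p.paths

# Hypothetical before/after snippets mimicking the reported change.
original = "<html><body><a href='/x'>link</a></body></html>"
normalized = "<html><body><p><a href='/x'>link</a></p></body></html>"

print(a_paths(original))    # ['html > body > a']
print(a_paths(normalized))  # ['html > body > p > a']
```

A selector written against the original path (body > a) no longer matches once the intermediate <p> is inserted, which is exactly the breakage discussed below.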
ash4cord (OP, 5d ago)
Thanks. The extra line breaks and the @charset marker are fine, but adding tags (like the <p> around <a>) can break CSS selectors. Is that something that can be avoided?
Gaurav Chadha (5d ago)
Unfortunately not; it always normalizes the document structure, which includes adding <html>, <body>, or <div> wrapper tags that can break CSS selectors. Could you share why you need the unnormalized data? For that, you could copy the page source directly.
ash4cord (OP, 4d ago)
Our workflow scrapes directly first, falls back to Firecrawl if the direct request is blocked, and finally falls back to a different service if Firecrawl is blocked too. The direct request and the other service return the original, consistent raw HTML, but Firecrawl returns modified raw HTML, which breaks our parsing. Ideally, Firecrawl would return the original unmodified HTML when rawHtml is requested.
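The fallback chain described above can be sketched in a few lines; the fetcher functions here are hypothetical stand-ins for a direct HTTP request, a Firecrawl scrape, and the backup service:

```python
def scrape_with_fallback(url, fetchers):
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(url)
            if html:
                return name, html
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all fetchers failed for {url}: {errors}")

# Stubbed usage: the direct fetch "gets blocked", so the next fetcher is used.
def direct(url):
    raise ConnectionError("blocked")

def firecrawl(url):
    return "<html>...</html>"  # stand-in for a real scrape result

source, html = scrape_with_fallback(
    "https://www.example.com", [("direct", direct), ("firecrawl", firecrawl)]
)
print(source)  # "firecrawl", since the direct fetch raised
```

The point of the chain is that every stage is expected to hand back the same shape of data (original raw HTML), which is why a normalized rawHtml from the middle stage breaks parsing downstream.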
Gaurav Chadha (4d ago)
Hi there, could you please open a GitHub issue for this at https://github.com/firecrawl/firecrawl/issues? We can track this enhancement and get it updated if more users run into issues with the normalized rawHtml.
ash4cord (OP, 4d ago)
As a workaround, I was able to scrape less-modified, near-original HTML using actions instead of the rawHtml format:
formats: [],
onlyMainContent: false,
maxAge: 0,
parsers: [],
blockAds: true
actions: [ { "type": "executeJavascript", "script": "document.documentElement.outerHTML" }]
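For reference, the workaround options above can be assembled as a plain request body before sending it to Firecrawl's scrape endpoint. The sketch below only builds the dict; sending the request (and authentication) is left out:

```python
# The workaround options above as a request body; nothing is sent here.
payload = {
    "url": "https://www.example.com",
    "formats": [],
    "onlyMainContent": False,
    "maxAge": 0,
    "parsers": [],
    "blockAds": True,
    "actions": [
        {
            "type": "executeJavascript",
            "script": "document.documentElement.outerHTML",
        }
    ],
}
```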
Just wondering: does running an executeJavascript action still respect maxAge and fetch from cache (if a time in ms is set), or does it always make a fresh request in order to execute the given script?
Gaurav Chadha (4d ago)
If you set maxAge to 0, it will always make a fresh request: https://docs.firecrawl.dev/advanced-scraping-guide#freshness-and-cache-maxage
ash4cord (OP, 4d ago)
Thanks. I meant: if I set maxAge to a non-zero number and include an executeJavascript action, does it still make a fresh request, or does it use a cached copy (if available)?
Gaurav Chadha (3d ago)
Yes. When you set maxAge to a non-zero number and include an executeJavascript action, Firecrawl will always make a fresh request and not use a cached copy.
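The cache semantics described in this thread can be modeled in a few lines (this is an illustrative sketch, not Firecrawl source): an executeJavascript action forces a fresh request, and otherwise a cached copy is reused while it is younger than maxAge:

```python
def should_use_cache(age_ms, max_age_ms, has_js_action):
    """Model of the described semantics: JS actions and maxAge=0 force a refetch."""
    if has_js_action or max_age_ms <= 0:
        return False
    return age_ms <= max_age_ms

print(should_use_cache(5_000, 60_000, False))  # True: fresh enough, no JS action
print(should_use_cache(5_000, 60_000, True))   # False: JS action forces a fresh request
print(should_use_cache(5_000, 0, False))       # False: maxAge 0 always refetches
```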
