Different to actual HTML despite scraping rawHtml
It looks like scraping with the rawHtml format returns a slightly different HTML body from the actual HTML (as returned by curl or a web browser).
The docs say that rawHtml is "with no modifications" (see https://docs.firecrawl.dev/features/scrape#scrape-formats) but that doesn't seem to be the case. It looks modified.
For example, scraping https://www.example.com with these options gives a different HTML:
Noticeable differences in HTML:
* extra line breaks
* @charset in <style> tag
* <p> tag around <a>
Other webpages result in similar and other differences.
Is it possible to receive the exact original HTML?
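To make differences like the ones listed above easy to spot, here is a minimal sketch that diffs two HTML strings line by line with Python's standard difflib. The function name html_diff and the "direct" / "firecrawl_rawHtml" labels are made up for illustration, and the sample strings are hypothetical, not the real example.com markup.

```python
import difflib


def html_diff(original: str, scraped: str) -> list[str]:
    """Return unified-diff lines between two HTML strings.

    An empty list means the two documents are identical line by line.
    """
    return list(
        difflib.unified_diff(
            original.splitlines(),
            scraped.splitlines(),
            fromfile="direct",             # hypothetical label: HTML from curl
            tofile="firecrawl_rawHtml",    # hypothetical label: HTML from the scrape
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical sample documents that mimic the <p>-around-<a> difference.
    direct = "<div>\n<a href='x'>More</a>\n</div>"
    scraped = "<div>\n<p><a href='x'>More</a></p>\n</div>"
    for line in html_diff(direct, scraped):
        print(line)
```

Feeding it the body from a direct request and the rawHtml field from the scrape response should show exactly which lines were added or changed.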
Hi @ash4cord, rawHtml still goes through the HTML transformation pipeline, which introduces modifications such as extra line breaks and @charset handling.
So rawHtml is the raw HTML from the page.
html takes the raw HTML and applies the following modifications:
- Makes all image and <a> links absolute
- Removes all <head> <meta> <noscript> <style> <script> tags
- Removes/includes all tags listed in the excludeTags/includeTags options of the request
- Excludes a large list of non-main content tags (identified by class or ID) if onlyMainContent is enabled in the request (true by default)
Thanks. The extra line breaks and the @charset marker are still OK, but adding tags (like the <p> around <a>) can break CSS selectors, so is that something that can be avoided?
Unfortunately not. It always normalizes the document structure, which includes adding <html>, <body>, or <div> wrapper tags that can break CSS selectors.
Could you share the requirement for unstructured data without normalization? For that, you can directly copy the page source.
Our workflow scrapes directly first, then through Firecrawl if blocked, and finally through a different service if Firecrawl is blocked too. The direct request and the other service provide original and consistent raw HTML, but Firecrawl provides modified raw HTML, which breaks our parsing. Ideally, Firecrawl should return the original unmodified HTML if rawHtml is set.
Hi there, could you please open a GitHub issue for this at https://github.com/firecrawl/firecrawl/issues? We can track this enhancement and get this updated if more users are running into issues with normalized rawHtml.
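For context, the three-stage workflow described above (direct request, then Firecrawl, then another service) could be sketched roughly as follows. This is only an illustration: fetch_with_fallback and the fetcher names are hypothetical, and each fetcher is assumed to be a plain callable that returns the HTML string, or returns None or raises when it is blocked.

```python
from typing import Callable, Optional, Sequence

# A fetcher takes a URL and returns HTML, or None if it was blocked.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str, fetchers: Sequence[tuple[str, Fetcher]]
) -> tuple[str, str]:
    """Try each fetcher in order and return (fetcher_name, html) from the first success."""
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            # Treat any error as "blocked" and move on to the next stage.
            continue
        if html is not None:
            return name, html
    raise RuntimeError(f"all fetchers failed for {url}")


if __name__ == "__main__":
    # Stub fetchers standing in for the real direct / Firecrawl / other-service calls.
    def direct(url: str) -> Optional[str]:
        return None  # pretend the direct request was blocked

    def firecrawl(url: str) -> Optional[str]:
        return "<html><body>via firecrawl</body></html>"

    def other_service(url: str) -> Optional[str]:
        return "<html><body>via other service</body></html>"

    source, html = fetch_with_fallback(
        "https://www.example.com",
        [("direct", direct), ("firecrawl", firecrawl), ("other", other_service)],
    )
    print(source, len(html))
```

Because the same parser runs on whatever stage succeeds, any stage that returns structurally different HTML for the same page breaks the selectors downstream.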
As a workaround, I was able to scrape a less-modified, near-original HTML using actions instead of the rawHtml format:
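A rough sketch of what such a request could look like, written as plain Python dict handling with no network call. Everything here is an assumption rather than the exact options used: the helper names are hypothetical, the script string is a guess at how to return the page markup, and the response path data.actions.javascriptReturns[0].value is assumed from the Firecrawl docs and should be checked against the current API reference.

```python
from typing import Any, Optional


def build_scrape_payload(url: str) -> dict[str, Any]:
    """Build a scrape request body that returns the DOM via an executeJavascript action."""
    return {
        "url": url,
        "actions": [
            {
                "type": "executeJavascript",
                # Assumption: the script's value is returned to the caller.
                "script": "document.documentElement.outerHTML",
            }
        ],
    }


def extract_js_return(response: dict[str, Any]) -> Optional[str]:
    """Pull the first JavaScript return value out of a scrape response.

    Assumed response shape: {"data": {"actions": {"javascriptReturns": [{"value": ...}]}}}
    """
    returns = (
        response.get("data", {}).get("actions", {}).get("javascriptReturns", [])
    )
    if not returns:
        return None
    return returns[0].get("value")


if __name__ == "__main__":
    payload = build_scrape_payload("https://www.example.com")
    print(payload)
    # Hypothetical response, shaped as assumed above.
    fake = {"data": {"actions": {"javascriptReturns": [
        {"type": "string", "value": "<html>...</html>"}]}}}
    print(extract_js_return(fake))
```

Note that outerHTML is the browser's serialization of the parsed DOM, so the result is near-original rather than byte-identical to what curl returns.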
Just wondering: does running an executeJavascript action still respect maxAge and fetch from the cache (if a time in ms is set), or does it always make a fresh request so that it can execute the given script?
If you set maxAge to 0, it will always make a fresh request - https://docs.firecrawl.dev/advanced-scraping-guide#freshness-and-cache-maxage
Thanks. I meant: if I set maxAge to a non-zero number and add an executeJavascript action, does it still make a fresh request or use a cached copy (if available)?
Yes. When you set maxAge to a non-zero number and include an executeJavascript action, Firecrawl will always make a fresh request and not use a cached copy.