onlyMainContent not working for me
'''
import requests
import json
import os

# Example 1: Single URL with minimal options
urls = '["https://misteliquid.co.uk"]'
only_main_content = True
include_html = False
include_raw_html = False
screenshot = False
wait_for = '0'
remove_tags = '[]'
only_include_tags = '[]'
headers = '{}'
replace_paths_with_absolute = False
parse_pdf = False
extraction_mode = "none"
extraction_prompt = ""
extraction_schema = '{}'
timeout = '30000'

# Parse user inputs
urls = json.loads(urls)
only_main_content = bool(only_main_content)
screenshot = bool(screenshot)
wait_for = int(wait_for)
remove_tags = json.loads(remove_tags)
timeout = int(timeout)

results = []
for url in urls:
    # Set up payload for each URL
    payload = {
        "url": url,
        "pageOptions": {
            "onlyMainContent": only_main_content,
            "screenshot": screenshot,
            "waitFor": wait_for,
            "removeTags": remove_tags
        },
        "timeout": timeout
    }
    # Set up headers
    headers = {
        "Authorization": f"Bearer {os.getenv('FIRECRAWL_API_KEY')}",
        "Content-Type": "application/json"
    }
    # Make POST request to Scrape API for each URL
    response = requests.post("https://api.firecrawl.dev/v0/scrape", json=payload, headers=headers)
    # Parse response for each URL
    if response.status_code == 200:
        results.append(response.json())
    else:
        results.append({
            "error": f"Request failed for URL: {url} with status code {response.status_code}",
            "details": response.text
        })

# Set the final result to the list of results
result = results

# Print the result
print(json.dumps(result, indent=2))
'''
I'm getting around 7000 words when running this script in my environment with onlyMainContent set to True, and the output isn't displayed as markdown.
When I use the playground with the same URL I'm getting around 1700 words and the extraction looks a lot cleaner.
What am I doing wrong?
2 Replies
Hey @p3nnywh1stl3, could you send the 7000-word and the 1700-word outputs here? That will help us spot the issue.
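A quick way to compare the two outputs before posting them is to count the words in each. This is only a minimal sketch: it assumes the scrape response keeps its markdown under `data.markdown`, which is an assumption about the response shape rather than something confirmed in this thread, and `word_count` and `markdown_from_response` are hypothetical helper names.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())


def markdown_from_response(response_json: dict) -> str:
    # Assumption: the markdown lives at response_json["data"]["markdown"].
    # Returns an empty string if either key is missing or empty.
    data = response_json.get("data") or {}
    return data.get("markdown") or ""


# Hypothetical sample standing in for a real API response
sample = {"data": {"markdown": "# Title\n\nSome main content here."}}
print(word_count(markdown_from_response(sample)))
```

Running the same count on the playground output and the script output gives two numbers that can be compared directly.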
Heya, thank you, I managed to fix it by adding some remove tags. I presume this is what onlyMainContent does. Is there any documentation on what that feature actually does? I'm not finding it very effective.
["script", "style", "nav", "header", "footer", ".advertisement", ".sidebar", ".nav", ".menu", "#comments", "img", "a"]
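For anyone landing here later, this is roughly how that list could be passed through the same `pageOptions.removeTags` field used in the script above. It is a sketch, not a confirmed recipe: `build_payload` is a hypothetical helper, and the use of class and ID selectors such as `.sidebar` or `#comments` alongside plain tag names is taken from the list above rather than checked against the API.

```python
import json

# Tags and selectors copied from the reply above
REMOVE_TAGS = [
    "script", "style", "nav", "header", "footer",
    ".advertisement", ".sidebar", ".nav", ".menu",
    "#comments", "img", "a",
]


def build_payload(url, remove_tags, only_main_content=True, timeout=30000):
    """Build a request body in the same shape as the original script (hypothetical helper)."""
    return {
        "url": url,
        "pageOptions": {
            "onlyMainContent": only_main_content,
            "removeTags": list(remove_tags),
        },
        "timeout": timeout,
    }


payload = build_payload("https://misteliquid.co.uk", REMOVE_TAGS)
print(json.dumps(payload, indent=2))
```

Note that removing `a` and `img` also strips links and images from the main content itself, so the list may need trimming depending on what the output is used for.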