onlyMainContent not working for me

```python
import requests
import json
import os

# Example 1: Single URL with minimal options
urls = '["https://misteliquid.co.uk"]'
only_main_content = True
include_html = False
include_raw_html = False
screenshot = False
wait_for = '0'
remove_tags = '[]'
only_include_tags = '[]'
headers = '{}'
replace_paths_with_absolute = False
parse_pdf = False
extraction_mode = "none"
extraction_prompt = ""
extraction_schema = '{}'
timeout = '30000'

# Parse user inputs
urls = json.loads(urls)
only_main_content = bool(only_main_content)
screenshot = bool(screenshot)
wait_for = int(wait_for)
remove_tags = json.loads(remove_tags)
timeout = int(timeout)

results = []

for url in urls:
    # Set up payload for each URL
    payload = {
        "url": url,
        "pageOptions": {
            "onlyMainContent": only_main_content,
            "screenshot": screenshot,
            "waitFor": wait_for,
            "removeTags": remove_tags
        },
        "timeout": timeout
    }

    # Set up headers
    headers = {
        "Authorization": f"Bearer {os.getenv('FIRECRAWL_API_KEY')}",
        "Content-Type": "application/json"
    }

    # Make POST request to Scrape API for each URL
    response = requests.post("https://api.firecrawl.dev/v0/scrape", json=payload, headers=headers)

    # Parse response for each URL
    if response.status_code == 200:
        results.append(response.json())
    else:
        results.append({
            "error": f"Request failed for URL: {url} with status code {response.status_code}",
            "details": response.text
        })

# Set the final result to the list of results
result = results

# Print the result
print(json.dumps(result, indent=2))
```

I'm getting around 7000 words when running this script in my environment with onlyMainContent set to True, and the output isn't displayed as markdown either. When I use the playground with the same URL, I get around 1700 words and the extraction looks a lot cleaner. What am I doing wrong?
2 Replies
Caleb (15mo ago)
Hey @p3nnywh1stl3, could you post both the 7000-word output and the 1700-word output here? That will help us spot the issue.
p3nnywh1stl3 (OP, 15mo ago)
Heya, thank you. I managed to fix it by adding some remove tags:

["script", "style", "nav", "header", "footer", ".advertisement", ".sidebar", ".nav", ".menu", "#comments", "img", "a"]

I presume this is what onlyMainContent is supposed to do. Is there any documentation on what that feature actually does? I'm not finding it very effective.
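For anyone landing here with the same problem, the workaround could be sketched roughly as below. It reuses the `pageOptions.removeTags` field from the script in the question and the tag list from the reply above. The helper names (`build_payload`, `extract_markdown`, `word_count`, `scrape`) are made up for illustration, and the assumption that the markdown sits under `data.markdown` in the v0 response is not confirmed anywhere in this thread, so check it against an actual response.

```python
import os

# Tags and selectors the original poster reported adding to removeTags.
REMOVE_TAGS = [
    "script", "style", "nav", "header", "footer",
    ".advertisement", ".sidebar", ".nav", ".menu",
    "#comments", "img", "a",
]


def build_payload(url, remove_tags=REMOVE_TAGS, timeout_ms=30000):
    """Build a scrape payload with the same fields as the script in the question."""
    return {
        "url": url,
        "pageOptions": {
            "onlyMainContent": True,
            "removeTags": list(remove_tags),
        },
        "timeout": timeout_ms,
    }


def extract_markdown(response_json):
    """Assumption: the markdown is at data.markdown; returns "" if it is absent."""
    return (response_json.get("data") or {}).get("markdown") or ""


def word_count(text):
    """Whitespace-separated word count, for comparing against the playground."""
    return len(text.split())


def scrape(url):
    """POST to the v0 scrape endpoint used in the question (needs FIRECRAWL_API_KEY)."""
    import requests  # third-party; imported here so the helpers above work without it

    headers = {
        "Authorization": f"Bearer {os.getenv('FIRECRAWL_API_KEY')}",
        "Content-Type": "application/json",
    }
    response = requests.post(
        "https://api.firecrawl.dev/v0/scrape",
        json=build_payload(url),
        headers=headers,
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```

Calling `word_count(extract_markdown(scrape("https://misteliquid.co.uk")))` with and without the extra tags gives a quick number to compare against the playground output. Note that removing `a` and `img` also drops every link and image from the result, which may or may not be what you want.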
