Firecrawl batch scrape fails when some URLs cannot be crawled
Hello everyone, I am using the Firecrawl Python SDK. When I call batch_scrape_urls and even one URL cannot be crawled, the call fails and logs an error. I would like it to skip the failing URL and process the remaining ones. I looked for the "ignoreInvalidURLs" attribute but can't find it in the Python SDK. Here is my list of URLs: ['https://www.britannica.com/biography/Stephen-Colbert', 'https://www.biography.com/movies-tv/stephen-colbert', 'https://www.cbs.com/shows/the-late-show-with-stephen-colbert/', 'https://www.imdb.com/name/nm0170306/bio/', 'https://en.wikipedia.org/wiki/Stephen_Colbert', 'https://en.wikipedia.org/wiki/The_Late_Show_with_Stephen_Colbert', 'https://www.televisionacademy.com/bios/stephen-colbert', 'https://en.wikipedia.org/wiki/List_of_awards_and_nominations_received_by_Stephen_Colbert', 'https://www.imdb.com/name/nm0170306/awards/', 'https://www.youtube.com/channel/UCMtFAi84ehTSYSE9XoHefig']
5 Replies
The Python SDK does support the ignoreInvalidURLs parameter. I just tested this on my end and it worked as expected:
{
    "urls": ['https://www.britannica.com/biography/Stephen-Colbert', 'https://www.biography.com/movies-tv/stephen-colbert', 'https://www.cbs.com/shows/the-late-show-with-stephen-colbert/', 'https://www.imdb.com/name/nm0170306/bio/', 'https://en.wikipedia.org/wiki/Stephen_Colbert', 'https://en.wikipedia.org/wiki/The_Late_Show_with_Stephen_Colbert', 'https://www.televisionacademy.com/bios/stephen-colbert', 'https://en.wikipedia.org/wiki/List_of_awards_and_nominations_received_by_Stephen_Colbert', 'https://www.imdb.com/name/nm0170306/awards/', 'https://www.youtube.com/channel/UCMtFAi84ehTSYSE9XoHefig'],
    "ignoreInvalidURLs": True,
}
Can you make sure you're using the latest version of the Python SDK?
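For anyone reading along, here is a minimal sketch of how the same parameters could be sent as a plain JSON request instead of going through the SDK. It assumes the v1 REST endpoint `https://api.firecrawl.dev/v1/batch/scrape` and bearer-token authentication, so please check the API reference for your version. `build_batch_payload` and `build_request` are hypothetical helpers written for this example, not part of the SDK, and the request is only prepared, not sent.

```python
import json
import urllib.request

# Assumed endpoint; confirm it against the Firecrawl API reference.
BATCH_SCRAPE_URL = "https://api.firecrawl.dev/v1/batch/scrape"


def build_batch_payload(urls, ignore_invalid=True):
    """Hypothetical helper: build the JSON body for a batch scrape request."""
    # Strip stray whitespace so a leading space does not make a URL invalid.
    cleaned = [u.strip() for u in urls if u and u.strip()]
    return {"urls": cleaned, "ignoreInvalidURLs": ignore_invalid}


def build_request(payload, api_key):
    """Hypothetical helper: prepare (but do not send) the POST request."""
    return urllib.request.Request(
        BATCH_SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = build_batch_payload([
    "https://en.wikipedia.org/wiki/Stephen_Colbert",
    " https://www.imdb.com/name/nm0170306/bio/",  # leading space is stripped
])
request = build_request(payload, api_key="YOUR_API_KEY")

# To actually send it (requires a real API key and network access):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Sending the raw request can help tell apart a server-side problem from an SDK version that does not forward the parameter.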
Dear @Firecrawl Team, I checked the source code of the Firecrawl Python SDK. It doesn't mention ignore_invalid_urls, and the kwargs validation function only allows these parameters for batch_scrape_urls:
"batch_scrape_urls": {"formats", "headers", "include_tags", "exclude_tags", "only_main_content",
"wait_for", "timeout", "location", "mobile", "skip_tls_verification",
"remove_base64_images", "block_ads", "proxy", "extract", "json_options",
"actions", "agent", "webhook"},
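To illustrate the concern, here is a minimal sketch of how an allow-list check of this kind would reject a keyword that is not listed. This is not the SDK's actual code: `validate_kwargs` and its error message are hypothetical, and only the set of parameter names is copied from the snippet above.

```python
# Allowed keyword arguments per method, copied from the set quoted above.
ALLOWED_KWARGS = {
    "batch_scrape_urls": {
        "formats", "headers", "include_tags", "exclude_tags", "only_main_content",
        "wait_for", "timeout", "location", "mobile", "skip_tls_verification",
        "remove_base64_images", "block_ads", "proxy", "extract", "json_options",
        "actions", "agent", "webhook",
    },
}


def validate_kwargs(method_name, kwargs):
    """Hypothetical check: raise if any keyword is not in the allow-list."""
    allowed = ALLOWED_KWARGS.get(method_name, set())
    unknown = set(kwargs) - allowed
    if unknown:
        names = ", ".join(sorted(unknown))
        raise ValueError(f"Unsupported parameter(s) for {method_name}: {names}")


# A listed keyword passes silently.
validate_kwargs("batch_scrape_urls", {"formats": ["markdown"]})

# An unlisted keyword such as ignore_invalid_urls is rejected.
try:
    validate_kwargs(
        "batch_scrape_urls",
        {"formats": ["markdown"], "ignore_invalid_urls": True},
    )
    message = None
except ValueError as exc:
    message = str(exc)
```

If the installed SDK version validates keywords this way, the parameter would be rejected before any request reaches the API, which would explain why it works on one setup and not another.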

Did you try it, though? It was working on my side with the latest version of the SDK.
Can you share your code with me?
Here you go!