Python API doesn't provide a nice way to iterate over crawled data

Checking the crawl status (https://docs.firecrawl.dev/sdks/python#checking-crawl-status) gives a "next" link (e.g. https://api.firecrawl.dev/v1/crawl/789e6a93-81b6-44f5-9f0e-67f6263059e8?skip=0), but there doesn't appear to be a way to fetch that data through FirecrawlApp; instead I have to use a separate Python library like "requests" to make the request myself. Am I missing something?
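For reference, the raw status response looks roughly like this (values illustrative; field names as documented for the v1 crawl endpoint):

# GET https://api.firecrawl.dev/v1/crawl/{crawl_id}
{
    "status": "scraping",
    "total": 36,
    "completed": 10,
    "data": ["...scraped documents so far..."],
    "next": "https://api.firecrawl.dev/v1/crawl/789e6a93-81b6-44f5-9f0e-67f6263059e8?skip=10"
}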
rafaelmiller · 11mo ago
Hi @micah.stairs! Good catch! I think we missed this case, tbh. I'll work on adding support for fetching "next" links in the coming days. Currently, check_crawl_status takes the crawl id and returns the status, but it doesn't follow the "next" URLs. FYI, if you're using the SDK's crawl_url, it handles pagination automatically.
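For context, the two modes look roughly like this (a minimal sketch; exact return shapes vary by SDK version, so treat the dict access as illustrative):

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")

# Synchronous: crawl_url waits for the crawl to finish and follows the
# "next" links internally, so every page ends up in the result.
result = app.crawl_url("https://example.com")

# Asynchronous: async_crawl_url returns immediately with a crawl id;
# check_crawl_status then reports progress, but (at the time of this
# thread) did not follow the "next" links for you.
job = app.async_crawl_url("https://example.com")
status = app.check_crawl_status(job["id"])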
micah.stairs (OP) · 11mo ago
I'm using the async endpoint. I've cobbled together something pretty hacky, but I'm looking forward to cleaning up my code once this is ready.
rafaelmiller · 11mo ago
Hey @micah.stairs, it's in review now 🙂 https://github.com/mendableai/firecrawl/pull/880. Thank you again for the feedback!
micah.stairs (OP) · 11mo ago
@rafaelmiller can you clarify what that change does? It looks like, once the scraping is done, check_crawl_status will automatically fetch all of the scraped data. What if other users want a quick status check without downloading all of that data? Should this be configurable with a function parameter?

@rafaelmiller I'm also concerned that too much data will be fetched. I think this API needs more thoughtful design.

FYI, I was able to simplify the logic on my side by eliminating the check_crawl_status call entirely. Now I'm using the requests library to fetch the paginated data directly. My code looks something like this:
import requests

# If the page cursor is not provided, get the first page
if next_page is None:
    next_page = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}"

# Retrieve the paginated crawled data
response = requests.get(
    next_page, headers=self.firecrawl_client._prepare_headers()
)

# Handle errors while retrieving crawled data
if response.status_code != 200:
    self.firecrawl_client._handle_error(response, "get crawled data")
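The loop around that snippet ends up looking roughly like this (simplified sketch; _prepare_headers and _handle_error are private SDK helpers, so this could break across versions):

import requests

def iter_crawled_pages(firecrawl_client, crawl_id):
    # Start at the first page; each response carries a "next" URL
    # until the crawled data is exhausted.
    next_page = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}"
    while next_page:
        response = requests.get(
            next_page, headers=firecrawl_client._prepare_headers()
        )
        if response.status_code != 200:
            firecrawl_client._handle_error(response, "get crawled data")
        body = response.json()
        # Yield each scraped document from this page
        for document in body.get("data", []):
            yield document
        next_page = body.get("next")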
gauthier · 4mo ago
Hello Micah, I ran into the same problem using check_crawl_status through the SDK. This function is not appropriate, and is misleading, when you try to handle crawl results page by page while the crawl is still running. I think I will have to do the same as you and make the calls through the API directly. Thanks for your code!
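Roughly what I have in mind (untested sketch; the skip offset semantics are my assumption based on the ?skip= parameter in the status URL above):

import time

import requests

def iter_pages_while_running(firecrawl_client, crawl_id, interval=5):
    # Poll the status endpoint, yielding documents as new pages appear,
    # until the crawl reports completion and no pages remain.
    skip = 0
    while True:
        url = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}?skip={skip}"
        response = requests.get(
            url, headers=firecrawl_client._prepare_headers()
        )
        if response.status_code != 200:
            firecrawl_client._handle_error(response, "get crawled data")
        body = response.json()
        documents = body.get("data", [])
        for document in documents:
            yield document
        skip += len(documents)
        if body.get("status") == "completed":
            if body.get("next"):
                continue  # drain the remaining pages without waiting
            return
        time.sleep(interval)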
