Python API doesn't provide a nice way to iterate over crawled data

Checking the crawl status (https://docs.firecrawl.dev/sdks/python#checking-crawl-status) gives a "next" link (e.g. https://api.firecrawl.dev/v1/crawl/789e6a93-81b6-44f5-9f0e-67f6263059e8?skip=0), but there doesn't appear to be a way to fetch that data through FirecrawlApp; instead I have to use a separate Python library like "requests" to make the request myself. Am I missing something?
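For reference, the raw status response looks roughly like this (values illustrative; field names as documented for the v1 crawl endpoint):

# GET https://api.firecrawl.dev/v1/crawl/{crawl_id}
{
    "status": "scraping",
    "total": 36,
    "completed": 10,
    "data": ["...scraped documents so far..."],
    "next": "https://api.firecrawl.dev/v1/crawl/789e6a93-81b6-44f5-9f0e-67f6263059e8?skip=10"
}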
rafaelmiller · 11mo ago
Hi @micah.stairs! Good catch! I think we missed this case, tbh. I'll work on adding support for fetching "next" links in the coming days. Currently, check_crawl_status takes the crawl id and returns the status, but it doesn't follow the "next" URLs. FYI, if you're using the SDK's crawl_url, it handles pagination automatically.
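For context, the two modes look roughly like this (a minimal sketch; exact return shapes vary by SDK version, so treat the dict access as illustrative):

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")

# Synchronous: crawl_url waits for the crawl to finish and follows the
# "next" links internally, so every page ends up in the result.
result = app.crawl_url("https://example.com")

# Asynchronous: async_crawl_url returns immediately with a crawl id;
# check_crawl_status then reports progress, but (at the time of this
# thread) did not follow the "next" links for you.
job = app.async_crawl_url("https://example.com")
status = app.check_crawl_status(job["id"])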
micah.stairs (OP) · 11mo ago
I'm using the async endpoint. I've cobbled together something pretty hacky, but I'm looking forward to cleaning up my code once this is ready.
rafaelmiller · 11mo ago
Hey @micah.stairs, it's in review now 🙂 https://github.com/mendableai/firecrawl/pull/880. Thank you again for the feedback!
micah.stairs (OP) · 11mo ago
@rafaelmiller can you clarify what that change does? It looks like, once the scraping is done, check_crawl_status will automatically fetch all of the scraped data. What if other users want a quick status check without downloading all of that data? Should this be configurable with a function parameter?

@rafaelmiller I'm also concerned that too much data will be fetched. I think this API needs more thoughtful design.

FYI, I was able to simplify the logic on my side by eliminating the check_crawl_status call entirely. Now I'm using the requests library to fetch the paginated data directly. My code looks something like this:
import requests

# If the page cursor is not provided, get the first page
if next_page is None:
    next_page = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}"

# Retrieve the paginated crawled data
response = requests.get(
    next_page, headers=self.firecrawl_client._prepare_headers()
)

# Handle errors while retrieving crawled data
if response.status_code != 200:
    self.firecrawl_client._handle_error(response, "get crawled data")
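The loop around that snippet ends up looking roughly like this (simplified sketch; _prepare_headers and _handle_error are private SDK helpers, so this could break across versions):

import requests

def iter_crawled_pages(firecrawl_client, crawl_id):
    # Start at the first page; each response carries a "next" URL
    # until the crawled data is exhausted.
    next_page = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}"
    while next_page:
        response = requests.get(
            next_page, headers=firecrawl_client._prepare_headers()
        )
        if response.status_code != 200:
            firecrawl_client._handle_error(response, "get crawled data")
        body = response.json()
        # Yield each scraped document from this page
        for document in body.get("data", []):
            yield document
        next_page = body.get("next")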
gauthier · 4mo ago
Hello Micah, I ran into the same problem using check_crawl_status through the SDK. This function is not appropriate, and is misleading, when you try to handle crawl results page by page while the crawl is still running. I think I will have to do the same as you and make the calls through the API directly. Thanks for your code!
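Roughly what I have in mind (untested sketch; the skip offset semantics are my assumption based on the ?skip= parameter in the status URL above):

import time

import requests

def iter_pages_while_running(firecrawl_client, crawl_id, interval=5):
    # Poll the status endpoint, yielding documents as new pages appear,
    # until the crawl reports completion and no pages remain.
    skip = 0
    while True:
        url = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}?skip={skip}"
        response = requests.get(
            url, headers=firecrawl_client._prepare_headers()
        )
        if response.status_code != 200:
            firecrawl_client._handle_error(response, "get crawled data")
        body = response.json()
        documents = body.get("data", [])
        for document in documents:
            yield document
        skip += len(documents)
        if body.get("status") == "completed":
            if body.get("next"):
                continue  # drain the remaining pages without waiting
            return
        time.sleep(interval)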
