Python API doesn't provide a nice way to iterate over crawled data
Checking the crawl status (https://docs.firecrawl.dev/sdks/python#checking-crawl-status) gives a "next" link (e.g. https://api.firecrawl.dev/v1/crawl/789e6a93-81b6-44f5-9f0e-67f6263059e8?skip=0), but there doesn't appear to be a way to use FirecrawlApp to fetch this data. Instead, I need to use a separate Python library like "requests" to make the request myself.
Am I missing something?
5 Replies
Hi @micah.stairs! Good catch! I think we missed this case tbh. I’ll work on adding support for fetching "next" links in the coming days. Currently, check_crawl_status will return the crawl id but doesn’t handle the "next" URLs.
FYI, if you’re using the SDK’s crawl_url, it handles pagination automatically.
I'm using the async endpoint. I've cobbled together something pretty hacky, but I'm looking forward to cleaning up my code once this is ready.
hey @micah.stairs it's in review now 🙂 https://github.com/mendableai/firecrawl/pull/880 thank you again for the feedback
@rafaelmiller can you clarify what that change does? It looks like if the scraping is done, it will automatically try to fetch all of the scraped data when I call check_crawl_status. What if other users want to check quickly without downloading all of that data? Should this be configurable with a function parameter?
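To make the question concrete, here is a rough sketch of what an opt-in flag could look like. It is only an illustration: check_status_sketch, the follow_next parameter and the fetch_page callable are hypothetical names, not part of the Firecrawl SDK, and the response is assumed to be a dict with "status", "data" and an optional "next" URL.

```python
from typing import Callable


def check_status_sketch(
    status_url: str,
    fetch_page: Callable[[str], dict],
    follow_next: bool = False,
) -> dict:
    """Return the crawl status; download every page only when asked to.

    Hypothetical helper, not the SDK's check_crawl_status.
    fetch_page is any callable that GETs a URL and returns the parsed JSON.
    """
    first = fetch_page(status_url)

    # Quick check: hand back the first response untouched.
    if not follow_next or first.get("status") != "completed":
        return first

    # Opt-in: follow "next" links and merge all documents into one list.
    documents = list(first.get("data") or [])
    next_url = first.get("next")
    while next_url:
        page = fetch_page(next_url)
        documents.extend(page.get("data") or [])
        next_url = page.get("next")

    return {**first, "data": documents, "next": None}
```

With the default follow_next=False, a status check costs a single request, so callers who only want to know whether the crawl is done never pay for the full download.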
@rafaelmiller I'm also concerned that too much data will be fetched. I think this API needs more thoughtful design.
FYI I was able to simplify the logic on my side (by eliminating the check_crawl_status call entirely). Now I'm just using the requests library to fetch the data directly. My code looks something like this:
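A minimal sketch of that approach, not the poster's exact code: it follows each "next" link with requests until none is left. It assumes every response is JSON with a "data" list and an optional "next" URL, and that the API key is sent as a Bearer token in the Authorization header. iter_crawl_documents and fetch_json are names made up for illustration.

```python
from typing import Callable, Iterator


def fetch_json(url: str, api_key: str) -> dict:
    """GET one page of crawl results and return the parsed JSON body."""
    import requests  # third-party: pip install requests

    response = requests.get(
        url,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def iter_crawl_documents(
    first_url: str,
    api_key: str,
    fetch: Callable[[str, str], dict] = fetch_json,
) -> Iterator[dict]:
    """Yield scraped documents one by one, following "next" links."""
    url = first_url
    while url:
        page = fetch(url, api_key)
        # "data" and "next" are the field names assumed for the response.
        yield from page.get("data") or []
        url = page.get("next")
```

Usage would be along the lines of `for doc in iter_crawl_documents("https://api.firecrawl.dev/v1/crawl/<crawl-id>", api_key): ...`, where `<crawl-id>` is the id returned when the crawl was started. Because it is a generator, only one page is held in memory at a time.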
Hello Micah, I ran into the same problem using check_crawl_status through the SDK. This function is misleading and not suitable for handling crawl results page by page while the crawl is still running.
I think I'll have to do the same as you and make the calls through the API directly. Thanks for your code.
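For the "page by page while the crawl is running" case, a polling loop along these lines might work. This is a sketch under assumptions that should be checked against the API: that the status URL accepts a skip offset (as in the "next" link quoted at the top of the thread), that it returns an empty "data" list when nothing new is available, and that "status" stays "scraping" until the crawl finishes. stream_crawl_documents and the fetch callable are hypothetical names.

```python
import time
from typing import Callable, Iterator


def stream_crawl_documents(
    crawl_url: str,
    fetch: Callable[[str], dict],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield documents as they become available during a running crawl.

    fetch is any callable that GETs a URL (with auth) and returns parsed JSON.
    """
    seen = 0  # number of documents already yielded
    while True:
        # Assumption: ?skip=N returns only documents after the first N.
        page = fetch(f"{crawl_url}?skip={seen}")
        documents = page.get("data") or []
        yield from documents
        seen += len(documents)

        if documents:
            continue  # more may already be waiting, so ask again right away
        if page.get("status") != "scraping":
            return  # nothing new and the crawl is no longer running
        sleep(poll_seconds)  # crawl still running: wait before polling again
```

Tracking the offset locally avoids re-downloading documents that were already processed, which is the main cost of calling the status endpoint repeatedly.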