Firecrawl · 16mo ago
Evan

Streaming Crawler Results & General Scalability

I'm looking at using the crawl API to traverse websites. I have permission to crawl these sites. Most of them are ~1000 pages. But some are 40k+. I don't see any listed limits on the crawl API. Can it handle returning this much data from the API? Is there a way to paginate? I do see there is a method for streaming responses, but it seems like if I miss parsing something from the stream, I might not be able to get back to it until the job is done? If I need to "stream" crawl results, am I better off just using the scraping API from within a streaming system/queue that I build myself?
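On the last point (driving the scrape API from your own queue), here is a minimal, hypothetical sketch of that approach in Python. It assumes Firecrawl's v1 scrape endpoint (`POST https://api.firecrawl.dev/v1/scrape`) with a bearer API key; the key, the response shape, and the concurrency value are placeholders you would adjust to your plan's rate limits.

```python
# Sketch: scrape sitemap URLs through your own worker queue instead of /crawl.
# Assumes Firecrawl's documented v1 scrape endpoint; API key is a placeholder.
import queue
import threading
import requests

API_KEY = "fc-YOUR-API-KEY"                      # placeholder
SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"

def worker(url_queue: queue.Queue, results: list) -> None:
    """Pull URLs off the queue and scrape them one at a time."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        resp = requests.post(
            SCRAPE_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"url": url, "formats": ["markdown"]},
            timeout=120,
        )
        if resp.ok:
            # Hand the document off to your own indexer here instead of collecting it.
            results.append(resp.json().get("data"))
        url_queue.task_done()

def scrape_sitemap(urls: list[str], concurrency: int = 4) -> list[dict]:
    """Scrape a list of sitemap URLs with a small pool of worker threads."""
    url_queue: queue.Queue = queue.Queue()
    for u in urls:
        url_queue.put(u)
    results: list[dict] = []
    threads = [threading.Thread(target=worker, args=(url_queue, results))
               for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The upside of this pattern is that backpressure and retries are entirely under your control; the downside is that you lose the crawler's link discovery and have to supply every URL yourself.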
2 Replies
Adobe.Flash · 16mo ago
Hi Evan, I would recommend splitting up the workload. I believe we have a set limit on /crawl of around 5,000 pages. @rafaelmiller can you confirm? If that's not in the docs, we will make sure to update it. If you know the structure of the website, I would recommend splitting your crawl up: if the website has around 2,000 /blog pages, crawl only those in one job and handle the other routes in separate jobs.
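A rough sketch of that "one job per route" idea, assuming Firecrawl's v1 crawl endpoint and its `includePaths` / `limit` parameters (names may differ on older API versions); the API key and path prefixes are placeholders.

```python
# Sketch: start one /v1/crawl job per section of the site, each capped by `limit`.
# Parameter names follow Firecrawl's v1 crawl API as documented; adjust if needed.
import requests

API_KEY = "fc-YOUR-API-KEY"                      # placeholder
CRAWL_URL = "https://api.firecrawl.dev/v1/crawl"

def start_split_crawls(base_url: str, path_prefixes: list[str],
                       per_job_limit: int = 5000) -> list[str]:
    """Start one crawl job per path prefix and return the job IDs."""
    job_ids = []
    for prefix in path_prefixes:
        resp = requests.post(
            CRAWL_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": base_url,
                "includePaths": [f"{prefix}/.*"],  # only crawl this section
                "limit": per_job_limit,
            },
            timeout=60,
        )
        resp.raise_for_status()
        job_ids.append(resp.json()["id"])          # assumed v1 response field
    return job_ids

# Example: crawl /blog and /docs as separate jobs.
# job_ids = start_split_crawls("https://example.com", ["/blog", "/docs"])
```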
Evan (OP) · 16mo ago
@Adobe.Flash how do I stream results from a crawl in real time? I see that there is a pseudo-cursor, but it looks like a sliding window that gets lost if my processing system does not keep up or lags behind. Ideally, I want to submit a sitemap for a crawl and stream results back for indexing into my own system as they become ready.
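One way to avoid losing documents to a lagging consumer is to poll the job status and persist every page of results immediately, then index from that local copy at your own pace. This sketch assumes Firecrawl's v1 status endpoint (`GET /v1/crawl/{id}`) returning `status`, `data`, and an optional `next` URL when the result set is paginated; field names and the API key are assumptions to verify against your API version.

```python
# Sketch: drain a crawl job's results into a local JSONL file as they become available.
# Assumes the v1 status endpoint's `data` / `next` / `status` fields; dedupe downstream
# (e.g. by metadata.sourceURL) in case repeated polls return overlapping documents.
import json
import time
import requests

API_KEY = "fc-YOUR-API-KEY"                      # placeholder
API_BASE = "https://api.firecrawl.dev/v1/crawl"

def drain_crawl(job_id: str, out_path: str, poll_seconds: int = 10) -> None:
    headers = {"Authorization": f"Bearer {API_KEY}"}
    next_url = f"{API_BASE}/{job_id}"
    with open(out_path, "a", encoding="utf-8") as out:
        while True:
            resp = requests.get(next_url, headers=headers, timeout=60)
            resp.raise_for_status()
            body = resp.json()
            # Persist whatever is available right now before doing anything else.
            for doc in body.get("data", []):
                out.write(json.dumps(doc) + "\n")
            if body.get("next"):
                next_url = body["next"]            # more pages of results to fetch
            elif body.get("status") == "completed":
                return                             # job finished and fully drained
            else:
                next_url = f"{API_BASE}/{job_id}"  # job still running; poll again
                time.sleep(poll_seconds)
```

Because everything lands on disk before indexing, a slow or crashed indexer can simply resume from the JSONL file rather than depending on the crawl job's in-flight window.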
