100% of a Domain's URLs?
Is there any way to make MAP return 100% of a domain's public URLs? I ask because, for certain domains, MAP returns less than half of the URLs.
Hi @Enzo Surreal, roughly how many URLs do you expect the domain you're testing to have?
Since the /map endpoint has a maximum limit of 30,000 URLs per request and no pagination, it can't return 100% of the URLs for large domains. If the number exceeds that limit (or the URLs are spread across subdomains), you can try /search with the site: query operator: https://docs.firecrawl.dev/api-reference/endpoint/search#supported-query-operators
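For reference, here's a minimal sketch of the /search + site: approach. It assumes the v1 REST base URL, a Bearer API key, and a `data` list of result objects with a `url` field in the response; check the linked docs for the exact shape on your version.

```python
# Sketch: enumerating a domain's URLs via /search with the "site:" operator.
# Endpoint path, payload keys, and response fields are assumptions based on
# the public docs; adjust for your API version.
import requests

API_KEY = "fc-YOUR_API_KEY"            # assumption: Firecrawl API key
BASE = "https://api.firecrawl.dev/v1"  # assumption: v1 REST base URL

resp = requests.post(
    f"{BASE}/search",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "query": "site:example.com",   # supported query operator per the docs link above
        "limit": 100,                  # assumption: results per request
    },
    timeout=60,
)
resp.raise_for_status()
urls = [item["url"] for item in resp.json().get("data", [])]
print(f"Found {len(urls)} URLs via /search")
```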
Actually, the /map endpoint now has a hard limit of 100k, not 30k.
Also, the /map endpoint is quite cheap and fast; however, it doesn't guarantee that all of the webpages will be found or that they are all still valid, since we don't actually do a full crawl of the website. Instead, you could consider using the /crawl endpoint for this use case. You can check out the documentation for it here.
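To make the limit discussion concrete, here is a hedged sketch of a plain /map request. It assumes the v1 REST shape: a flat `links` array in the response plus `limit` and `includeSubdomains` parameters (verify the exact names against the docs).

```python
# Sketch: a /map request with the limit raised toward the 100k hard cap
# mentioned above. Fast and cheap, but not guaranteed to be complete.
import requests

API_KEY = "fc-YOUR_API_KEY"  # assumption

resp = requests.post(
    "https://api.firecrawl.dev/v1/map",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com",
        "limit": 100000,              # current hard limit per the reply above
        "includeSubdomains": True,    # assumption: optional flag per docs
    },
    timeout=60,
)
resp.raise_for_status()
links = resp.json().get("links", [])
print(f"/map returned {len(links)} URLs (not guaranteed complete)")
```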
Also tossed a quick fix correcting the hard limit of /map in the docs as well - https://github.com/firecrawl/firecrawl-docs/pull/96 :)
Thanks, merged!
An example: a domain with 3,000 URLs returns only 150, or one with 587 returns fewer than 200. Most of the domains I map don't return anywhere near all of their URLs. I believe the problem is that MAP can't find all the URLs for some domains.
I think I'll have to use the crawl endpoint, which will be a bit more expensive than I expected, right? One suggestion would be for the team to offer two types of MAP endpoints: one for speed and one for quality that returns all of a domain's URLs.
My automation workflow is:
- The system starts the flow on a schedule (e.g., every day at 3:00 AM).
- The system reads the list of active sites from a database table called 'sites' and discovers all the page URLs for each auction lot in the domain using MAP.
- The system compares each discovered URL with the URLs already stored in the database in the 'lots' table, 'href' column, to avoid duplication.
- The system starts the extraction with Firecrawl v2, using Extract with the FIRE-1 agent, to extract the data from each URL using the prompt and the JSON schema created by the user.
- The extracted data is inserted into the 'lots' table.
There are approximately 1,000 auction URLs (the idea is to run the automation once a day just to find new URLs and extract their content).
Do you have any tips for me? Thank you for your time and help.
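As a rough illustration of the discovery + dedup steps in that workflow, here is a minimal sketch. It assumes the v1 /map response returns a flat `links` list of strings, uses sqlite3 purely as a stand-in for whatever database actually backs the 'sites' and 'lots' tables, and invents column names (`sites.url`, `sites.active`) that aren't specified in the message above.

```python
# Sketch of the dedup step: discover URLs with /map, then keep only those
# not already stored in the lots.href column.
import sqlite3
import requests

API_KEY = "fc-YOUR_API_KEY"  # assumption

def discover_urls(domain: str) -> set[str]:
    """Return the URLs /map finds for one auction site (v1 response shape assumed)."""
    resp = requests.post(
        "https://api.firecrawl.dev/v1/map",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": domain},
        timeout=60,
    )
    resp.raise_for_status()
    return set(resp.json().get("links", []))

def new_lot_urls(conn: sqlite3.Connection, domain: str) -> set[str]:
    """Discovered URLs minus those already present in lots.href."""
    known = {row[0] for row in conn.execute("SELECT href FROM lots")}
    return discover_urls(domain) - known

conn = sqlite3.connect("auctions.db")  # stand-in for the real database
for (site,) in conn.execute("SELECT url FROM sites WHERE active = 1"):
    todo = new_lot_urls(conn, site)
    print(f"{site}: {len(todo)} new lot URLs to extract")
```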
The /map endpoint is never going to be complete, since it primarily relies on sitemaps, SERP, and URLs we've recently found on that site. The only way to get full coverage is /crawl, and yes, that endpoint costs more.
What I'd recommend for your workflow is to use /crawl and JSON mode instead of Extract and FIRE-1. This should be a lot cheaper overall than your /map + /extract approach (FIRE-1 burns through a non-deterministic number of tokens).
/crawl with JSON mode across 1,000 auction URLs is only going to cost you 5,000 credits per day.
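For anyone following along, a rough sketch of what the /crawl + JSON mode setup might look like. The asynchronous submit/poll pattern follows the public docs, but the JSON-mode key names below (`formats`, `jsonOptions`, where the extracted object lands in the result) differ between API versions, so treat the payload as illustrative and confirm against the crawl documentation.

```python
# Sketch: crawl an auction site and extract structured lot data with JSON mode.
# The crawl endpoint is async: submit a job, then poll its status URL.
import time
import requests

API_KEY = "fc-YOUR_API_KEY"            # assumption
BASE = "https://api.firecrawl.dev/v1"  # assumption: v1 REST base URL
headers = {"Authorization": f"Bearer {API_KEY}"}

payload = {
    "url": "https://example-auction-site.com",
    "limit": 1000,
    "scrapeOptions": {
        "formats": ["json"],              # assumption: JSON mode format name
        "jsonOptions": {                  # assumption: v1-style key
            "prompt": "Extract the auction lot data from this page.",
            "schema": {                   # hypothetical lot schema
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "current_bid": {"type": "number"},
                    "ends_at": {"type": "string"},
                },
            },
        },
    },
}

job = requests.post(f"{BASE}/crawl", headers=headers, json=payload, timeout=60).json()
job_id = job["id"]

# Poll until the crawl finishes, then read the extracted JSON per page.
while True:
    status = requests.get(f"{BASE}/crawl/{job_id}", headers=headers, timeout=60).json()
    if status.get("status") == "completed":
        break
    time.sleep(10)

for page in status.get("data", []):
    print(page.get("json"))  # assumption: extracted fields live under "json"
```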
Thank you very much mate! Will do that!