maxDiscoveryDepth

Hello, I would like to ask how maxDiscoveryDepth works. Right now I am trying maxDiscoveryDepth: 2 and limit: 10 on https://books.toscrape.com/ to test these parameters, but I somehow don't get it. The results looked like this:
"data": [
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books_1/index.html",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
},
# ... 7 pages...
]
"data": [
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books_1/index.html",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
},
# ... 7 pages...
]
Does this follow the maxDiscoveryDepth that I set?
Gaurav Chadha · 3w ago
Hi @edsaur, maxDiscoveryDepth is the maximum number of "hops" from the first/parent page that Firecrawl will follow when discovering new URLs. If you set maxDiscoveryDepth: 2:
Depth 0 → only the first/parent page, e.g. https://books.toscrape.com
Depth 1 → the first/parent page plus pages directly linked from it, e.g. https://books.toscrape.com/catalogue/category/books/travel_2/index.html
Depth 2 → the first/parent page → pages linked from it → pages linked from those.
limit is the total number of pages it will fetch, even if more are available. I hope this answers how maxDiscoveryDepth works - https://docs.firecrawl.dev/advanced-scraping-guide#maxdiscoverydepth
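For reference, here is a minimal sketch of the crawl request being discussed, using Python's requests library against the hosted v1 API (endpoint and parameter names follow the docs linked above; the API key is a placeholder):

import requests

API_KEY = "fc-YOUR_API_KEY"  # placeholder, not a real key

resp = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://books.toscrape.com/",
        "maxDiscoveryDepth": 2,  # follow links at most 2 hops from the start page
        "limit": 10,             # fetch at most 10 pages total, even if more are discovered
    },
)
print(resp.json())  # the crawl runs asynchronously; the response includes a job id to poll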
edsaur (OP) · 3w ago
Would the "links" be the depth-2 pages? Because I think it's purely https://books.toscrape.com/catalogue/category/books/travel_2/index.html that I receive with a depth of two and a limit of 10... So to get the other pages, the limit should be more than 10, right? Thank you so much! I am just so new to this T_T
Gaurav Chadha · 3w ago
Yes, correct.
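In other words (same sketch assumptions as above), a larger limit leaves page budget for the depth-2 pages, for example:

# Illustrative values: with limit: 10 the budget was spent on depth-0/1 pages
# before depth 2 was reached, so raise the cap to let deeper pages through.
crawl_params = {
    "url": "https://books.toscrape.com/",
    "maxDiscoveryDepth": 2,
    "limit": 100,
}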
edsaur (OP) · 3w ago
Thanks a lot for the help @Gaurav Chadha! Another question: since we have the "prompt" in scraping, and I believe in the crawl endpoint too, if a website is in a language other than English, could I prompt it to translate what we are scraping? And does Firecrawl support multilingual sites?
micah.stairs · 3w ago
You could use JSON mode to get translated content. But the /crawl prompt is just there to automatically generate the crawl parameters for you.
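As a sketch of the JSON-mode idea (assuming the v1 /scrape endpoint's "json" format with a jsonOptions prompt; the URL and prompt wording here are made up for illustration):

import requests

API_KEY = "fc-YOUR_API_KEY"  # placeholder

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/non-english-page",  # hypothetical multilingual page
        "formats": ["json"],
        "jsonOptions": {
            # an LLM produces this output, so you can ask it to translate as it extracts
            "prompt": "Extract the page's main content, translated into English.",
        },
    },
)
print(resp.json())  # the structured/translated output comes back alongside the usual scrape data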
edsaur (OP) · 3w ago
Thanks! Do I need OPENAI_BASE_URL at all? I tried /extract and it gave me an error saying that I don't have any BASE_URL. I am using a self-hosted instance, btw.
Gaurav Chadha · 3w ago
Yes. /extract performs structured data extraction from scraped content using an LLM, so OPENAI_BASE_URL is required in a self-hosted environment.
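For a self-hosted setup that would mean something like the following in your .env (values are placeholders; the variable names are the ones mentioned in this thread):

# Point this at api.openai.com or any other OpenAI-compatible endpoint
OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=https://api.openai.com/v1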
edsaur (OP) · 3w ago
Noted sir, thank you so much! But for the normal /scrape and /crawl prompt we just need OPENAI_API_KEY, right?
Gaurav Chadha · 3w ago
Only if you need to use the LLM; otherwise you can skip it.
