maxDiscoveryDepth

Hello, I would like to ask how maxDiscoveryDepth works. Right now I am trying maxDiscoveryDepth: 2 and limit: 10 on https://books.toscrape.com/ to test these parameters, but I somehow don't get it. The results looked like this:
"data": [
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books_1/index.html",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
},
# ... 7 pages...
]
"data": [
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books_1/index.html",
},
{
"links": [ ... },
"metadata": {
"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
},
# ... 7 pages...
]
Does this follow the maxDiscoveryDepth that I set?
Gaurav Chadha · 3w ago
Hi @edsaur, maxDiscoveryDepth is the maximum number of "hops" from the first/parent page that Firecrawl will follow when discovering new URLs. If you set maxDiscoveryDepth: 2:
Depth 0 → only the first/parent page, e.g. https://books.toscrape.com
Depth 1 → the first/parent page plus pages directly linked from it, e.g. https://books.toscrape.com/catalogue/category/books/travel_2/index.html
Depth 2 → the first/parent page → pages linked from it → pages linked from those.
limit is the total number of pages it will fetch, even if more are available. I hope this answers how maxDiscoveryDepth works - https://docs.firecrawl.dev/advanced-scraping-guide#maxdiscoverydepth
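For reference, here is a minimal sketch of the crawl request being discussed, using Python's requests library against the hosted v1 API (endpoint and parameter names follow the docs linked above; the API key is a placeholder):

import requests

API_KEY = "fc-YOUR_API_KEY"  # placeholder, not a real key

resp = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://books.toscrape.com/",
        "maxDiscoveryDepth": 2,  # follow links at most 2 hops from the start page
        "limit": 10,             # fetch at most 10 pages total, even if more are discovered
    },
)
print(resp.json())  # the crawl runs asynchronously; the response includes a job id to poll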
edsaur (OP) · 3w ago
Would the "links" be the depth-2 pages? Because I think it's purely https://books.toscrape.com/catalogue/category/books/travel_2/index.html that I receive with a depth of two and a limit of 10... So to get the other pages, the limit should be more than 10, right? Thank you so much! I am just so new to this T_T
Gaurav Chadha · 3w ago
Yes, correct.
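In other words (same sketch assumptions as above), a larger limit leaves page budget for the depth-2 pages, for example:

# Illustrative values: with limit: 10 the budget was spent on depth-0/1 pages
# before depth 2 was reached, so raise the cap to let deeper pages through.
crawl_params = {
    "url": "https://books.toscrape.com/",
    "maxDiscoveryDepth": 2,
    "limit": 100,
}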
edsaur (OP) · 3w ago
Thanks a lot for the help @Gaurav Chadha! Another question: since we have the "prompt" in scraping, and I believe in the crawl endpoint too, if a website is in a language other than English, could I prompt it to translate what we are scraping? And does Firecrawl support multilingual sites?
micah.stairs · 3w ago
You could use JSON mode to get translated content. But the /crawl prompt is just there to automatically generate the crawl parameters for you.
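As a sketch of the JSON-mode idea (assuming the v1 /scrape endpoint's "json" format with a jsonOptions prompt; the URL and prompt wording here are made up for illustration):

import requests

API_KEY = "fc-YOUR_API_KEY"  # placeholder

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/non-english-page",  # hypothetical multilingual page
        "formats": ["json"],
        "jsonOptions": {
            # an LLM produces this output, so you can ask it to translate as it extracts
            "prompt": "Extract the page's main content, translated into English.",
        },
    },
)
print(resp.json())  # the structured/translated output comes back alongside the usual scrape data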
edsaur (OP) · 3w ago
Thanks! Do I need OPENAI_BASE_URL at all? I tried /extract and it gave me an error saying that I don't have any BASE_URL. I am using a self-hosted instance, btw.
Gaurav Chadha · 3w ago
Yes. /extract performs structured data extraction from scraped content using an LLM, so OPENAI_BASE_URL is required in a self-hosted environment.
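For a self-hosted setup that would mean something like the following in your .env (values are placeholders; the variable names are the ones mentioned in this thread):

# Point this at api.openai.com or any other OpenAI-compatible endpoint
OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=https://api.openai.com/v1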
edsaur (OP) · 3w ago
Noted sir, thank you so much! But for the normal /scrape and /crawl prompt we just need OPENAI_API_KEY, right?
Gaurav Chadha · 3w ago
Only if you need to use the LLM; otherwise you can skip it.
