Optimizing Firecrawl API for Extracting Embedding-Ready Data from GitHub Markdown Repos
I'm looking for advice on using the Firecrawl API effectively to process Markdown files from a GitHub documentation repo, with the goal of generating JSON output suitable for creating embeddings. I've noticed that the default metadata extraction is fairly limited, and I suspect I'm not setting the parameters optimally. To improve metadata quality, I'm considering a custom scrape prompt, but I'm unsure of the best approach.
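For context, here's roughly the post-processing I have in place so far. It's a minimal sketch that assumes a v1-style scrape response shape (a `markdown` field plus a `metadata` dict with keys like `title` and `description`); the exact field names may differ depending on your Firecrawl SDK version, so treat the keys as assumptions:

```python
import json

def to_embedding_record(scrape_result: dict, source_url: str) -> dict:
    """Flatten a Firecrawl-style scrape result into one embedding-ready JSON record.

    Assumes the response carries a `markdown` field and a `metadata` dict
    (title, description, ...) -- adjust the keys if your SDK version differs.
    """
    meta = scrape_result.get("metadata", {})
    return {
        "id": source_url,
        "text": scrape_result.get("markdown", ""),
        "title": meta.get("title", ""),
        "description": meta.get("description", ""),
        "source": source_url,
    }

# Example with a mocked response (no API call, no key needed):
mock = {"markdown": "# Install\nRun `pip install x`.", "metadata": {"title": "Install"}}
record = to_embedding_record(mock, "https://github.com/org/repo/blob/main/docs/install.md")
print(json.dumps(record, indent=2))
```

The question is really about what to feed this step, i.e. how to get richer `metadata` out of Firecrawl in the first place.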
Specifically, I’m curious about:
Crawling Strategy: Should I have Firecrawl crawl the entire repository at once, or is it more effective to scrape individual files separately when the goal is extracting meaningful content?
Optimizing for Embeddings: Which parameters, prompts, or configurations best ensure Firecrawl extracts the most relevant data from each .md file for embedding generation?
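On the first question, one option I've considered is skipping the full-repo crawl entirely and scraping each file's raw URL instead, so the results don't pick up GitHub's page chrome (nav, sidebar, footer). A minimal sketch of the URL construction I'd use — `org`, `docs-repo`, and the paths are placeholders, and this assumes the standard raw.githubusercontent.com URL pattern:

```python
def raw_doc_urls(owner: str, repo: str, branch: str, paths: list[str]) -> list[str]:
    """Build raw.githubusercontent.com URLs for per-file scraping.

    Scraping raw files avoids the GitHub page chrome that a full-site
    crawl would otherwise pull into every result. Non-Markdown paths
    are filtered out.
    """
    base = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}"
    return [f"{base}/{p.lstrip('/')}" for p in paths if p.endswith(".md")]

# Placeholder repo and file list for illustration:
urls = raw_doc_urls("org", "docs-repo", "main",
                    ["docs/intro.md", "README.md", "img/logo.png"])
print(urls)
```

I'd then feed each URL to a single-page scrape rather than a crawl — but I don't know if that loses anything the crawler's link discovery would otherwise give me.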
Any insights on maximizing data extraction from Markdown files for this purpose would be greatly appreciated!
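P.S. For reference, the downstream chunking I'm planning looks roughly like this — nothing Firecrawl-specific, just a sketch that splits the returned Markdown on headings so each embedded chunk keeps its own heading as context:

```python
import re

def chunk_by_heading(markdown: str, max_chars: int = 2000) -> list[dict]:
    """Split a Markdown document into heading-scoped chunks for embedding.

    Each chunk keeps its nearest heading so the embedding retains context;
    bodies are truncated to max_chars to stay within embedding input limits.
    """
    chunks: list[dict] = []
    current_heading = ""
    current_lines: list[str] = []

    def flush():
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"heading": current_heading, "text": body[:max_chars]})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # ATX heading starts a new chunk
            flush()
            current_heading = line.lstrip("#").strip()
            current_lines = [line]
        else:
            current_lines.append(line)
    flush()
    return chunks

doc = "# Setup\nInstall deps.\n\n## Usage\nRun the CLI."
chunks = chunk_by_heading(doc)
for c in chunks:
    print(c["heading"], "->", len(c["text"]), "chars")
```

So the metadata I'm hoping to get from Firecrawl would be attached per chunk, which is why richer extraction matters to me.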