Firecrawl

Join the community to ask questions about Firecrawl and get answers from other members.

Self-host: unable to scrape/crawl, "Unauthorized" error

Hi all, I'm trying to self-host on an Ubuntu system. Every time I run a crawl or scrape cURL request, I get {"error":"Unauthorized"}, regardless of which URL I'm trying to crawl/scrape. My .env is basically a copy/paste of the example, except I've set USE_DB_AUTHENTICATION=false, since I'm not using Supabase. I've also opened ports 6379 and 3002 in my firewall, so I can't see why there would be any permissions issues....
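One thing worth checking: even with database auth disabled, the self-hosted API may still expect an Authorization header on every request, and a missing header alone can produce "Unauthorized". A minimal sketch, assuming the v0 endpoint path, the default port 3002, and that any placeholder Bearer token is accepted when USE_DB_AUTHENTICATION=false:

```python
import json

BASE_URL = "http://localhost:3002"  # assumption: default self-hosted port

def build_scrape_request(url: str, token: str = "fc-test") -> dict:
    """Assemble endpoint, headers, and body for a /v0/scrape call.

    The placeholder token "fc-test" is hypothetical; the point is that
    the Authorization header must be present at all.
    """
    return {
        "endpoint": f"{BASE_URL}/v0/scrape",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"url": url}),
    }

req = build_scrape_request("https://example.com")
# e.g. requests.post(req["endpoint"], headers=req["headers"], data=req["body"])
```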

Scrape a single page with 'loading content'

Tried to scrape a single page and only got the header and footer. I added the waitFor option, but it doesn't seem to work? Tried again on the playground, but there's no waitFor option there? https://www.firecrawl.dev/app/playground?url=https%3A%2F%2Fchannelstore.roku.com%2Fdetails%2F7fa2b9b3df2e3227f26917c9e5570be0%2Fbbc-america...
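For reference, a sketch of the request body this kind of scrape would use, assuming the v0 API where waitFor (in milliseconds) sits under pageOptions; the URL and values are placeholders:

```python
def scrape_payload(url: str, wait_ms: int = 5000) -> dict:
    """Build a v0 scrape body that delays capture for client-side rendering."""
    return {
        "url": url,
        "pageOptions": {
            "waitFor": wait_ms,        # wait before capturing the page
            "onlyMainContent": False,  # keep the full page, not just main content
        },
    }

payload = scrape_payload("https://example.com/app-page", 8000)
```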

Discrepancy in the count of URLs returned by the FireCrawl API v/s actual sitemap of the Website

I'm trying to get the list of URLs present in the sitemap of a website using the FireCrawl API with the following parameters in CRAWL mode - **params = { 'crawlerOptions': { "returnOnlyUrls": True }...
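For anyone comparing counts: one possible cause of a discrepancy is a default page limit on the crawl, or the crawler discovering URLs by following links rather than reading the sitemap alone. A hedged sketch of the params, with the limit value purely illustrative:

```python
params = {
    "crawlerOptions": {
        "returnOnlyUrls": True,  # ask the crawler for the URL list only
        "limit": 5000,           # assumption: raise the default page cap
    }
}
```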

scrape job status

Hi! I noticed that when scraping websites with a lot of pages (>1000), the scrape job (or at least its status) gets stuck in a state where I can no longer tell what is going on. For example, right now I have a running job (limited to a max of 1000 scrape URLs), and if I fetch the status using the API, I get the following data:...
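The polling pattern in question looks roughly like this; the endpoint path (/v0/crawl/status/{job_id}) and the "status"/"current"/"total" field names are assumptions from the v0 API, and the fetcher here is stubbed out so the loop itself is visible:

```python
import time

def poll_status(fetch_status, job_id: str, interval_s: float = 0.0):
    """Call fetch_status(job_id) -> dict until a terminal state is reached."""
    terminal = {"completed", "failed"}
    while True:
        status = fetch_status(job_id)
        if status.get("status") in terminal:
            return status
        time.sleep(interval_s)

# Stubbed responses for illustration; a real fetcher would GET
# {BASE_URL}/v0/crawl/status/{job_id} with an HTTP client.
states = iter([
    {"status": "active", "current": 950, "total": 1000},
    {"status": "completed", "current": 1000, "total": 1000},
])
final = poll_status(lambda _id: next(states), "job-123")
```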

408 Timeouts For LLM Extraction

I'm getting 408s repeatedly when trying to use LLM extraction on a page that is ~6k tokens of input markdown. Any guidance here?

Streaming Crawler Results & General Scalability

I'm looking at using the crawl API to traverse websites. I have permission to crawl these sites. Most of them are ~1000 pages. But some are 40k+. I don't see any listed limits on the crawl API. Can it handle returning this much data from the API? Is there a way to paginate? I do see there is a method for streaming responses, but it seems like if I miss parsing something from the stream, I might not be able to get back to it until the job is done? If I need to "stream" crawl results, am I better off just using the scraping API from within a streaming system/queue that I build myself?...

Self hosting

I want to try Firecrawl self-hosted. I tried the instructions on GitHub, but nothing shows up on port 3002. Is there a detailed guide on this?

download markdown .md instead of json?

Is there an easy way to download all pages as Markdown instead of JSON, rather than copying them manually?
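Until there's a built-in export, a small post-processing script works: assuming each page in the crawl result carries a "markdown" field and a metadata "sourceURL" (field names assumed from the v0 response shape), write one .md file per page:

```python
import pathlib
import re

def dump_markdown(pages, out_dir="pages"):
    """Write each page's markdown to <out_dir>/<slugified-url>.md."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for i, page in enumerate(pages):
        url = page.get("metadata", {}).get("sourceURL", f"page-{i}")
        # Slugify the URL so it is a safe filename
        name = re.sub(r"[^A-Za-z0-9]+", "-", url).strip("-") or f"page-{i}"
        path = out / f"{name}.md"
        path.write_text(page.get("markdown", ""), encoding="utf-8")
        written.append(path)
    return written

files = dump_markdown(
    [{"markdown": "# Hi", "metadata": {"sourceURL": "https://example.com/a"}}],
    out_dir="md_out",
)
```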

174 chrome instances running on firecrawl droplet

Hi, I just observed the resource usage of my self-hosted deployment of Firecrawl. I'm using Docker Compose to deploy Firecrawl on a DO droplet. I saw 100% CPU usage and was very suspicious. I SSH'ed into the droplet and found 174 Chrome instances running; it almost seems like the Chrome instances aren't cleaned up properly. Is anyone having a similar issue, or is this somewhat of a known issue?...

Not showing crawled Title

https://github.com/langgenius/dify/issues/5404 Hello Firecrawl dev team. I am pinkbanana from Dify.AI; we recently integrated this awesome tool into our product, and we noticed that the title is somehow missing from the crawled data. I have also checked the logs from the firecrawl.dev activity panel. See the metadata in the logs....

Error 500 for LLM endpoint in Clay

Hi team, I'm getting an error 500 when setting up the LLM endpoint in Clay. I've heard from others who have had the same issue. Has anyone been able to get it to work?...

API Key for Self Hosted Firecrawl

Hi, I have deployed the self-hosted version of Firecrawl on my system to link it with Dify. Dify requires an API key to integrate. The self-hosted version doesn't come with any API keys; is there a way to generate one? Please help me solve this issue. Thanks in advance...

excluding file type

Hey everyone, can the excludes option be used to exclude file types like PDFs? ```json...
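A sketch of what that might look like, assuming excludes under crawlerOptions accepts path patterns; whether extension patterns like these are honored for file types is exactly the open question here, so treat the values as hypothetical:

```python
params = {
    "crawlerOptions": {
        # Hypothetical patterns intended to skip PDF and Word documents
        "excludes": [".*\\.pdf$", ".*\\.docx$"],
    }
}
```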

Are PDFs expected to be buffers?

When trying to scrape a PDF URL, https://www.nasa.gov/wp-content/uploads/2017/05/ochoa.pdf, we just get back a buffer rather than text

Get all website url before crawl

Hi, I wonder how to get the URL list of a website before crawling, like on the Mendable app?

license

I am considering locally hosting Firecrawl. Firecrawl is licensed under the AGPL. If I use Firecrawl with Dify, which is also locally hosted, will the AGPL conditions apply to Dify? Additionally, will the AGPL conditions propagate to the LLM applications created on Dify that use Firecrawl?

Crawl not working on self-hosted with external redis (Upstash)

The Bull dashboard doesn't seem to show the web scraper queue when it is connected to Upstash Redis. I did not set up anything like IP whitelisting that should block the connection, and I got no concerning log messages ```...

Crawler Help

Hi, I discovered Firecrawl yesterday and I'd like your help extracting data from a website. Essentially, I'd like to extract all methods and their parameters from the Elevenlabs documentation (https://elevenlabs.io/docs/api-reference/text-to-speech) and create a complete JSON with the data

Awesome tool! How do I extract the text only?

Great tool here, excited to start testing. I don't need the markdown, and I don't want the links, image-hosting URLs, etc. How do I get just the text content out of the page?...
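One lightweight option while waiting for a built-in text-only output: strip common Markdown syntax (images, links, headings, emphasis) from the returned markdown field yourself. A minimal sketch; the regexes below are deliberately simple and won't cover every Markdown construct:

```python
import re

def markdown_to_text(md: str) -> str:
    """Reduce a Markdown string to approximate plain text."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)        # drop images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # links -> label only
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.M)    # strip heading marks
    text = re.sub(r"[*_`]{1,3}", "", text)                # strip emphasis/code marks
    return text.strip()

plain = markdown_to_text("# Title\nSee [docs](https://example.dev) for **more**.")
```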