Stuck with the Python SDK, works on and off with curl

I self-hosted a Firecrawl instance yesterday. It worked immediately with a curl request to /v1/crawl, which was a nice surprise! And then...

1. curl

I built playwright-ts, and at first it didn't work because I copied PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/scrape from SELF_HOST.md into my env file without really thinking, which is obviously incorrect for my setup. After I fixed it, it works with this example request:
{
"url": "https://news.ycombinator.com",
"formats": [
"extract"
],
"extract": {
"prompt": "Top 5 stories"
}
}
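(For reference, I'm sending it roughly like this; a sketch, where port 3002 is my self-hosted API and the Bearer value is just a dummy key since auth is disabled:)

curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer no' \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["extract"],
    "extract": { "prompt": "Top 5 stories" }
  }'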
Then, without my changing anything on the server, this same request (and crawl requests to any website) stopped working. I keep getting
{
"success": false,
"error": "Request timed out"
}
from the LLM scrape request. (My LLM API key is definitely right, otherwise it wouldn't have worked the first time.) THEN I RESTARTED ALL DOCKER SERVICES: /v1/crawl works again, but /v1/scrape still doesn't.

2. Python SDK

I put curl aside for now and went on to try the Python SDK. After looking into the code, I changed example.py like this:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="no", api_url="http://localhost:3002")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev')
print(scrape_result['markdown'])

# Crawl a website:
# idempotency_key = str(uuid.uuid4()) # optional idempotency key
crawl_result = app.crawl_url('firecrawl.dev', {'excludePaths': ['blog/*']}, 2)
print(crawl_result)
This should be correct, right? But I never got it working:
- crawling gets stuck at the sitemap:
[2024-09-06T01:53:50.254Z]DEBUG - Fetching sitemap links from http://firecrawl.dev
- the scrape request times out:
ERROR - Error in scrapeController: Error: Job wait
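(To separate SDK problems from server problems, I can hit the endpoint directly again; a sketch, same base URL and dummy token as my curl request above:)

curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer no' \
  -d '{"url": "https://firecrawl.dev", "formats": ["markdown"]}'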
PS. my worker container keeps printing:
worker-1 | Cant accept connection
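(That's straight from the compose output; to watch just that service, something like this also works:)

docker compose logs -f worker   # tail only the worker service's logs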
I could deal with outright failure, but I'm really struggling with these inconsistencies... help please, thanks a million!
10 Replies
mogery · 13mo ago
Hi @thousandmiles -- "Cant accept connection" usually means the CPU/RAM usage is too high for the worker to take on new jobs. We use this metric in production to make sure our worker machines never get too overloaded. If you have tighter margins on your end, you can adjust the MAX_CPU and MAX_RAM environment variables (both are percentages; our defaults are 0.8 = 80% for each).
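For example, in the same env file as the rest of the Firecrawl config (these are the defaults; lower them if the worker should back off sooner):

MAX_CPU=0.8
MAX_RAM=0.8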
thousandmiles (OP) · 13mo ago
That's definitely a possibility... the server I'm using has only 1 GB of memory. I'll try deploying on better hardware.
mogery · 13mo ago
That'll be the issue. In production, each of our workers has 8 GB of RAM.
thousandmiles (OP) · 13mo ago
Now I've got it up and running on a server with 4 cores / 4 GB of RAM. The weird thing is: when I launched a crawl task of about 60+ pages, playwright-ts actually ran fine, but the same "Cant accept connection" appeared. In less than a minute the task almost finished with
"success": true,
"status": "scraping",
"completed": 62,
"total": 63,
"success": true,
"status": "scraping",
"completed": 62,
"total": 63,
but it refuses to accept any new tasks... Can I simply raise the MAX_CPU and MAX_RAM limits to at least let the worker keep working? I really don't have powerful enough hardware in the cloud for now.
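(The status block above is from polling the crawl job; a sketch, with <job-id> standing in for whatever the POST to /v1/crawl returned:)

curl "http://localhost:3002/v1/crawl/<job-id>" \
  -H 'Authorization: Bearer no'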
mogery · 13mo ago
You can set them as environment variables. MAX_CPU=1 and MAX_RAM=1 will essentially disable the checks (the worker only stops at 100%).
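A sketch of the env file change plus a restart so the worker picks it up:

# in the Firecrawl .env
MAX_CPU=1
MAX_RAM=1

# then recreate the containers so the new values take effect
docker compose up -d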
thousandmiles (OP) · 13mo ago
Thanks, the env vars are set now. I'll keep an eye on it for a while. Or should I reduce NUM_WORKERS_PER_QUEUE to cut the memory and CPU usage?
mogery · 13mo ago
Alright. You might want to monitor CPU and RAM usage as well so you can correlate issues with redlining. We don't use that variable anymore; it's been replaced by MAX_CPU and MAX_RAM.
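Even just watching the host while a crawl runs will show whether the worker is pinned at the limits; a sketch:

docker stats   # live per-container CPU / memory usage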
thousandmiles (OP) · 13mo ago
Oh! But it's still in the SELF_HOST guide. Where can I get an up-to-date example of the full set of env vars?
mogery · 13mo ago
Nowhere, unfortunately. We want to update the guide at some point, but we're super busy maintaining v1 right now. That change is mostly it, though.
thousandmiles (OP) · 13mo ago
no worries then, functionality is the key, bravo!
