Stuck with the Python SDK, works on and off with curl

I self-hosted a Firecrawl instance yesterday. It worked immediately with a curl request to /v1/crawl, which was a nice surprise! And then...

1. curl

I built playwright-ts, and at first it didn't work because I copied PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/scrape from SELF_HOST.md into my env file without really thinking, which is obviously incorrect for my setup. After I fixed it, it works with this example request:
{
"url": "https://news.ycombinator.com",
"formats": [
"extract"
],
"extract": {
"prompt": "Top 5 stories"
}
}
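(For reference, I'm sending it roughly like this; a sketch, where port 3002 is my self-hosted API and the Bearer value is just a dummy key since auth is disabled:)

curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer no' \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["extract"],
    "extract": { "prompt": "Top 5 stories" }
  }'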
Then, without my changing anything on the server, this same request (and crawl requests to any website) stopped working. I keep getting
{
"success": false,
"error": "Request timed out"
}
from the LLM scrape request. (My LLM API key is definitely right, otherwise it wouldn't have worked the first time.) THEN I RESTARTED ALL DOCKER SERVICES: /v1/crawl works again, but /v1/scrape still doesn't.

2. Python SDK

I put curl aside for now and went on to try the Python SDK. After looking into the code, I changed example.py like this:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="no", api_url="http://localhost:3002")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev')
print(scrape_result['markdown'])

# Crawl a website:
# idempotency_key = str(uuid.uuid4()) # optional idempotency key
crawl_result = app.crawl_url('firecrawl.dev', {'excludePaths': ['blog/*']}, 2)
print(crawl_result)
This should be correct, right? But I never got it working:
- crawling gets stuck at the sitemap:
[2024-09-06T01:53:50.254Z]DEBUG - Fetching sitemap links from http://firecrawl.dev
- the scrape request times out:
ERROR - Error in scrapeController: Error: Job wait
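(To separate SDK problems from server problems, I can hit the endpoint directly again; a sketch, same base URL and dummy token as my curl request above:)

curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer no' \
  -d '{"url": "https://firecrawl.dev", "formats": ["markdown"]}'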
PS. my worker container keeps printing:
worker-1 | Cant accept connection
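(That's straight from the compose output; to watch just that service, something like this also works:)

docker compose logs -f worker   # tail only the worker service's logs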
I could deal with outright failure, but I'm really struggling with these inconsistencies... help please, thanks a million!
10 Replies
mogery · 13mo ago
Hi @thousandmiles -- "Cant accept connection" usually means the CPU/RAM usage is too high for the worker to take on new jobs. We use this metric in production to make sure our worker machines never get too overloaded. If you have tighter margins on your end, you can adjust the MAX_CPU and MAX_RAM environment variables (both are percentages; our defaults are 0.8 = 80% for each).
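For example, in the same env file as the rest of the Firecrawl config (these are the defaults; lower them if the worker should back off sooner):

MAX_CPU=0.8
MAX_RAM=0.8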
thousandmiles (OP) · 13mo ago
That's definitely a possibility... the server I'm using has only 1 GB of memory. I'll try deploying on better hardware.
mogery · 13mo ago
That'll be the issue. In production, each of our workers has 8 GB of RAM.
thousandmiles (OP) · 13mo ago
Now I've got it up and running on a server with 4 cores / 4 GB of RAM. The weird thing is: when I launched a crawl task of about 60+ pages, playwright-ts actually ran fine, but the same "Cant accept connection" appeared. In less than a minute the task almost finished with
"success": true,
"status": "scraping",
"completed": 62,
"total": 63,
"success": true,
"status": "scraping",
"completed": 62,
"total": 63,
but it refuses to accept any new tasks... Can I simply raise the MAX_CPU and MAX_RAM limits to at least let the worker keep working? I really don't have powerful enough hardware in the cloud for now.
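(The status block above is from polling the crawl job; a sketch, with <job-id> standing in for whatever the POST to /v1/crawl returned:)

curl "http://localhost:3002/v1/crawl/<job-id>" \
  -H 'Authorization: Bearer no'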
mogery · 13mo ago
You can set them as environment variables. MAX_CPU=1 and MAX_RAM=1 will essentially disable the checks (the worker only stops at 100%).
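A sketch of the env file change plus a restart so the worker picks it up:

# in the Firecrawl .env
MAX_CPU=1
MAX_RAM=1

# then recreate the containers so the new values take effect
docker compose up -d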
thousandmiles (OP) · 13mo ago
Thanks, the env vars are set now. I'll keep an eye on it for a while. Or should I reduce NUM_WORKERS_PER_QUEUE to cut the memory and CPU usage?
mogery · 13mo ago
Alright. You might want to monitor CPU and RAM usage as well so you can correlate issues with redlining. We don't use that variable anymore; it's been replaced by MAX_CPU and MAX_RAM.
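Even just watching the host while a crawl runs will show whether the worker is pinned at the limits; a sketch:

docker stats   # live per-container CPU / memory usage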
thousandmiles (OP) · 13mo ago
Oh! But it's still in the SELF_HOST guide. Where can I get an up-to-date example of the full set of env vars?
mogery · 13mo ago
Nowhere, unfortunately. We want to update the guide at some point, but we're super busy maintaining v1 right now. That change is mostly it, though.
thousandmiles (OP) · 13mo ago
no worries then, functionality is the key, bravo!
