Firecrawl · 14mo ago
stan

(YC W24) Inconsistent crawl results between prod and local

I'm testing a crawl of https://fanfiction.net with a local script before I switch to the Firecrawl API, but I'm getting different results. Locally I'm running Firecrawl with the Docker setup: docker compose up, following the SELF_HOST.md instructions, with the default .env variables and no DB. I start the crawl with the following command, which returns a 200 response with the jobId:
curl -X POST http://localhost:3002/v0/crawl \
-H 'Content-Type: application/json' \
-d '{"url": "https://www.fanfiction.net/anime/Naruto/", "crawlerOptions": {"limit": 5}}'
{"jobId":"b2338037-55cb-4a23-b17b-d09d41436266"}
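For anyone scripting this instead of using curl, here is a minimal Python sketch of the same request. It assumes the self-hosted instance is listening on http://localhost:3002 as in the command above; the function names build_crawl_request and submit_crawl are made up for illustration and are not part of Firecrawl.

```python
import json
import urllib.request


def build_crawl_request(base_url, target_url, limit):
    """Build the same POST request as the curl command above."""
    payload = {"url": target_url, "crawlerOptions": {"limit": limit}}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v0/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_crawl(base_url, target_url, limit=5):
    """Send the crawl request and return the jobId from the JSON response."""
    request = build_crawl_request(base_url, target_url, limit)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["jobId"]
```

Calling submit_crawl("http://localhost:3002", "https://www.fanfiction.net/anime/Naruto/") should return a jobId string like the one shown above, provided the local instance is running.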
The response I'm seeing from the completed job in the MQ at http://localhost:3002/admin/@/queues/queue/web-scraper?status=completed is:
{
  "jobData": {
    "url": "https://www.fanfiction.net/",
    "mode": "crawl",
    "crawlerOptions": {
      "allowBackwardCrawling": false,
      "limit": 5
    },
    "pageOptions": {
      "onlyMainContent": false,
      "includeHtml": false,
      "removeTags": [],
      "parsePDF": true
    },
    "origin": "api"
  },
  "returnValue": []
}
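To spot this symptom across several jobs without reading each record by eye, a small check like the one below may help. It is only a sketch: it assumes the job record has been copied out of the queue page as JSON in the shape shown above, and crawl_returned_nothing is a hypothetical helper name, not a Firecrawl function.

```python
import json


def crawl_returned_nothing(job):
    """Return True when a job record has a missing or empty returnValue."""
    return not job.get("returnValue")


# The completed job record from the queue page, pasted as a JSON string.
job_record = json.loads("""
{
  "jobData": {
    "url": "https://www.fanfiction.net/",
    "mode": "crawl",
    "crawlerOptions": {"allowBackwardCrawling": false, "limit": 5},
    "pageOptions": {
      "onlyMainContent": false,
      "includeHtml": false,
      "removeTags": [],
      "parsePDF": true
    },
    "origin": "api"
  },
  "returnValue": []
}
""")

if crawl_returned_nothing(job_record):
    print(f"No pages returned for {job_record['jobData']['url']}")
```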
However, when I crawl from the Playground: https://www.firecrawl.dev/playground?url=https%3A%2F%2Fwww.fanfiction.net%2F&mode=crawl&limit=5&excludes=&includes=&returnOnlyUrls=false&ignoreSitemap=false&maxDepth=&onlyMainContent=false&includeHtml=false&removeTags=&onlyIncludeTags=&waitFor= I get the expected results. Can you tell me what the difference is between running /crawl locally and in the Playground environment? I tried crawling other websites locally (like mendable.ai) and they were crawled correctly, with a reasonable returnValue.
1 Reply
Caleb · 14mo ago
Hey there rachael! On the cloud-hosted version, we use fire-engine, a custom-built scraping service that does a better job of grabbing content. Also, nice to see someone from YC 🙂 🟧
