Firecrawl17mo ago
Magick

`crawl` results in `waiting` but `scrape` works

Hello, when running locally I'm able to scrape successfully using curl. However, if I try the crawl endpoint, it results in a job that is constantly waiting. Is this because it depends on ScrapingBee? I do see the following log, which may be relevant:
Corepack is about to download https://registry.npmjs.org/pnpm/-/pnpm-9.1.4.tgz

> firecrawl-scraper-js@1.0.0 start:production /app
> tsc && node dist/src/index.js

Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Web scraper queue created
Server listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues

1. Make sure Redis is running on port 6379 by default
2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
Attempted to access Supabase client when it's not configured.
Error logging crawl job:
Error: Supabase client is not configured.
at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
at crawlController (/app/dist/src/controllers/crawl.js:87:40)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
[Playwright] Error fetching url: https://www.stuff.co.nz/ with status: 404
Falling back to fetch
WARNING - You're bypassing authentication
The 404 status seems misleading, as the same URL works from the scrape endpoint.
33 Replies
Adobe.Flash
Adobe.Flash17mo ago
@Magick you need to run the workers separately. Try doing npm run workers in a separate terminal.
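
For reference, a rough sketch of what that looks like when running the API manually (assuming you are in apps/api and have installed dependencies):

# Terminal 1 - the API server
npm run start

# Terminal 2 - the queue workers that actually process crawl jobs
npm run workers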
Magick
MagickOP17mo ago
Thanks @Adobe.Flash - if I'm using docker compose up to start, which container should I run this command in?
Adobe.Flash
Adobe.Flash17mo ago
Oh, I see. I think it should have automatically handled that for you. Ccing @rafaelmiller, who can probably help you better in this area.
rafaelmiller
rafaelmiller17mo ago
@Magick you should have a worker container running automatically when you run docker compose at the root. Can you confirm that this container is running? You can use docker ps to check.
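
For example, something along these lines should show a worker container next to api, playwright-service, and redis (exact names depend on your compose project name):

docker compose ps
# or filter for it directly
docker ps --filter "name=worker"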
Magick
MagickOP16mo ago
Hi @rafaelmiller - yes, I do see the worker container running. I also see this in the api container: api-1 | Worker 73 listening on port 3002. As I am running docker compose up (not including the -d), I'm seeing all output from the running containers. The last log output I see from the worker container is:
worker-1 | Web scraper queue created
worker-1 | Connected to Redis Session Store!
worker-1 | Web scraper queue created
worker-1 | Connected to Redis Session Store!
It seems as if it never gets the queue message from the API. When I send a crawl request, this is the only thing logged:
Error logging crawl job:
api-1 | Error: Supabase client is not configured.
api-1 | at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
api-1 | at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
api-1 | at crawlController (/app/dist/src/controllers/crawl.js:92:40)
api-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
lvjiajjjlvjia
lvjiajjjlvjia16mo ago
I also encountered this problem
Janice
Janice15mo ago
I am also running into this problem
Nijn
Nijn15mo ago
I'm also running into this problem... scraping works but crawling keeps timing out. The update you sent is only for the people using the API key, right? Not for self-host?
Adobe.Flash
Adobe.Flash15mo ago
Correct, are you having this issue while self-hosting?
Nijn
Nijn15mo ago
Yes I am, I'm trying to do it via Docker now.
Adobe.Flash
Adobe.Flash15mo ago
Make sure that you are running the workers
Nijn
Nijn15mo ago
MaxRetriesPerRequestError: Reached the max retries per request limit (which is 20). Refer to "maxRetriesPerRequest" option for details.
at Socket.<anonymous> (C:\Users\Patrick\Desktop\firecrawl-main\apps\api\node_modules\.pnpm\ioredis@5.4.1\node_modules\ioredis\built\redis\event_handler.js:182:37)
at Object.onceWrapper (node:events:633:26)
at Socket.emit (node:events:518:28)
at Socket.emit (node:domain:488:12)
at TCP.<anonymous> (node:net:337:12)
Adobe.Flash
Adobe.Flash15mo ago
Got it, if you are doing it manually, run npm run start and npm run workers in separate terminals. Hmm, do you have Redis running?
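
If you're unsure, a quick way to check (assuming redis-cli or Docker is installed locally):

# Should print PONG if Redis is reachable on the default port
redis-cli -h 127.0.0.1 -p 6379 ping

# If nothing is listening, start one with Docker
docker run -d -p 6379:6379 redis:alpine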
Nijn
Nijn15mo ago
Yes, I did that and had the workers all online. Quick question in between: I got it running in Docker Desktop now, but which .env is it using? The one from the folder I composed it from?
Adobe.Flash
Adobe.Flash15mo ago
Gotcha, I believe it should be using the one from the apps/api folder.
Nijn
Nijn15mo ago
When I run docker-compose up, it uses this:
name: firecrawl
version: '3.9'

x-common-service: &common-service
  build: apps/api
  networks:
    - backend
  environment:
    - REDIS_URL=${REDIS_URL:-redis://redis:6379}
    - REDIS_RATE_LIMIT_URL=${REDIS_URL:-redis://redis:6379}
    - PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
    - USE_DB_AUTHENTICATION=${USE_DB_AUTHENTICATION}
    - PORT=${PORT:-3002}
    - NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
    - OPENAI_API_KEY=${OPENAI_API_KEY}
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    - SERPER_API_KEY=${SERPER_API_KEY}
    - LLAMAPARSE_API_KEY=${LLAMAPARSE_API_KEY}
    - LOGTAIL_KEY=${LOGTAIL_KEY}
    - BULL_AUTH_KEY=${BULL_AUTH_KEY}
    - TEST_API_KEY=${TEST_API_KEY}
    - POSTHOG_API_KEY=${POSTHOG_API_KEY}
    - POSTHOG_HOST=${POSTHOG_HOST}
    - SUPABASE_ANON_TOKEN=${SUPABASE_ANON_TOKEN}
    - SUPABASE_URL=${SUPABASE_URL}
    - SUPABASE_SERVICE_TOKEN=${SUPABASE_SERVICE_TOKEN}
    - SCRAPING_BEE_API_KEY=${SCRAPING_BEE_API_KEY}
    - HOST=${HOST:-0.0.0.0}
    - SELF_HOSTED_WEBHOOK_URL=${SELF_HOSTED_WEBHOOK_URL}
  extra_hosts:
    - "host.docker.internal:host-gateway"

services:
  playwright-service:
    build: apps/playwright-service
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
    ports:
      - "3002:3002"
    command: [ "pnpm", "run", "start:production" ]

  worker:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge
And so it creates another env; I just can't seem to find my way around Docker Desktop, haha.
Nijn
Nijn15mo ago
Okay, so when I run docker-compose config, it prints my .env file and that's all correct; it states USE_DB_AUTHENTICATION: "false". But in Docker Desktop it shows this in the logs:
(screenshot attached)
rafaelmiller
rafaelmiller15mo ago
Hey @Nijn! This looks like a warning message; are you able to run crawl or scrape?
Nijn
Nijn15mo ago
I tried opening the workers and queue in 2 separate cmds. When I tried to POST via Python, the scrape worked but the crawl keeps timing out. I'm trying to compose it into Docker Desktop to see if it works from there, but for some reason it takes a different env. Okay, so I got it to work in Docker now by deleting and re-composing it! For some reason, when using cmd I needed to change .env.local to .env, and in Docker it probably needed the .local.
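
For reference, a minimal apps/api/.env sketch for self-hosting (only the variables the compose file above actually reads; the others can stay empty):

PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
USE_DB_AUTHENTICATION=false
# When running the API outside Docker, point Redis and the Playwright service at localhost instead of the compose service names.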
rafaelmiller
rafaelmiller15mo ago
Oh ok! Does crawl work now?
Nijn
Nijn15mo ago
Crawl now gives a jobId, which it never has before, so that's a step forward! However, I now get this:
rafaelmiller
rafaelmiller15mo ago
Oh, I think there's a bug there, 1 sec
Nijn
Nijn15mo ago
Alright. Okay, so despite the error it does work; I can retrieve it by job id.
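
For reference, retrieving a crawl by job id looks roughly like this against a self-hosted instance (assuming the v0 endpoints on port 3002; adjust the URL and port to your setup):

# Start a crawl and note the returned jobId
curl -s -X POST http://localhost:3002/v0/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

# Poll the job status until it is completed
curl -s http://localhost:3002/v0/crawl/status/<jobId>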
rafaelmiller
rafaelmiller15mo ago
I just pushed a fix for this error; you can update your firecrawl repo to solve that. Awesome
Nijn
Nijn15mo ago
Thanks so much! If I find any more errors/bugs I will let you know! Really cool what you guys are working on!
Adobe.Flash
Adobe.Flash15mo ago
Awesome that it worked! And thank you! 🔥
Nijn
Nijn15mo ago
Question, how do I use llm extract locally? And are these inputs correct for Python?
(screenshots attached)
Nijn
Nijn15mo ago
Or is llm extract only for the scrape function? I see that you need an API key for llm extraction, which seems logical because it is run on your network. So I'm trying out the markdown function, but for some reason the markdown is the same as the output... And it also gives markdown with the standard crawl function.
rafaelmiller
rafaelmiller15mo ago
Hey @Nijn , to use llm extract, you need to set up the extractorOptions parameter when using the scrape functions. Also, if you're using it self-hosted, you'll need to configure the OPENAI_API_KEY in your .env file.
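
A rough sketch of what that request can look like against a self-hosted instance (field names follow the v0 scrape API of that era, so double-check them against your repo version; the URL, prompt, and schema here are just placeholders):

curl -s -X POST http://localhost:3002/v0/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "extractorOptions": {
      "mode": "llm-extraction",
      "extractionPrompt": "Extract the page title and a one-sentence summary.",
      "extractionSchema": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "summary": { "type": "string" }
        },
        "required": ["title", "summary"]
      }
    }
  }'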
Nijn
Nijn15mo ago
I'm trying to stay away from using a paid LLM like OpenAI. Is there any way to use a self-hosted LLM like Llama for llm extract?
rafaelmiller
rafaelmiller15mo ago
We have one open PR for that. It's still under review, though.
Nijn
Nijn15mo ago
And how about prioritizing certain paths instead of including them?
