Firecrawl17mo ago
Magick

`crawl` results in `waiting` but `scrape` works

Hello, when running locally I'm able to scrape successfully using curl. However, if I try the crawl endpoint, it results in a job that is constantly waiting. Is this because it depends on ScrapingBee? I do see the following log, which may be relevant:
Corepack is about to download https://registry.npmjs.org/pnpm/-/pnpm-9.1.4.tgz

> firecrawl-scraper-js@1.0.0 start:production /app
> tsc && node dist/src/index.js

Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Web scraper queue created
Server listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues

1. Make sure Redis is running on port 6379 by default
2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
Attempted to access Supabase client when it's not configured.
Error logging crawl job:
Error: Supabase client is not configured.
at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
at crawlController (/app/dist/src/controllers/crawl.js:87:40)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
[Playwright] Error fetching url: https://www.stuff.co.nz/ with status: 404
Falling back to fetch
WARNING - You're bypassing authentication
The 404 status seems misleading, as the same URL works from the scrape endpoint.
33 Replies
Adobe.Flash
Adobe.Flash17mo ago
@Magick you need to run the workers separately. Try doing npm run workers in a separate terminal.
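
For reference, a rough sketch of what that looks like when running the API manually (assuming you are in apps/api and have installed dependencies):

# Terminal 1 - the API server
npm run start

# Terminal 2 - the queue workers that actually process crawl jobs
npm run workers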
Magick
MagickOP17mo ago
Thanks @Adobe.Flash - if I'm using docker compose up to start, which container should I run this command in?
Adobe.Flash
Adobe.Flash17mo ago
Oh, I see. I think it should have automatically handled that for you. Ccing @rafaelmiller, who can probably help you better in this area.
rafaelmiller
rafaelmiller17mo ago
@Magick you should have a worker container running automatically when you run docker compose at the root. Can you confirm that this container is running? You can use docker ps to check.
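
For example, something along these lines should show a worker container next to api, playwright-service, and redis (exact names depend on your compose project name):

docker compose ps
# or filter for it directly
docker ps --filter "name=worker"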
Magick
MagickOP16mo ago
Hi @rafaelmiller - yes, I do see the worker container running. I also see this in the api container: api-1 | Worker 73 listening on port 3002. As I am running docker compose up (not including the -d), I'm seeing all output from the running containers. The last log output I see from the worker container is:
worker-1 | Web scraper queue created
worker-1 | Connected to Redis Session Store!
worker-1 | Web scraper queue created
worker-1 | Connected to Redis Session Store!
It seems as if it never gets the queue message from the API. When I send a crawl request, this is the only thing logged:
Error logging crawl job:
api-1 | Error: Supabase client is not configured.
api-1 | at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
api-1 | at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
api-1 | at crawlController (/app/dist/src/controllers/crawl.js:92:40)
api-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
lvjiajjjlvjia
lvjiajjjlvjia16mo ago
I also encountered this problem
Janice
Janice15mo ago
I am also running into this problem
Nijn
Nijn15mo ago
I'm also running into this problem... scraping works but crawling keeps timing out. The update you sent is only for the people using the API key, right? Not for self-host?
Adobe.Flash
Adobe.Flash15mo ago
Correct, are you having this issue while self-hosting?
Nijn
Nijn15mo ago
Yes I am, I'm trying to do it via Docker now.
Adobe.Flash
Adobe.Flash15mo ago
Make sure that you are running the workers
Nijn
Nijn15mo ago
MaxRetriesPerRequestError: Reached the max retries per request limit (which is 20). Refer to "maxRetriesPerRequest" option for details.
at Socket.<anonymous> (C:\Users\Patrick\Desktop\firecrawl-main\apps\api\node_modules\.pnpm\ioredis@5.4.1\node_modules\ioredis\built\redis\event_handler.js:182:37)
at Object.onceWrapper (node:events:633:26)
at Socket.emit (node:events:518:28)
at Socket.emit (node:domain:488:12)
at TCP.<anonymous> (node:net:337:12)
Adobe.Flash
Adobe.Flash15mo ago
Got it, if you are doing it manually, run npm run start and npm run workers in separate terminals. Hmm, do you have Redis running?
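
If you're unsure, a quick way to check (assuming redis-cli or Docker is installed locally):

# Should print PONG if Redis is reachable on the default port
redis-cli -h 127.0.0.1 -p 6379 ping

# If nothing is listening, start one with Docker
docker run -d -p 6379:6379 redis:alpine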
Nijn
Nijn15mo ago
Yes, I did that and had the workers all online. Quick question in between: I got it running in Docker Desktop now, but which .env is it using? The one from the folder I composed it from?
Adobe.Flash
Adobe.Flash15mo ago
Gotcha, I believe it should be using the one from the apps/api folder.
Nijn
Nijn15mo ago
When I run docker-compose up, it uses this:
name: firecrawl
version: '3.9'

x-common-service: &common-service
  build: apps/api
  networks:
    - backend
  environment:
    - REDIS_URL=${REDIS_URL:-redis://redis:6379}
    - REDIS_RATE_LIMIT_URL=${REDIS_URL:-redis://redis:6379}
    - PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
    - USE_DB_AUTHENTICATION=${USE_DB_AUTHENTICATION}
    - PORT=${PORT:-3002}
    - NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
    - OPENAI_API_KEY=${OPENAI_API_KEY}
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    - SERPER_API_KEY=${SERPER_API_KEY}
    - LLAMAPARSE_API_KEY=${LLAMAPARSE_API_KEY}
    - LOGTAIL_KEY=${LOGTAIL_KEY}
    - BULL_AUTH_KEY=${BULL_AUTH_KEY}
    - TEST_API_KEY=${TEST_API_KEY}
    - POSTHOG_API_KEY=${POSTHOG_API_KEY}
    - POSTHOG_HOST=${POSTHOG_HOST}
    - SUPABASE_ANON_TOKEN=${SUPABASE_ANON_TOKEN}
    - SUPABASE_URL=${SUPABASE_URL}
    - SUPABASE_SERVICE_TOKEN=${SUPABASE_SERVICE_TOKEN}
    - SCRAPING_BEE_API_KEY=${SCRAPING_BEE_API_KEY}
    - HOST=${HOST:-0.0.0.0}
    - SELF_HOSTED_WEBHOOK_URL=${SELF_HOSTED_WEBHOOK_URL}
  extra_hosts:
    - "host.docker.internal:host-gateway"

services:
  playwright-service:
    build: apps/playwright-service
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
    ports:
      - "3002:3002"
    command: [ "pnpm", "run", "start:production" ]

  worker:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge
And so it creates another env; I just can't seem to find my way around Docker Desktop, haha.
Nijn
Nijn15mo ago
Okay, so when I run docker-compose config, it prints my .env file and that's all correct; it states USE_DB_AUTHENTICATION: "false". But in Docker Desktop it shows this in the logs:
(screenshot attached)
rafaelmiller
rafaelmiller15mo ago
Hey @Nijn! This looks like a warning message; are you able to run crawl or scrape?
Nijn
Nijn15mo ago
I tried opening the workers and queue in 2 separate cmds. When I tried to POST via Python, the scrape worked but the crawl keeps timing out. I'm trying to compose it into Docker Desktop to see if it works from there, but for some reason it takes a different env. Okay, so I got it to work in Docker now by deleting and re-composing it! For some reason, when using cmd I needed to change .env.local to .env, and in Docker it probably needed the .local.
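
For reference, a minimal apps/api/.env sketch for self-hosting (only the variables the compose file above actually reads; the others can stay empty):

PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
USE_DB_AUTHENTICATION=false
# When running the API outside Docker, point Redis and the Playwright service at localhost instead of the compose service names.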
rafaelmiller
rafaelmiller15mo ago
Oh ok! Does crawl work now?
Nijn
Nijn15mo ago
Crawl now gives a jobId, which it never has before, so that's a step forward! However, I now get this:
rafaelmiller
rafaelmiller15mo ago
Oh, I think there's a bug there, 1 sec
Nijn
Nijn15mo ago
Alright. Okay, so despite the error it does work; I can retrieve it by job id.
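
For reference, retrieving a crawl by job id looks roughly like this against a self-hosted instance (assuming the v0 endpoints on port 3002; adjust the URL and port to your setup):

# Start a crawl and note the returned jobId
curl -s -X POST http://localhost:3002/v0/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

# Poll the job status until it is completed
curl -s http://localhost:3002/v0/crawl/status/<jobId>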
rafaelmiller
rafaelmiller15mo ago
I just pushed a fix for this error; you can update your firecrawl repo to solve that. Awesome
Nijn
Nijn15mo ago
Thanks so much! If I find any more errors/bugs I will let you know! Really cool what you guys are working on!
Adobe.Flash
Adobe.Flash15mo ago
Awesome that it worked! And thank you! 🔥
Nijn
Nijn15mo ago
Question, how do I use llm extract locally? And are these inputs correct for Python?
(screenshots attached)
Nijn
Nijn15mo ago
Or is llm extract only for the scrape function? I see that you need an API key for llm extraction, which seems logical because it is run on your network. So I'm trying out the markdown function, but for some reason the markdown is the same as the output... And it also gives markdown with the standard crawl function.
rafaelmiller
rafaelmiller15mo ago
Hey @Nijn , to use llm extract, you need to set up the extractorOptions parameter when using the scrape functions. Also, if you're using it self-hosted, you'll need to configure the OPENAI_API_KEY in your .env file.
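
A rough sketch of what that request can look like against a self-hosted instance (field names follow the v0 scrape API of that era, so double-check them against your repo version; the URL, prompt, and schema here are just placeholders):

curl -s -X POST http://localhost:3002/v0/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "extractorOptions": {
      "mode": "llm-extraction",
      "extractionPrompt": "Extract the page title and a one-sentence summary.",
      "extractionSchema": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "summary": { "type": "string" }
        },
        "required": ["title", "summary"]
      }
    }
  }'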
Nijn
Nijn15mo ago
I'm trying to stay away from using a paid LLM like OpenAI. Is there any way to use a self-hosted LLM like Llama for llm extract?
rafaelmiller
rafaelmiller15mo ago
We have one open PR for that. It's still under review, though.
Nijn
Nijn15mo ago
And how about prioritizing certain paths instead of including them?
