Firecrawl · 2mo ago
Suppa

Crawl job not stopping, and token usage hasn't changed for quite some time

The code:

import requests
import time
import os
import re
from dotenv import load_dotenv

load_dotenv()
FIRE_CRAWL_API_KEY = os.getenv("FIRE_CRAWL_API_KEY")

# CONFIGURATION
CRAWL_URL = "https://api.firecrawl.dev/v2/crawl"
JOB_STATUS_URL = "https://api.firecrawl.dev/v2/crawl/status/"
OUTPUT_DIR = "crawleddata"

# Crawl job parameters
payload = {
    "url": "https://www.fhs.unizg.hr/",
    "sitemap": "include",
    "crawlEntireDomain": True,
    # Patterns intended to exclude English (/en/) pages and news pages
    "excludePaths": [
        "./en/.",
        ".*news.",
        ".news.*"
    ],
    "scrapeOptions": {
        "onlyMainContent": True,
        "maxAge": 172800000,
        "parsers": ["pdf"],
        "formats": ["markdown"]
    }
}

headers = {
    "Authorization": f"Bearer {FIRE_CRAWL_API_KEY}",
    "Content-Type": "application/json"
}

# START THE CRAWL JOB
print("Starting crawl job...")
response = requests.post(CRAWL_URL, json=payload, headers=headers)
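The snippet as posted stops right after submitting the job, even though it imports time and re and defines JOB_STATUS_URL and OUTPUT_DIR. A minimal continuation sketch of the intended polling-and-save step is below; it assumes the start response carries an "id" field and that the status endpoint returns a "status" plus a "data" list of pages with "markdown" and "metadata", which is not confirmed by the original post.

# Continuation sketch (assumptions: start response has "id"; status response
# has "status" and a "data" list of pages with "markdown"/"metadata").
job_id = response.json().get("id")
os.makedirs(OUTPUT_DIR, exist_ok=True)

while True:
    status_resp = requests.get(JOB_STATUS_URL + job_id, headers=headers).json()
    print("Crawl status:", status_resp.get("status"))
    # Assumed terminal states; adjust to whatever the API actually reports.
    if status_resp.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)  # wait between polls instead of hammering the API

# Save each crawled page's markdown, deriving a filename from its source URL.
for page in status_resp.get("data", []):
    url = page.get("metadata", {}).get("sourceURL", "page")
    filename = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".md"
    with open(os.path.join(OUTPUT_DIR, filename), "w", encoding="utf-8") as f:
        f.write(page.get("markdown", ""))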
9 Replies
Gaurav Chadha · 2mo ago
Hi @Suppa, this works for me on the playground with the same configuration. Is it just the python-sdk that isn't working for you?
Suppa (OP) · 2mo ago
@Gaurav Chadha The playground may work because it runs a limited scrape, but there are over 1,000 pages on that domain, and when I run it with my own credits it does not stop at all. I worry that it veered off the main page onto a different domain. The activity log says the crawl is still running, yet my credits stay at 83% with nothing changing.
Gaurav Chadha · 2mo ago
@Suppa I checked your job ID: it processed 3 PDF documents and then the job timed out at 21:09:38 UTC. Don't worry, your credits were not consumed and the job has been terminated in the backend. I'll share this with the team so we handle the status for timeouts and failures properly.
Suppa (OP) · 2mo ago
Thank you. Could you please look into it again and confirm whether it produced several thousand markdown documents? The PDFs are not much of a concern, but I needed markdown of the entire site (excluding news URLs) for our college chatbot. I paid 100 USD, and that was our entire budget. I can only do this through a SaaS like yours, because the site is built with jQuery and other older technologies with fragmented URLs, so the scrapers I write myself are of little use. Do you think I could get those files on the Activity Log page by the end of the week? I already have the pipeline built; I just need the markdown files to complete the project. I started the project at around 10% of 100,000 credits and am now at 83% of 100,000 after that single job.

EDIT: Now that I think about it, did you say it only processed PDF files? I'm a bit worried in case large PDFs were posted there, as that was not my intention; I only wanted to scrape the site content. I was instructed to scrape some PDFs with guidelines but didn't consider that there might be entire books there. Thank you.
Gaurav Chadha · 2mo ago
@Suppa The status should be set to failed, or the job removed from there. Can you please refresh and check again? It should not have cost you credits if the request was unsuccessful. As for the PDFs, they were only parsed as raw data and didn't go through. You can refresh and make a new request via the API.
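For anyone following along: before resubmitting, one way to confirm what the backend currently reports for a job is a plain status request against the JOB_STATUS_URL constant from the original snippet. This is a sketch only; "YOUR-JOB-ID" is a placeholder, and the response fields are assumed, not confirmed by this thread.

# Hypothetical one-off status check before retrying a crawl.
job_id = "YOUR-JOB-ID"  # placeholder for the real job ID
resp = requests.get(JOB_STATUS_URL + job_id, headers=headers)
print(resp.status_code, resp.json().get("status"))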
Suppa (OP) · 2mo ago
Thank you for your involvement in the matter. As you can see, I can't download files from that job. I tested with 3 jobs and spent 10,000 credits, then ran the 4th one, which does not even show as failed. Could you please refund the credits for the job that failed so I can try to run it again? This time I will go over the parameters with you before running, just to be sure I didn't miss something. Even though I don't have the crawl status or the files, it still spent credits.
Gaurav Chadha · 2mo ago
Can you please also share your email?
Suppa (OP) · 2mo ago
Sent. Thank you for your prompt involvement in this matter.
Gaurav Chadha · 2mo ago
@Suppa we'll reimburse the wasted credits; I've shared it with the team. Could you please also send us an email at help@firecrawl.dev? You can add a link to this conversation. Also, this is a known bug that will be fixed by this week.
