Crawler stopped after encountering a 404

I am using start_crawl_and_watch. When the crawler encounters a broken page (404 error), it stops with websockets.exceptions.ConnectionClosedError: received 3000 (registered) {"type":"error"}; then sent 3000 (registered) {"type":"error"} and never calls back into the on_error hook.
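As a stopgap, the abrupt close can at least be caught on the client side so it doesn't take the whole script down. A minimal sketch, assuming the watcher object returned by crawl_url_and_watch and the ConnectionClosedError from the websockets package shown in the traceback above (the helper name is made up for illustration):

from websockets.exceptions import ConnectionClosedError

async def safe_watch(watcher):
    # Wraps watcher.connect() so an abrupt server-side close (close code 3000
    # in the traceback above) is caught instead of killing the whole script.
    try:
        await watcher.connect()
    except ConnectionClosedError as exc:
        # Surface the failure the way the on_error hook would have; whether
        # the crawl can be resumed afterwards depends on the SDK.
        print(f"crawl watcher closed early: {exc}")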
18 Replies
Adobe.Flash · 13mo ago
ccing @mogery here to take a look
mogery · 13mo ago
@sushilsainju can you send me the ID of the crawl please?
sushilsainju (OP) · 13mo ago
0eb7a6c7-0c05-4557-ad5b-59da52b3174e. The failed session doesn't appear in the activity log either, and it seems to have used up a significant number of my credits: almost 10K credits per request, which means I've nearly used about 30K credits on these start_crawl_and_watch calls without any useful result.
sushilsainju (OP) · 13mo ago
Attached are the screenshots of my latest usage and activity logs.
Adobe.Flash · 13mo ago
Hey @sushilsainju, while mogery investigates, can you DM me your email? I will put the 30k credits back into your account.
sushilsainju (OP) · 13mo ago
It's sushil@whitehatengineering.com. Hope you guys can fix the issue soon.
Adobe.Flash · 13mo ago
Thanks! Just added 30k there 🙂 Yes we will! @mogery is on it!
sushilsainju (OP) · 13mo ago
Thanks! Is there any setting/config that would save the content of each scraped page to an individual text file? I am trying to achieve this in the on_document hook.
I think this should be 40K credits. As you can see in my activity log I've barely used around 5K credits, so I think I lost the rest on the failed start_crawl_and_watch runs.
We are still trying to test out Firecrawl for our product; any update on this @mogery @Adobe.Flash?
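On the file-saving question above: there does not appear to be a built-in per-page file output, but an on_document handler along these lines can do it. The field names (data, markdown, metadata.sourceURL) and the output folder are assumptions and may differ between firecrawl-py versions, so inspect the detail payload once before relying on them:

import os
import re

OUTPUT_DIR = "crawled_pages"  # arbitrary folder name for this sketch
os.makedirs(OUTPUT_DIR, exist_ok=True)

def on_document(detail):
    # Field names are assumptions; print(detail) once to confirm where the
    # markdown and source URL actually live in your SDK version.
    doc = detail.get("data", detail)
    markdown = doc.get("markdown", "")
    url = doc.get("metadata", {}).get("sourceURL", "unknown-url")

    # Derive a filesystem-safe filename from the page URL.
    filename = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".txt"
    with open(os.path.join(OUTPUT_DIR, filename), "w", encoding="utf-8") as f:
        f.write(markdown)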
mogery · 13mo ago
Should be fixed. Crawls don't fail anymore
sushilsainju (OP) · 13mo ago
I'm still facing issues with crawl_url_and_watch
It crashes unexpectedly, and I cannot find the failed crawl requests on the dashboard. It does not call the on_error hook either.
sushilsainju (OP) · 13mo ago
sushilsainju (OP) · 13mo ago
async def start_crawl_and_watch():

    exclude_paths_with_wildcards = [f"{path}/*" for path in paths_to_skip]

    watcher = app.crawl_url_and_watch(
        "https://seattle.gov/sdci",
        params={
            "excludePaths": exclude_paths_with_wildcards,
            "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
            "allowExternalLinks": True,
            "allowBackwardLinks": True,
            "limit": 100,
        },
    )

    # Add event listeners
    watcher.add_event_listener("document", on_document)
    watcher.add_event_listener("error", on_error)
    watcher.add_event_listener("done", on_done)

    await watcher.connect()
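For reference, a sketch of how a coroutine like the one above is typically driven from a plain script; the nest_asyncio lines are only needed when an event loop is already running (e.g. in a notebook) and are left commented out here:

import asyncio

# If an event loop is already running (e.g. inside Jupyter), uncomment:
# import nest_asyncio
# nest_asyncio.apply()

asyncio.run(start_crawl_and_watch())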
Caleb · 12mo ago
Hey sushil, sorry for the late turnaround on this. cc'ing @mogery to take another look.
sushilsainju (OP) · 12mo ago
We are planning to cancel our subscription until this issue is fixed; we are not getting what we expected.
Caleb · 12mo ago
Hey, sorry again. We're working on it asap.
sushilsainju (OP) · 12mo ago
I hope our monthly subscription will be extended, since we haven't been able to use the service properly.
Hi, any updates on this? I was using the playground to crawl the website and it was in progress, scraping around 9,000+ pages. I refreshed the page and navigated to the activity logs, but I could not find my latest crawl in the activity log table, nor can I see it in my usage charts.
mogery · 12mo ago
Crawls are added to the activity log once the entire run is finished.
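If you need visibility before the run finishes, one option is to poll the crawl from the SDK instead of the dashboard. A minimal sketch, assuming firecrawl-py's check_crawl_status method and dict-style fields such as status, completed, and total (names may differ between SDK versions):

# crawl_id is the ID returned when the crawl was started (like the one
# quoted earlier in this thread); app is the same FirecrawlApp instance.
status = app.check_crawl_status(crawl_id)
print(status.get("status"), status.get("completed"), status.get("total"))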
sushilsainju (OP) · 12mo ago
how about this?
