Firecrawl13mo ago
Kaleb

`next` pagination using js-sdk

Using version 1.2.2, does the js-sdk have a built-in method to follow the `next` URL? I call asyncCrawlUrl() and then checkCrawlStatus() every 5 seconds until the job status is complete. However, I'm not sure how to get the next page of results from the `next` property. More detailed examples of how to use asyncCrawlUrl, checkCrawlStatus, and the `next` URL would be appreciated!
import dotenv from 'dotenv'
import FirecrawlApp, {type FirecrawlDocument} from '@mendable/firecrawl-js'

dotenv.config()
const apiKey = process.env.FIRECRAWL_API_KEY
const firecrawl = new FirecrawlApp({apiKey})

const crawlUrl = async (url: string) => {
  // start the crawl
  const crawlResponse = await firecrawl.asyncCrawlUrl(url, crawlerConfig)

  if (!crawlResponse.success) {
    throw new Error(`Error starting crawl: ${crawlResponse.error}`)
  }

  // loop until the crawl is complete
  let completedFlag = false
  let statusCheck = null
  const pages: FirecrawlDocument[] = []

  while (!completedFlag) {
    // Get the status
    console.log('checking status')
    statusCheck = await firecrawl.checkCrawlStatus(crawlResponse.id)

    if (!statusCheck.success) {
      throw new Error(`Error checking crawl status: ${statusCheck.error}`)
    }

    if (statusCheck.status === 'failed') {
      throw new Error('Error: crawl failed')
    }

    // Check if crawl is completed
    if (statusCheck.status === 'completed') {
      if (!Array.isArray(statusCheck.data)) {
        throw new Error('Error: crawl resulted in no data')
      }

      statusCheck.data.forEach((page) => pages.push(page))
      completedFlag = true
    } else {
      // polling interval
      await new Promise((resolve) => setTimeout(resolve, 5000))
    }
  }
  return pages
}
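
Calling the helper looks like this (the URL is just a placeholder):

// Example usage of the helper above (placeholder URL)
crawlUrl('https://example.com')
  .then((pages) => console.log(`Crawled ${pages.length} pages`))
  .catch((error) => console.error(error))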
7 Replies
mogery
mogery13mo ago
Hi there @Kaleb, this is a bug, working on this today. Will let you know when it's done.

Hi @Kaleb, we added a getAllData parameter to checkCrawlStatus. Please update to 1.3.0 and set the parameter to true. Thank you for your patience!
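
To make that concrete, here's a minimal sketch of the inside of the polling loop, assuming checkCrawlStatus takes getAllData as its second argument in 1.3.0:

// Minimal sketch assuming checkCrawlStatus(id, getAllData) in @mendable/firecrawl-js >= 1.3.0
const statusCheck = await firecrawl.checkCrawlStatus(crawlResponse.id, true)

if (!statusCheck.success) {
  throw new Error(`Error checking crawl status: ${statusCheck.error}`)
}

if (statusCheck.status === 'completed') {
  // With getAllData set to true, the SDK keeps following `next` internally,
  // so statusCheck.data should already hold every crawled document.
  pages.push(...(statusCheck.data ?? []))
  completedFlag = true
}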
Kaleb
KalebOP13mo ago
hey @mogery thank you, we will give it a shot. I appreciate the update!

Hey @mogery, we're using the new v1.3 SDK and the getAllData flag. If the response exceeds 10 MB, will we still need to handle pagination by fetching the `next` URL? I want to make sure we handle that case if possible.
mogery
mogery13mo ago
Nope, `getAllData` will fire off requests until `next` doesn't exist anymore, i.e. all data has been retrieved.
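
For anyone on an older SDK (or curious what getAllData does under the hood), here's a rough sketch of following `next` by hand. It assumes each page of the status response looks like { data, next } and that the `next` URL accepts the same API key as a Bearer token:

// Rough sketch of manual pagination; getAllData does this for you.
// Assumes each status page looks like { data: FirecrawlDocument[], next?: string }
// and that `next` accepts the API key as a Bearer token.
const collectAllPages = async (first: {data?: FirecrawlDocument[]; next?: string}) => {
  const pages: FirecrawlDocument[] = [...(first.data ?? [])]
  let nextUrl = first.next

  while (nextUrl) {
    const res = await fetch(nextUrl, {
      headers: {Authorization: `Bearer ${apiKey}`},
    })
    const page = (await res.json()) as {data?: FirecrawlDocument[]; next?: string}
    pages.push(...(page.data ?? []))
    nextUrl = page.next
  }
  return pages
}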
Kaleb
KalebOP13mo ago
ah, I see! thanks so much, that really simplifies our script.
mogery
mogery13mo ago
Yup! Should've been there from the start, sorry about that 😅
Kaleb
KalebOP13mo ago
Well I can't complain because this is still 10x easier than puppeteer 😁
Caleb
Caleb13mo ago
Love to hear, nice name btw 🙂
