Is there any method to crawl the next pages

Here is my current crawl setting:

crawl = firecrawl.crawl(
    url=i,
    max_discovery_depth=4,
    scrape_options={"formats": ["html"]},
    exclude_paths=[
        r"..jpeg$", r"..jpg$", r"..png$", r"..gif$",
        r"..webp$", r"..svg$", r"..ico$",
        r"..pdf$", r".*.xml$",
    ],
)

Is there any method to also crawl the next pages shown in the photo? Besides, is it a must to crawl again? Can I prevent repeated crawling in the new crawl? The tokens are burning so fast.
Gaurav Chadha (5w ago)
Hi @LunarTear1014, you can check out pagination: https://docs.firecrawl.dev/advanced-scraping-guide#pagination%2Fnext-url (a sketch of following the next URL is at the end of this reply).

Specifically for your case: the calendar pagination in your image should be discovered automatically by the crawler as long as the links are present in the HTML. If they're loaded via JavaScript, make sure the site renders them (Firecrawl handles JavaScript rendering by default).

Your exclude_paths regex patterns need fixing: use r".*\.jpeg$" instead of r"..jpeg$" to properly match file extensions (a quick check of the difference follows the snippet below). Also, try using include_paths to explicitly target date patterns:
crawl = firecrawl.crawl(
    url=i,
    max_discovery_depth=4,
    include_paths=[r".*nov-2[0-3].*"],  # Match NOV 20-23
    scrape_options={"formats": ["html"]},
    exclude_paths=[
        r".*\.jpeg$", r".*\.jpg$", r".*\.png$", r".*\.gif$",
        r".*\.webp$", r".*\.svg$", r".*\.ico$",
        r".*\.pdf$", r".*\.xml$",
    ],
)
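To see why the original patterns were too loose, here's a quick illustrative check with Python's re module (standalone, not part of the crawl call):

import re

# r"..jpeg$" means "any two characters followed by jpeg" - it would also
# match URLs that merely end in the letters "jpeg", not just .jpeg files.
print(re.search(r"..jpeg$", "/assets/photo.jpeg"))    # matches
print(re.search(r"..jpeg$", "/blog/about-ffjpeg"))    # also matches (unwanted)

# r".*\.jpeg$" escapes the dot, so only a literal ".jpeg" extension matches.
print(re.search(r".*\.jpeg$", "/assets/photo.jpeg"))  # matches
print(re.search(r".*\.jpeg$", "/blog/about-ffjpeg"))  # None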
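On the pagination/next-url point: for large crawls the status response is split into pages linked by a next URL. Below is a minimal sketch of draining all result pages, assuming the v1 crawl-status endpoint and a next field as the linked guide describes; FIRECRAWL_API_KEY and crawl_id are placeholders you'd fill in yourself:

import os
import requests

# Minimal sketch: collect all result pages of a finished crawl by following
# the "next" URL from each crawl-status response (per the pagination docs).
# crawl_id is a placeholder - use the id returned when you started the crawl.
api_key = os.environ["FIRECRAWL_API_KEY"]
crawl_id = "your-crawl-id"

url = f"https://api.firecrawl.dev/v1/crawl/{crawl_id}"
documents = []
while url:
    resp = requests.get(url, headers={"Authorization": f"Bearer {api_key}"})
    resp.raise_for_status()
    body = resp.json()
    documents.extend(body.get("data", []))
    url = body.get("next")  # absent on the last page, which ends the loop

print(f"Collected {len(documents)} documents")

Because this only pages through results already stored for an existing crawl, it doesn't re-crawl the site and shouldn't burn extra crawl credits.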
