Firecrawl · 2mo ago
Leo

Firecrawl seems to ignore URL hash fragments for pagination.

- Site: https://europa.provincia.bz.it/it/bandi-e-avvisi (pagination via #start=N)
- Expectation (Playwright): total=15, page 0 -> 10 items, page 1 -> 5 items (selector div.result.lv_faq).
- Firecrawl scrape_url results: page 1 returns the same content as page 0 or misses p.result_status entirely.
- Tried: formats=["html"], formats=["rawHtml"], wait_for=12000–20000, max_age=0 (fresh fetch), correcting the base URL from container to host.

Ask: Does scrape_url honor initial URL hash fragments and trigger the page's JS that reads location.hash on load? Is there any flag or workaround for hash-based pagination?
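The setup described above can be sketched as follows. The Firecrawl call is commented out and hypothetical: it assumes the Python SDK's scrape_url with the parameters listed in the question, plus an API key.

```python
BASE = "https://europa.provincia.bz.it/it/bandi-e-avvisi"

def page_url(i: int) -> str:
    # Page 0 is the bare URL; later pages append the hash fragment #start=N
    return f"{BASE}#start={i * 10}" if i else BASE

print(page_url(1))  # → https://europa.provincia.bz.it/it/bandi-e-avvisi#start=10

# Hypothetical Firecrawl usage (requires the firecrawl package and an API key):
# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")
# doc = app.scrape_url(page_url(1), formats=["html"], wait_for=15000, max_age=0)
# The question is whether this returns page 1's 5 items or page 0's 10.
```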
6 Replies
Leo (OP) · 2mo ago
This is the code I use, and it works (Playwright handles the hash-based pagination correctly):
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

BASE = "https://europa.provincia.bz.it/it/bandi-e-avvisi"

def page_url(i: int) -> str:
    return f"{BASE}#start={i * 10}" if i else BASE

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ctx = browser.new_context(locale="it-IT")
    page = ctx.new_page()

    for i in [0, 1]:
        url = page_url(i)
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector("div.result.lv_faq", timeout=10000)
        if i > 0:
            page.wait_for_timeout(2000)  # ensure hash-based update applied

        html = page.content()
        soup = BeautifulSoup(html, "html.parser")
        items = soup.select("div.result.lv_faq")
        total = soup.select_one("p.result_status")
        print(f"Page {i}: items={len(items)}, total={total.get_text(strip=True) if total else None}")

    browser.close()

# Expected:
# Page 0: items=10, total=15
# Page 1: items=5, total=15
Goal: I want to switch to using Firecrawl to fetch the HTML (formats=["html"], wait_for, max_age=0) and parse it myself, so we can migrate away from direct Playwright usage in our container.
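The parsing half of that migration can be sketched independently of how the HTML is fetched. This uses the same BeautifulSoup selectors as the Playwright version; the sample markup is invented for illustration and only mirrors the page's structure.

```python
from bs4 import BeautifulSoup

def count_results(html: str):
    # Same selectors as the Playwright code: result items and the status line
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select("div.result.lv_faq")
    total = soup.select_one("p.result_status")
    return len(items), (total.get_text(strip=True) if total else None)

# Invented sample markup mirroring the page's structure:
sample = '<p class="result_status">15</p><div class="result lv_faq">Bando 1</div>'
print(count_results(sample))  # → (1, '15')
```

With this in place, the fetch layer (Playwright today, Firecrawl's html format later) only needs to hand count_results a string of HTML.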
micah.stairs · 2mo ago
Good question! We are about to roll out this feature actually: https://github.com/firecrawl/firecrawl/pull/2031.
Leo (OP) · 2mo ago
Great! Thanks for the update!
micah.stairs · 2mo ago
Hey! So we now support hash-based routes starting with "#/", but that doesn't actually cover your case (#start=N) after all. I've filed a feature request for this and I will keep you posted, but I can't promise any timelines.
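For clarity, the two fragment styles being distinguished here look like this (illustrated with urllib.parse on hypothetical URLs, not the real site):

```python
from urllib.parse import urlsplit

# SPA-style hash route: the fragment begins with "/" (the "#/" case)
spa = urlsplit("https://example.com/#/bandi/2")    # hypothetical URL
# Query-style fragment: key=value after "#" (this thread's #start=N case)
paged = urlsplit("https://example.com/#start=10")  # hypothetical URL

print(spa.fragment)    # → /bandi/2
print(paged.fragment)  # → start=10
```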
Leo (OP) · 2mo ago
Wait, actually it does! I've switched to using scrape with the html format and it works as intended.
micah.stairs · 2mo ago
Oh awesome! Thanks for letting me know
