Firecrawl · 2mo ago
Leo

Firecrawl seems to ignore URL hash fragments for pagination.

- Site: https://europa.provincia.bz.it/it/bandi-e-avvisi (pagination via #start=N)
- Expectation (Playwright): total=15, page 0 -> 10 items, page 1 -> 5 items (selector div.result.lv_faq).
- Firecrawl scrape_url results: page 1 returns the same content as page 0 or misses p.result_status entirely.
- Tried: formats=["html"], formats=["rawHtml"], wait_for=12000–20000, max_age=0 (fresh fetch), correcting the base URL from container to host.

Ask: Does scrape_url honor initial URL hash fragments and trigger the page's JS that reads location.hash on load? Is there any flag or workaround for hash-based pagination?
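The setup described above can be sketched as follows. The Firecrawl call is commented out and hypothetical: it assumes the Python SDK's scrape_url with the parameters listed in the question, plus an API key.

```python
BASE = "https://europa.provincia.bz.it/it/bandi-e-avvisi"

def page_url(i: int) -> str:
    # Page 0 is the bare URL; later pages append the hash fragment #start=N
    return f"{BASE}#start={i * 10}" if i else BASE

print(page_url(1))  # → https://europa.provincia.bz.it/it/bandi-e-avvisi#start=10

# Hypothetical Firecrawl usage (requires the firecrawl package and an API key):
# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")
# doc = app.scrape_url(page_url(1), formats=["html"], wait_for=15000, max_age=0)
# The question is whether this returns page 1's 5 items or page 0's 10.
```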
6 Replies
Leo (OP) · 2mo ago
This is the code I use, and it works (Playwright handles the hash-based pagination correctly):
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

BASE = "https://europa.provincia.bz.it/it/bandi-e-avvisi"

def page_url(i: int) -> str:
    return f"{BASE}#start={i * 10}" if i else BASE

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ctx = browser.new_context(locale="it-IT")
    page = ctx.new_page()

    for i in [0, 1]:
        url = page_url(i)
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector("div.result.lv_faq", timeout=10000)
        if i > 0:
            page.wait_for_timeout(2000)  # ensure hash-based update applied

        html = page.content()
        soup = BeautifulSoup(html, "html.parser")
        items = soup.select("div.result.lv_faq")
        total = soup.select_one("p.result_status")
        print(f"Page {i}: items={len(items)}, total={total.get_text(strip=True) if total else None}")

    browser.close()

# Expected:
# Page 0: items=10, total=15
# Page 1: items=5, total=15
Goal: I want to switch to using Firecrawl to fetch the HTML (formats=["html"], wait_for, max_age=0) and parse it myself, so we can migrate away from direct Playwright usage in our container.
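The parsing half of that migration can be sketched independently of how the HTML is fetched. This uses the same BeautifulSoup selectors as the Playwright version; the sample markup is invented for illustration and only mirrors the page's structure.

```python
from bs4 import BeautifulSoup

def count_results(html: str):
    # Same selectors as the Playwright code: result items and the status line
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select("div.result.lv_faq")
    total = soup.select_one("p.result_status")
    return len(items), (total.get_text(strip=True) if total else None)

# Invented sample markup mirroring the page's structure:
sample = '<p class="result_status">15</p><div class="result lv_faq">Bando 1</div>'
print(count_results(sample))  # → (1, '15')
```

With this in place, the fetch layer (Playwright today, Firecrawl's html format later) only needs to hand count_results a string of HTML.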
micah.stairs · 2mo ago
Good question! We are about to roll out this feature actually: https://github.com/firecrawl/firecrawl/pull/2031.
Leo (OP) · 2mo ago
Great! Thanks for the update!
micah.stairs · 2mo ago
Hey! So we now support hash-based routes starting with "#/", but that doesn't actually cover your case (#start=N) after all. I've filed a feature request for this and I will keep you posted, but I can't promise any timelines.
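For clarity, the two fragment styles being distinguished here look like this (illustrated with urllib.parse on hypothetical URLs, not the real site):

```python
from urllib.parse import urlsplit

# SPA-style hash route: the fragment begins with "/" (the "#/" case)
spa = urlsplit("https://example.com/#/bandi/2")    # hypothetical URL
# Query-style fragment: key=value after "#" (this thread's #start=N case)
paged = urlsplit("https://example.com/#start=10")  # hypothetical URL

print(spa.fragment)    # → /bandi/2
print(paged.fragment)  # → start=10
```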
Leo (OP) · 2mo ago
Wait, actually it does! I've switched to using scrape with the html format and it works as intended.
micah.stairs · 2mo ago
Oh awesome! Thanks for letting me know
