Extract all PDF URLs from webpage

Hi, guys! I've been testing the /scrape endpoint here with the json format to extract PDF files from webpages. For pages with a small number of files it's working very well, but for pages with lots of files it sometimes returns only a few and other times it returns the full set. Since it's AI-generated some variance is expected, but is there a way to pass some more instructions to the AI during the API call? I'm using the Python SDK.
Gaurav Chadha · 3mo ago
Hi @yagomlcarvalho, you can set only_main_content=false and timeout=120000; this should give better results with more context. Also, update the prompt to something like: "Extract all PDF file links on the page. Include every link you can find, even if hidden deep in the content. Return them in a list." This gives the LLM better context.
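For reference, those options can be collected into one request-parameters dict before passing them to the scrape call. A sketch only: the exact keyword names (only_main_content, timeout, prompt) may differ between Firecrawl SDK versions, so check the one you have installed.

```python
def build_scrape_params() -> dict:
    """Assemble the suggested scrape options into a single dict.

    Parameter names follow the Python SDK's snake_case style, but are
    not verified against a specific release -- treat as illustrative.
    """
    return {
        "formats": ["json"],
        "only_main_content": False,  # also include links outside the main content
        "timeout": 120000,           # ms; give slow pages time to render
        "prompt": (
            "Extract all PDF file links on the page. Include every link "
            "you can find, even if hidden deep in the content. "
            "Return them in a list."
        ),
    }

params = build_scrape_params()
print(params["only_main_content"], params["timeout"])
```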
micah.stairs · 3mo ago
Instead of using JSON mode for this, I would just recommend asking for formats=["links"] in the request and then filtering out the non-PDF links on your side. And yeah, setting onlyMainContent=false would be good here (since we only would return links from the main content of the page otherwise, which is sometimes incorrect for certain sites).
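A minimal client-side filter along those lines, assuming `response.links` comes back as a plain list of URL strings (the attribute name may vary by SDK version):

```python
from urllib.parse import urlparse

def filter_pdf_links(links: list[str]) -> list[str]:
    """Keep only URLs whose path ends in .pdf (query strings ignored)."""
    return [
        url for url in links
        if urlparse(url).path.lower().endswith(".pdf")
    ]

# Stand-in for response.links from a scrape with formats=["links"].
links = [
    "https://example.com/reports/annual.pdf",
    "https://example.com/about",
    "https://example.com/files/abc123.PDF?download=1",
]
print(filter_pdf_links(links))
# -> ['https://example.com/reports/annual.pdf',
#     'https://example.com/files/abc123.PDF?download=1']
```

Parsing with `urlparse` rather than a plain `endswith` on the whole URL means query strings like `?download=1` don't hide the `.pdf` extension.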
yagomlcarvalho (OP) · 3mo ago
Thanks for the tips, guys. My idea with JSON mode is to get some more info about each file, not only the link itself. For example, some file names are just hashes, so it would be nice to have something else about them. Even after adding the prompt, though, some results are not good. I'm using both "links" and "json" formats, and the crazy thing is that I can get the links correctly in response.links but not in the response.json field. Here is the pydantic model I'm using:
from typing import List

from pydantic import BaseModel


class PdfFile(BaseModel):
    """Model for the PDF file."""

    file_url: str
    descriptive_unique_file_title: str
    file_date_yyyy_mm_dd: str


class PdfFilesPageResults(BaseModel):
    """Model for the PDF files page results."""

    pdf_files: List[PdfFile]
My idea is to use the LLM's power to get a better filename than just the last part of the URL, and maybe the date the file was released. Also, thanks for the quick reply and sorry for not replying back as quickly as you guys did! Got swallowed by other subjects here.
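One way to debug the empty json field is to validate the raw `response.json` payload locally against the same models. A sketch assuming pydantic v2's `model_validate`, with the models redefined so the snippet is self-contained and a hypothetical payload standing in for the real response:

```python
from typing import List

from pydantic import BaseModel


class PdfFile(BaseModel):
    file_url: str
    descriptive_unique_file_title: str
    file_date_yyyy_mm_dd: str


class PdfFilesPageResults(BaseModel):
    pdf_files: List[PdfFile]


# Hypothetical payload shaped like response.json from the scrape call.
payload = {
    "pdf_files": [
        {
            "file_url": "https://example.com/files/abc123.pdf",
            "descriptive_unique_file_title": "Annual Report 2023",
            "file_date_yyyy_mm_dd": "2023-12-31",
        }
    ]
}

# Raises pydantic.ValidationError if the LLM output doesn't match the schema.
results = PdfFilesPageResults.model_validate(payload)
print(results.pdf_files[0].descriptive_unique_file_title)
```

If validation passes but `pdf_files` is short, the loss is happening before structuring, on the LLM side, rather than in the schema parsing.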
micah.stairs · 3mo ago
The JSON format uses the markdown format as input to the LLM, which is why the links are missing from there.
yagomlcarvalho (OP) · 3mo ago
Hmm, so the markdown is failing to detect the links? Is there a way to instruct the LLM on the expected behaviour?
micah.stairs · 3mo ago
Can you try asking for the markdown format to see if the links are there or not?
yagomlcarvalho (OP) · 3mo ago
Let me check here
micah.stairs · 3mo ago
If it's not, can you try setting onlyMainContent to false?
yagomlcarvalho (OP) · 3mo ago
onlyMainContent is false already. The markdown contains the links, so maybe something is going sideways during the schema parsing/structuring?
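A quick way to confirm that the links drop out at the structuring step rather than in the markdown is to diff the PDF URLs found in the markdown against those in the JSON field. A sketch with hypothetical data standing in for the two response fields:

```python
import re

def pdf_urls_in_markdown(markdown: str) -> set[str]:
    """Pull .pdf URLs out of scraped markdown with a simple regex."""
    return set(re.findall(r"https?://\S+?\.pdf", markdown, flags=re.IGNORECASE))

# Stand-ins for response.markdown and response.json["pdf_files"].
markdown = (
    "[Report](https://example.com/a.pdf) and "
    "[Slides](https://example.com/b.pdf)"
)
json_files = [{"file_url": "https://example.com/a.pdf"}]

missing = pdf_urls_in_markdown(markdown) - {f["file_url"] for f in json_files}
print(missing)  # URLs present in the markdown but dropped during structuring
```

If `missing` is non-empty while the markdown looks complete, the loss is on the LLM structuring side, which matches what you're seeing.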