Extract all PDF URLs from webpage

Hi, guys! I've been testing the /scrape endpoint here with the json format to extract PDF files from webpages. For pages with a small number of files it's working very well, but for pages with lots of files it sometimes returns only a few and other times it returns the full set. Since it's AI-generated some variance is expected, but is there a way to pass some more instructions to the AI during the API call? I'm using the Python SDK.
Gaurav Chadha · 3mo ago
Hi @yagomlcarvalho, you can set only_main_content=false and timeout=120000; this should give better results with more context. Also, update the prompt to something like: "Extract all PDF file links on the page. Include every link you can find, even if hidden deep in the content. Return them in a list." This gives the LLM better context.
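For reference, those options can be collected into one request-parameters dict before passing them to the scrape call. A sketch only: the exact keyword names (only_main_content, timeout, prompt) may differ between Firecrawl SDK versions, so check the one you have installed.

```python
def build_scrape_params() -> dict:
    """Assemble the suggested scrape options into a single dict.

    Parameter names follow the Python SDK's snake_case style, but are
    not verified against a specific release -- treat as illustrative.
    """
    return {
        "formats": ["json"],
        "only_main_content": False,  # also include links outside the main content
        "timeout": 120000,           # ms; give slow pages time to render
        "prompt": (
            "Extract all PDF file links on the page. Include every link "
            "you can find, even if hidden deep in the content. "
            "Return them in a list."
        ),
    }

params = build_scrape_params()
print(params["only_main_content"], params["timeout"])
```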
micah.stairs · 3mo ago
Instead of using JSON mode for this, I would just recommend asking for formats=["links"] in the request and then filtering out the non-PDF links on your side. And yeah, setting onlyMainContent=false would be good here (since we only would return links from the main content of the page otherwise, which is sometimes incorrect for certain sites).
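A minimal client-side filter along those lines, assuming `response.links` comes back as a plain list of URL strings (the attribute name may vary by SDK version):

```python
from urllib.parse import urlparse

def filter_pdf_links(links: list[str]) -> list[str]:
    """Keep only URLs whose path ends in .pdf (query strings ignored)."""
    return [
        url for url in links
        if urlparse(url).path.lower().endswith(".pdf")
    ]

# Stand-in for response.links from a scrape with formats=["links"].
links = [
    "https://example.com/reports/annual.pdf",
    "https://example.com/about",
    "https://example.com/files/abc123.PDF?download=1",
]
print(filter_pdf_links(links))
# -> ['https://example.com/reports/annual.pdf',
#     'https://example.com/files/abc123.PDF?download=1']
```

Parsing with `urlparse` rather than a plain `endswith` on the whole URL means query strings like `?download=1` don't hide the `.pdf` extension.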
yagomlcarvalho (OP) · 3mo ago
Thanks for the tips, guys. My idea with JSON mode is to get some more info about each file, not only the link itself. For example, some file names are just hashes, so it would be nice to have something else about them. Even after adding the prompt, though, some results are not good. I'm using both "links" and "json" formats, and the crazy thing is that I can get the links correctly in response.links but not in the response.json field. Here is the pydantic model I'm using:
from typing import List

from pydantic import BaseModel


class PdfFile(BaseModel):
    """Model for the PDF file."""

    file_url: str
    descriptive_unique_file_title: str
    file_date_yyyy_mm_dd: str


class PdfFilesPageResults(BaseModel):
    """Model for the PDF files page results."""

    pdf_files: List[PdfFile]
My idea is to use the LLM's power to get a better filename than just the last part of the URL, and maybe the date the file was released. Also, thanks for the quick reply and sorry for not replying back as quickly as you guys did! Got swallowed by other subjects here.
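One way to debug the empty json field is to validate the raw `response.json` payload locally against the same models. A sketch assuming pydantic v2's `model_validate`, with the models redefined so the snippet is self-contained and a hypothetical payload standing in for the real response:

```python
from typing import List

from pydantic import BaseModel


class PdfFile(BaseModel):
    file_url: str
    descriptive_unique_file_title: str
    file_date_yyyy_mm_dd: str


class PdfFilesPageResults(BaseModel):
    pdf_files: List[PdfFile]


# Hypothetical payload shaped like response.json from the scrape call.
payload = {
    "pdf_files": [
        {
            "file_url": "https://example.com/files/abc123.pdf",
            "descriptive_unique_file_title": "Annual Report 2023",
            "file_date_yyyy_mm_dd": "2023-12-31",
        }
    ]
}

# Raises pydantic.ValidationError if the LLM output doesn't match the schema.
results = PdfFilesPageResults.model_validate(payload)
print(results.pdf_files[0].descriptive_unique_file_title)
```

If validation passes but `pdf_files` is short, the loss is happening before structuring, on the LLM side, rather than in the schema parsing.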
micah.stairs · 3mo ago
The JSON format uses the markdown format as input to the LLM, which is why the links are missing from there.
yagomlcarvalho (OP) · 3mo ago
Hmm, so the markdown is failing to detect the links? Is there a way to instruct the LLM on the expected behaviour?
micah.stairs · 3mo ago
Can you try asking for the markdown format to see if the links are there or not?
yagomlcarvalho (OP) · 3mo ago
Let me check here
micah.stairs · 3mo ago
If it's not, can you try setting onlyMainContent to false?
yagomlcarvalho (OP) · 3mo ago
onlyMainContent is false already. The markdown contains the links, so maybe something is going sideways during the schema parsing/structuring?
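A quick way to confirm that the links drop out at the structuring step rather than in the markdown is to diff the PDF URLs found in the markdown against those in the JSON field. A sketch with hypothetical data standing in for the two response fields:

```python
import re

def pdf_urls_in_markdown(markdown: str) -> set[str]:
    """Pull .pdf URLs out of scraped markdown with a simple regex."""
    return set(re.findall(r"https?://\S+?\.pdf", markdown, flags=re.IGNORECASE))

# Stand-ins for response.markdown and response.json["pdf_files"].
markdown = (
    "[Report](https://example.com/a.pdf) and "
    "[Slides](https://example.com/b.pdf)"
)
json_files = [{"file_url": "https://example.com/a.pdf"}]

missing = pdf_urls_in_markdown(markdown) - {f["file_url"] for f in json_files}
print(missing)  # URLs present in the markdown but dropped during structuring
```

If `missing` is non-empty while the markdown looks complete, the loss is on the LLM structuring side, which matches what you're seeing.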