Extract all PDF URLs from webpage

Hi, guys! I've been testing the /scrape endpoint with the json format to extract PDF files from webpages. For pages with a small number of files it works very well, but for pages with lots of files it sometimes returns only a few and other times it returns the full set. Since the output is AI-generated some variance is expected, but is there a way to pass more instructions to the AI during the API call? I'm using the Python SDK.
2 Replies
Gaurav Chadha · 2w ago
Hi @yagomlcarvalho, you can set only_main_content=false and timeout=120000; this should give better results with more context. Also, update the prompt to something like "Extract all PDF file links on the page. Include every link you can find, even if hidden deep in the content. Return them in a list." This gives the LLM better context.
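To make that concrete, here is a rough Python SDK sketch putting those suggestions together. The URL and API key are placeholders, and passing the prompt through a json_options field is my assumption; the exact scrape_url parameter names (and snake_case vs. camelCase) depend on your firecrawl-py version, so check the SDK docs before relying on this.

```python
# Sketch only: parameter names follow the suggestions above, but the exact
# scrape_url signature depends on your firecrawl-py version.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")  # placeholder key

result = app.scrape_url(
    "https://example.com/reports",  # placeholder page containing PDF links
    formats=["json"],
    only_main_content=False,        # don't restrict extraction to the main content
    timeout=120000,                 # 120s, as suggested above
    json_options={                  # assumed field name for the json-format prompt
        "prompt": (
            "Extract all PDF file links on the page. Include every link you can "
            "find, even if hidden deep in the content. Return them in a list."
        )
    },
)
print(result)
```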
micah.stairs · 2w ago
Instead of using JSON mode for this, I would just recommend asking for formats=["links"] in the request and then filtering out the non-PDF links on your side. And yeah, setting onlyMainContent=false would be good here (since we would otherwise only return links from the main content of the page, which is sometimes incorrect for certain sites).
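For reference, a rough sketch of that links-based approach in the Python SDK (same caveats as above: placeholder URL and key, and the shape of the response varies by SDK version):

```python
# Sketch only: assumes the response exposes a "links" list, either as a dict
# key or an attribute depending on the SDK version.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")  # placeholder key

result = app.scrape_url(
    "https://example.com/reports",  # placeholder page containing PDF links
    formats=["links"],
    only_main_content=False,
)

# Deterministic filtering on your side; no LLM involved, so nothing gets dropped.
links = result["links"] if isinstance(result, dict) else getattr(result, "links", [])
pdf_links = [url for url in links if url.lower().split("?")[0].endswith(".pdf")]
print(pdf_links)
```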
