Extract all PDF URLs from webpage
Hi, guys! I've been testing the /scrape endpoint here with the json format to extract the PDF files from webpages. For pages with a small number of files it works very well, but for pages with lots of files it sometimes returns only a few and other times it returns the full set. Since the extraction is AI-generated some variability is expected, but is there a way to pass extra instructions to the AI during the API call? I'm using the Python SDK.
Hi @yagomlcarvalho, you can set only_main_content=false and timeout=120000; this should give better results with more context. Also, update the prompt to something like: "Extract all PDF file links on the page. Include every link you can find, even if hidden deep in the content. Return them in a list." This will give the LLM better context.
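A rough sketch of what that request could look like, using plain requests against the v1 /scrape endpoint. The jsonOptions field name, the response shape, and the URL/API key placeholders are assumptions based on the API docs and may differ in your SDK version:

```python
import requests

API_KEY = "fc-YOUR_API_KEY"  # hypothetical placeholder

payload = {
    "url": "https://example.com/reports",  # hypothetical target page
    "formats": ["json"],                   # JSON extraction mode
    "onlyMainContent": False,              # don't restrict extraction to the main content block
    "timeout": 120000,                     # give slow, link-heavy pages time to load
    "jsonOptions": {                       # field name is an assumption; check your API/SDK version
        "prompt": (
            "Extract all PDF file links on the page. Include every link "
            "you can find, even if hidden deep in the content. "
            "Return them in a list."
        )
    },
}

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=180,
)
resp.raise_for_status()
print(resp.json()["data"]["json"])  # response key is an assumption
```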
Instead of using JSON mode for this, I would just recommend asking for formats=["links"] in the request and then filtering out the non-PDF links on your side. And yeah, setting onlyMainContent=false would be good here, since otherwise we would only return links from the main content of the page, which is sometimes incorrect for certain sites.
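A minimal sketch of that links-based approach, assuming the v1 /scrape response exposes the scraped links under data.links (adjust to your SDK version); the PDF check on the client side is just a simple suffix filter:

```python
import requests

API_KEY = "fc-YOUR_API_KEY"  # hypothetical placeholder

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/reports",  # hypothetical target page
        "formats": ["links"],                  # ask only for the raw link list, no LLM extraction
        "onlyMainContent": False,              # keep links outside the main content too
        "timeout": 120000,
    },
    timeout=180,
)
resp.raise_for_status()

links = resp.json()["data"]["links"]  # response key is an assumption from the API docs
# Filter for PDFs on your side; strip query strings before checking the extension.
pdf_links = [link for link in links if link.lower().split("?")[0].endswith(".pdf")]
print(pdf_links)
```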