solid-orange

Can website content crawler crawl PDFs?

I'm trying to use the Actor "Website Content Crawler" and would like to scrape documents on a website. For example: https://www.ema.europa.eu/en/medicines/human/EPAR/vargatef This URL has many PDFs. How can I get my scraper to scrape the documents?
-> I've set my crawling depth to 1
-> I've clicked "Save Files" under Output Settings
The PDFs are still not being scraped, though.

5 Replies

!!!Joefree!!! 👑•2y ago

Have you checked the default key-value store? The files should be stored there. The website is protected by CloudFront, so you might need a proxy; try using a RESIDENTIAL proxy. The website seems static, so you can select Raw HTTP client (Cheerio) in the Crawler type drop-down.
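For anyone starting the Actor from code rather than the Console, the settings mentioned above could be expressed as a run input roughly like this. It is only a sketch: the field names (`startUrls`, `maxCrawlDepth`, `saveFiles`, `crawlerType`, `proxyConfiguration`) are assumptions based on the Actor's input schema as I remember it, and `build_run_input` is a made-up helper, so check the names against the Actor's Input tab before relying on them.

```python
import json


def build_run_input(start_url: str) -> dict:
    """Build a run input mirroring the settings discussed in this thread.

    All field names are assumptions; verify them against the Actor's input schema.
    """
    return {
        "startUrls": [{"url": start_url}],
        # Depth 1: the start page plus the pages and files it links to.
        "maxCrawlDepth": 1,
        # Corresponds to "Save files" under Output settings.
        "saveFiles": True,
        # Assumed value for the "Raw HTTP client (Cheerio)" crawler type.
        "crawlerType": "cheerio",
        # Assumed shape of the proxy settings for the RESIDENTIAL group.
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }


if __name__ == "__main__":
    url = "https://www.ema.europa.eu/en/medicines/human/EPAR/vargatef"
    print(json.dumps(build_run_input(url), indent=2))
```

The resulting dictionary is what you would paste into the Actor's JSON input editor or pass as the run input when calling the Actor through the API.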

solid-orangeOP•2y ago

Hi, yes, I see them in the key-value store, which is great, thank you. How can I extract the text from the PDFs, in the same way that the text from websites is being extracted? I've got the text from websites feeding into Qdrant, but I'm wondering if there is a way to extract text from the PDFs and also feed it into Qdrant with the integration I currently have set up.

!!!Joefree!!! 👑•2y ago

You need another Actor to extract the PDF content. Pick one here: https://apify.com/store?search=PDF
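Whichever extractor you pick, it first needs the saved files. Below is a minimal sketch of pulling the PDFs back out of the key-value store through the Apify API using only the Python standard library. The store ID and token are placeholders you supply, the helper names are made up for illustration, and the listing call reads only the first page of keys, so a large store would need pagination. Files are recognized by the `%PDF-` signature at the start of the data rather than by key name, since the keys the crawler assigns may not end in `.pdf`.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def keys_url(store_id: str) -> str:
    """URL that lists the keys of a key-value store."""
    return f"{API_BASE}/key-value-stores/{urllib.parse.quote(store_id, safe='')}/keys"


def record_url(store_id: str, key: str) -> str:
    """URL of a single record; the key is percent-encoded."""
    store = urllib.parse.quote(store_id, safe="")
    return f"{API_BASE}/key-value-stores/{store}/records/{urllib.parse.quote(key, safe='')}"


def looks_like_pdf(data: bytes) -> bool:
    """PDF files carry a '%PDF-' signature near the start of the file."""
    return b"%PDF-" in data[:1024]


def _get(url: str, token: str) -> bytes:
    """GET a URL with the API token sent as a bearer header."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_pdfs(store_id: str, token: str) -> dict:
    """Return {key: bytes} for every record on the first page that looks like a PDF."""
    listing = json.loads(_get(keys_url(store_id), token))
    pdfs = {}
    for item in listing["data"]["items"]:
        data = _get(record_url(store_id, item["key"]), token)
        if looks_like_pdf(data):
            pdfs[item["key"]] = data
    return pdfs
```

Calling `download_pdfs("<store id>", "<API token>")` gives you the raw PDF bytes per key, which you can then pass to a PDF extraction Actor or a local PDF library before sending the text on to Qdrant.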


solid-orangeOP•2y ago

I see, thank you

MEE6•2y ago

@Abdul just advanced to level 1! Thanks for your contributions! 🎉
