Crawlee & Apify • 2y ago
solid-orange

Can website content crawler crawl PDFs?

I'm trying to use the Actor "Website Content Crawler" and would like to scrape documents on a website. For example: https://www.ema.europa.eu/en/medicines/human/EPAR/vargatef
This URL links to many PDFs. How can I get my scraper to scrape the documents?
-> I've set my crawling depth to 1
-> I've clicked "Save Files" under Output Settings
The PDFs are still not being scraped, though.
5 Replies
!!!Joefree!!! 👑
Have you checked the default key-value store? The files should be stored there. The website is protected by CloudFront, so you might need a proxy; try a RESIDENTIAL proxy. The website seems static, so you can select Raw HTTP client (Cheerio) in the Crawler type dropdown.
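For anyone who prefers to run this through the API rather than the Console, here is a minimal Python sketch of the settings above. The input field names (`startUrls`, `maxCrawlDepth`, `saveFiles`, `crawlerType`, `proxyConfiguration`) are assumptions based on the Console options mentioned in this thread, so check them against the Actor's input schema. The `.pdf` key suffix filter and the dict-shaped client responses are assumptions too.

```python
from typing import Any

ACTOR_ID = "apify/website-content-crawler"


def build_run_input(start_url: str) -> dict[str, Any]:
    """Build an Actor input mirroring the settings discussed in this thread.

    Field names are assumed; verify them in the Actor's input schema.
    """
    return {
        "startUrls": [{"url": start_url}],
        "maxCrawlDepth": 1,          # follow links one level deep
        "saveFiles": True,           # "Save Files" under Output Settings
        "crawlerType": "cheerio",    # Raw HTTP client (Cheerio)
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }


def pdf_keys(keys: list[str]) -> list[str]:
    """Keep only keys that look like PDFs (assumes a .pdf suffix in the key)."""
    return [k for k in keys if k.lower().endswith(".pdf")]


def list_saved_pdfs(token: str, start_url: str) -> list[str]:
    """Run the Actor and list PDF keys in the run's default key-value store."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(start_url))
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    keys = [item["key"] for item in store.list_keys()["items"]]
    return pdf_keys(keys)
```

Calling `list_saved_pdfs(token, url)` with your own API token should return the keys of the saved PDFs, provided the files were stored under keys ending in `.pdf`.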
solid-orange (OP) • 2y ago
Hi, yes, I see them in the key-value store, which is great, thank you. How can I extract the text from the PDFs, in the same way that the text from websites is being extracted? I already have the text from websites feeding into Qdrant, but I'm wondering if there is a way to extract text from the PDFs and feed that into Qdrant too, using the integration I currently have set up.
!!!Joefree!!! 👑
You need another Actor to extract the PDF content. Pick one here: https://apify.com/store?search=PDF
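If you would rather not chain a second Actor, one alternative (not what is suggested above, just a rough sketch) is to download the PDF records yourself and extract the text locally with the `pypdf` library. The chunk size and overlap here are arbitrary illustrative values, and `chunk_text` is a hypothetical helper, not part of any Apify or Qdrant API.

```python
import io


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract plain text from raw PDF bytes, page by page."""
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader(io.BytesIO(data))
    # extract_text() can yield nothing for scanned pages, hence the fallback
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

The bytes for each PDF would come from the run's default key-value store. The resulting chunks can then be embedded and sent to Qdrant the same way as the website text. Scanned PDFs with no text layer would need OCR, which this sketch does not cover.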
solid-orange (OP) • 2y ago
I see, thank you.
