solid-orange
Can Website Content Crawler crawl PDFs?
I'm trying to use the Actor "Website Content Crawler" and would like to scrape documents on a website. For example: https://www.ema.europa.eu/en/medicines/human/EPAR/vargatef
This URL has many PDFs
How can I get my scraper to scrape the documents?
-> I've set my crawling depth to 1
-> I've clicked "Save Files" under Output Settings
The PDFs are still not being scraped, though.
5 Replies
Have you checked the default key-value store? The files should be stored there.
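If you'd rather check programmatically than in the Console, here's a minimal sketch with the Python apify-client; the token and store ID are placeholders for your own values:

```python
# pip install apify-client
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token
# Use the "defaultKeyValueStoreId" shown on your run's detail page.
store = client.key_value_store("YOUR_RUN_DEFAULT_KV_STORE_ID")

# With "Save Files" enabled, crawled files (PDFs included) should show up as keys here.
for item in store.list_keys()["items"]:
    print(item["key"])
```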
The website is protected by CloudFront, so you might need a proxy; try using a RESIDENTIAL proxy. The website seems static, so you can select "Raw HTTP Client (Cheerio)" in the Crawler Type drop-down.
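Put together, the run input would look roughly like this when started from code. This is only a sketch: the field names follow the Actor's input schema as I remember it, so double-check them against the Actor's Input tab.

```python
# pip install apify-client
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://www.ema.europa.eu/en/medicines/human/EPAR/vargatef"}],
        "crawlerType": "cheerio",   # "Raw HTTP Client (Cheerio)" in the UI
        "maxCrawlDepth": 1,
        "saveFiles": True,          # "Save Files" under Output Settings
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    },
)

# Saved files (including the PDFs) end up in the run's default key-value store.
print(run["defaultKeyValueStoreId"])
```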
solid-orange OP • 2y ago
Hi, yes, I see them in the key-value store, which is great, thank you.
How can I extract the text from the PDFs in the same way that the text from websites is being extracted?
I've got the text from websites feeding into Qdrant, but I'm wondering whether there is a way to extract text from the PDFs and feed it into Qdrant as well with the current integration I have set up.
You need another Actor to extract the PDF content. Pick one here: https://apify.com/store?search=PDF
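If you'd rather skip a second Actor and do the extraction in your own pipeline, a rough sketch could look like the following. pypdf, the embed() helper, the store ID, and the Qdrant collection name are all assumptions for illustration, not part of the existing integration:

```python
# pip install apify-client pypdf qdrant-client
from io import BytesIO

from apify_client import ApifyClient
from pypdf import PdfReader
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

apify = ApifyClient("YOUR_APIFY_TOKEN")                        # placeholder token
store = apify.key_value_store("YOUR_RUN_DEFAULT_KV_STORE_ID")  # placeholder store ID
qdrant = QdrantClient(url="http://localhost:6333")             # placeholder endpoint


def embed(text: str) -> list[float]:
    """Hypothetical helper: plug in whatever embedding model your Qdrant setup uses."""
    raise NotImplementedError


points = []
for i, item in enumerate(store.list_keys()["items"]):
    # The crawler's key naming may differ -- adjust this filter to match your store.
    if not item["key"].lower().endswith(".pdf"):
        continue
    record = store.get_record(item["key"])        # record["value"] holds the raw bytes
    reader = PdfReader(BytesIO(record["value"]))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    points.append(
        PointStruct(id=i, vector=embed(text), payload={"source": item["key"], "text": text})
    )

# Assumes the collection already exists with a vector size matching embed()'s output.
qdrant.upsert(collection_name="documents", points=points)  # placeholder collection name
```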
solid-orange OP • 2y ago
I see, thank you