General way to scrape blogs, articles and content?

Hi all! Is there a general way to scrape blogs of various types? I want to create a program that:

* takes in a list of top-level blog URLs, e.g. ["http://paulgraham.com/index.html", "https://www.vitalik.ca", "https://medium.com/@FEhrsam", "https://openai.com/blog/"]
* extracts a list of post/article URLs for each blog
* for each blog post/article, extracts common information, e.g. {title, author, date, contentBody, photoURLs=[]}
* also extracts the URLs of any photos contained within the article/blog body (but not icons, ads, etc.)
* ignores irrelevant pages (e.g. "Contact", "requires login", etc.) and keeps just articles and blog posts

Does Crawlee, Apify, Scrapy, or any other (free or paid) program do this? Thank you!
1 Reply
vicious-gold, 3y ago
Apify
Scrape and download articles and news · Apify
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.
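
For anyone who would rather roll their own, here is a rough sketch of the pipeline described in the question, using only the Python standard library. It is not how Smart Article Extractor works internally; the `Article` fields mirror the question's {title, author, date, contentBody, photoURLs}, and everything else (`SKIP_WORDS`, `IMAGE_SKIP`, `article_links`, `parse_article`, the reliance on `og:title` / `author` / `article:published_time` meta tags) is an assumption made for illustration. Real blogs vary a lot, so expect to tune the heuristics per site.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Hypothetical heuristics; tune these per site.
SKIP_WORDS = ("contact", "login", "signin", "privacy")
IMAGE_SKIP = ("favicon", "logo", "/ads/")


@dataclass
class Article:
    url: str
    title: str = ""
    author: str = ""
    date: str = ""
    content_body: str = ""
    photo_urls: list = field(default_factory=list)


class _PageParser(HTMLParser):
    """Collects links, <title>, meta tags, <p> text and <img> sources."""

    def __init__(self):
        super().__init__()
        self.links, self.images, self.paragraphs = [], [], []
        self.meta = {}
        self.title = ""
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self.links.append(a["href"])
        elif tag == "img" and a.get("src"):
            self.images.append(a["src"])
        elif tag == "meta":
            key = a.get("name") or a.get("property")
            if key and a.get("content"):
                self.meta[key.lower()] = a["content"]
        elif tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def fetch(url):
    """Download a page as text (no retries, robots.txt handling, or rate limiting)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def article_links(index_url, html):
    """Return same-host links from an index page, minus obvious non-articles."""
    parser = _PageParser()
    parser.feed(html)
    host = urlparse(index_url).netloc
    found = []
    for href in parser.links:
        url = urldefrag(urljoin(index_url, href)).url
        parts = urlparse(url)
        if parts.netloc != host or url == index_url or url in found:
            continue
        if any(word in parts.path.lower() for word in SKIP_WORDS):
            continue
        found.append(url)
    return found


def parse_article(url, html):
    """Pull the common fields out of one article page."""
    parser = _PageParser()
    parser.feed(html)
    body = "\n\n".join(p.strip() for p in parser.paragraphs if p.strip())
    photos = [urljoin(url, src) for src in parser.images
              if not any(word in src.lower() for word in IMAGE_SKIP)]
    return Article(
        url=url,
        title=parser.meta.get("og:title") or parser.title.strip(),
        author=parser.meta.get("author", ""),
        date=parser.meta.get("article:published_time", ""),
        content_body=body,
        photo_urls=photos,
    )
```

Usage would be something like `[parse_article(u, fetch(u)) for u in article_links(index, fetch(index))]` for each index URL in your list. This only handles server-rendered HTML; pages that build their content with JavaScript need a browser-based crawler, which is the kind of case where Crawlee, Scrapy, or a hosted extractor is worth reaching for instead.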
