Awesome tool! How do I extract the text only?
Great tool here, excited to start testing.
I don't need the markdown. I don't want the links, image hosting URLs, etc.
How do I get just the text content out of the page?
1 Reply
Hey @steamwire_labs, there are a couple of ways you can do it. The most straightforward one is passing
pageOptions.includeHtml = true
and when you get the HTML back, just use bs4 or cheerio to extract the text (get_text() in bs4,
.text() in cheerio).
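To make the "extract the text" step concrete, here is a minimal sketch in Python. It assumes you already have the returned HTML as a string. To keep it dependency-free it uses the standard library's html.parser as a stand-in for bs4; with bs4 installed, the equivalent is roughly BeautifulSoup(html, "html.parser").get_text(). The names TextOnly and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes (href, src, ...) are ignored, so link and image URLs never reach the output.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, with chunks joined by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style></head><body>"
        '<h1>Title</h1><p>Hello <a href="https://example.com">world</a>!</p>'
        '<img src="x.png"><script>var a = 1;</script></body></html>'
    )
    print(html_to_text(sample))
```

Note that the link text ("world") is kept while the URL is dropped, which is usually what "text only" means.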
Another thing you can do is pass
pageOptions.removeTags = [ 'img', 'a' ]
to remove the HTML elements you don't want parsed to markdown.
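For reference, the two options mentioned in this reply could be assembled like this in Python. This is only a sketch of the options object: the helper name build_page_options is made up, and how the result is attached to your request (field placement, client call) is an assumption to check against the docs for the version you're using.

```python
from typing import Any, Dict, List, Optional


def build_page_options(
    include_html: bool = False,
    remove_tags: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a pageOptions dict using the two option names from this thread."""
    options: Dict[str, Any] = {}
    if include_html:
        # Ask for the HTML alongside the markdown, for local text extraction.
        options["includeHtml"] = True
    if remove_tags:
        # Elements to drop before the page is converted to markdown.
        options["removeTags"] = list(remove_tags)
    return options


if __name__ == "__main__":
    # Assumed placement: sent as the "pageOptions" field of the scrape request.
    print({"pageOptions": build_page_options(remove_tags=["img", "a"])})
```

One thing worth checking: if removing 'a' drops the whole element rather than just the link markup, the link text goes with it, so the includeHtml route may be the safer one when you want to keep that text.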