I'm having problems working with PDFs in a JS project
so I'm working on this project and I need to accurately convert PDFs to something with structure (I chose Markdown for now) so I can analyze them. The resulting text has been very inaccurate: page numbers get mixed into the content, and headings get repeated, like h2 h2 h2
has anyone worked with PDFs before? is there a recommended way to make sure the PDF structure is kept?
I used this lib: https://github.com/opengovsg/pdf2md
7 Replies
the thing can't do table of contents smh

are you sure your PDF input is sensible? Have you looked at the contents?
cause some PDFs are hogwild
(and by contents I mean like... ghostscript or whatever the fuck they use, it can be any of a dozen things iirc)
yeah I tried with many pdf files
it doesn't understand headings and gives me two <h1>s for no reason

I must be reaching the dead end where Imma have to add some LLM into the project.
it's likely just looking at the internal structure of the PDF and translating that directly
PDF and HTML aren't necessarily compatible, as PDF is intended to produce the exact same output on any PC it's used on, while HTML is supposed to adapt to its display medium
you could do a post processing step where you merge certain duplicated adjacent tags
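A rough sketch of what that post-processing step could look like, assuming the converter hands you a Markdown string with ATX-style headings (`#`, `##`, ...). `mergeAdjacentHeadings` is a made-up helper name, not something pdf2md provides, and it only merges same-level headings that have no body text between them:

```javascript
// Hypothetical helper (not part of pdf2md): collapse runs of adjacent
// same-level headings into one heading, joining their text with a space.
function mergeAdjacentHeadings(markdown) {
  const headingPattern = /^(#{1,6})\s+(.*\S)\s*$/;
  const out = [];
  // Level and output index of the latest heading that can still absorb
  // a following heading; reset as soon as body text appears.
  let last = null;

  for (const line of markdown.split("\n")) {
    const match = headingPattern.exec(line);
    if (match) {
      const level = match[1].length;
      const text = match[2];
      if (last && last.level === level) {
        // Same level, nothing but blank lines in between: merge.
        out[last.index] += " " + text;
      } else {
        out.push(`${match[1]} ${text}`);
        last = { level, index: out.length - 1 };
      }
      continue;
    }
    // Any non-blank, non-heading line ends the current run of headings.
    if (line.trim() !== "") last = null;
    out.push(line);
  }
  return out.join("\n");
}
```

For example, `mergeAdjacentHeadings("## Chapter\n## One\nbody text")` gives `"## Chapter One\nbody text"`. Whether merging is right (versus dropping the duplicate) depends on whether the repeated headings are one heading split across lines or the same heading emitted twice, so it's worth checking a few outputs first.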
yeah I could add an extra step to clean things up
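The same cleanup pass could also handle the stray page numbers mentioned at the top. This is only a heuristic sketch, assuming page numbers end up on their own line as a bare number or as something like "Page 3 of 10"; `stripPageNumberLines` is a hypothetical name, and it will also drop any legitimate line that consists solely of a number:

```javascript
// Hypothetical helper: drop lines that look like standalone page numbers,
// e.g. "12", "Page 3", "3 / 10", "page 3 of 10".
function stripPageNumberLines(markdown) {
  const pageNumberPattern =
    /^\s*(?:page\s+)?\d{1,4}(?:\s*(?:\/|of)\s*\d{1,4})?\s*$/i;
  return markdown
    .split("\n")
    .filter((line) => !pageNumberPattern.test(line))
    .join("\n");
}
```

Lines that merely contain a number, such as "In 2020 we saw", are left alone because the pattern has to match the whole line.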
problem is PDFs are still very unpredictable, so not all results are gonna be 100% correct
yup