I have problems using PDF in a JS project

I'm working on a project where I need to accurately convert PDFs into something structured (I chose Markdown for now) so I can analyze them. The resulting text has been very inaccurate: page numbers end up in the output, and headings get repeated several times in a row, like h2 h2 h2. This is the code I'm using:
    async function extractPdfMarkdown(file: File): Promise<string> {
      try {
        // Read the uploaded file into the byte array that pdf2md is given here.
        const arrayBuffer = await file.arrayBuffer();
        const uint8Array = new Uint8Array(arrayBuffer);
        const markdown = await pdf2md(uint8Array);
        return markdown;
      } catch (e) {
        // Log for debugging, then rethrow so the caller can handle the failure.
        console.error("PDF to Markdown extraction error:", e);
        throw e;
      }
    }
Has anyone worked with PDFs before? Is there a recommended way to make sure the PDF's structure is kept? I used this library: https://github.com/opengovsg/pdf2md
7 Replies
Khoa (OP) • 6mo ago
The library can't handle a table of contents, smh 😭
Jochem • 6mo ago
Are you sure your PDF input is sensible? Have you looked at the contents? Some PDFs are hog wild. (By contents I mean the internals, like Ghostscript or whatever the fuck they use; it can be any of a dozen things, iirc.)
Khoa (OP) • 6mo ago
Yeah, I tried many PDF files. It doesn't understand headings and gives me two <h1>s for no reason.
Khoa (OP) • 6mo ago
I must be reaching the dead end where I'm gonna have to add an LLM to the project. 😭
Jochem • 6mo ago
It's likely just looking at the internal structure of the PDF and translating that directly. PDF and HTML aren't necessarily compatible: PDF is intended to produce the exact same output on any PC it's used on, while HTML is supposed to adapt to its display medium. You could do a post-processing step where you merge certain duplicated adjacent tags.
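A minimal sketch of that kind of post-processing step, assuming the converter's output is plain Markdown with `#`-style headings. `mergeDuplicateHeadings` is a hypothetical helper, not part of pdf2md, and it only treats headings as duplicates when both level and text match and nothing but blank lines sits between them:

```typescript
// Hypothetical helper (not part of pdf2md): collapses runs of adjacent
// headings that share the same level and text into a single heading.
function mergeDuplicateHeadings(markdown: string): string {
  // Matches ATX-style headings such as "## Title".
  const headingPattern = /^(#{1,6})\s+(.*\S)\s*$/;
  const output: string[] = [];
  let lastHeadingKey: string | null = null;

  for (const line of markdown.split("\n")) {
    const match = headingPattern.exec(line);
    if (match) {
      // Key on level + normalized text so "# A" and "## A" stay distinct.
      const key = `${match[1].length}:${match[2].trim().toLowerCase()}`;
      if (key === lastHeadingKey) {
        continue; // Same heading repeated right after itself: skip it.
      }
      lastHeadingKey = key;
      output.push(line);
    } else {
      // Blank lines do not break a run of duplicates; any other text does.
      if (line.trim() !== "") {
        lastHeadingKey = null;
      }
      output.push(line);
    }
  }
  return output.join("\n");
}
```

Running the converter's output through this before analysis would collapse a run like `## Part`, `## Part`, `## Part` into one heading. It won't help when the duplicates differ slightly in text or level.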
Khoa (OP) • 6mo ago
Yeah, I could add an extra step to clean things up. The problem is that PDFs are still very unpredictable, so not every result is going to be 100% correct.
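For the stray page numbers, a cleanup step might look roughly like this. It's a sketch under the assumption that page numbers land on their own line as a bare number, "Page 3 of 10", or "4 / 10"; `stripPageNumberLines` is a hypothetical helper, and the heuristic will also drop any legitimate line that happens to be just a number:

```typescript
// Hypothetical helper: removes lines that look like standalone page numbers.
// Assumed formats: "12", "Page 3", "Page 3 of 10", "4 / 10".
function stripPageNumberLines(markdown: string): string {
  const pageNumberPattern =
    /^\s*(?:page\s+)?\d{1,4}(?:\s*(?:of|\/)\s*\d{1,4})?\s*$/i;
  return markdown
    .split("\n")
    // Keep every line that is not a bare page-number line.
    .filter((line) => !pageNumberPattern.test(line))
    .join("\n");
}
```

Numbers inside a sentence are left alone, since the pattern has to match the whole line.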
Jochem • 6mo ago
yup
