Kevin Powell - Community · 7mo ago
12 replies
Khoa

I have problems using PDF in a JS project

So I'm working on a project where I need to accurately convert a PDF to something with structure (I chose Markdown for now) so I can analyze it. The resulting text has been very inaccurate: page numbers end up mixed into the content, and headings get repeated multiple times, like h2 h2 h2.

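import pdf2md from "@opengovsg/pdf2md"; // assumed package name for the library linked below

// Reads an uploaded PDF file and converts it to Markdown with pdf2md.
async function extractPdfMarkdown(file: File): Promise<string> {
    try {
        const arrayBuffer = await file.arrayBuffer();
        const uint8Array = new Uint8Array(arrayBuffer);
        const markdown = await pdf2md(uint8Array);
        return markdown;
    } catch (e) {
        console.error("PDF to Markdown extraction error:", e);
        throw e;
    }
}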
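
To make the two symptoms concrete, here is a minimal sketch of a post-processing pass over the converted text. It assumes that stray page numbers show up as lines containing only digits, and that the repeated headings show up as identical heading lines with nothing but blank lines between them. `cleanMarkdown` is a hypothetical helper, not part of pdf2md, and real PDFs may need different rules.

```typescript
// Hypothetical helper, not part of pdf2md: filters two common extraction artifacts.
function cleanMarkdown(markdown: string): string {
    const cleaned: string[] = [];
    let lastHeading: string | null = null;

    for (const line of markdown.split("\n")) {
        const trimmed = line.trim();

        // Assumption: a line made up only of digits is a stray page number.
        if (/^\d+$/.test(trimmed)) continue;

        if (/^#{1,6}\s/.test(trimmed)) {
            // Drop a heading that repeats the previous heading
            // when no real content appeared in between.
            if (trimmed === lastHeading) continue;
            lastHeading = trimmed;
        } else if (trimmed !== "") {
            // Real content ends the current run of repeated headings.
            lastHeading = null;
        }

        cleaned.push(line);
    }

    return cleaned.join("\n");
}
```

This would be called on the string returned by `extractPdfMarkdown`. It only hides the symptoms after the fact and would also drop legitimate digit-only lines, so it does not answer the question of how to keep the PDF structure in the first place.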


Has anyone worked with PDFs before? Is there a recommended way to make sure the PDF structure is kept?

I used this lib: https://github.com/opengovsg/pdf2md