How to ingest a folder of PDFs into a markdown knowledge base
When to use: You have 50 PDF research papers and want clean text to embed.
Prerequisites
- pnpm — npm i -g pnpm
- Install + run — git clone the repo, pnpm install, pnpm start
Flow
- Enumerate files: List every .pdf under /library/papers/. → Path list
- Convert in batch: For each, call pdf-to-markdown. Write the output to /library/markdown/<same-name>.md. → Markdown files appear
- Spot-check one: Show me the first 500 chars of /library/markdown/attention-is-all-you-need.md. → Legible markdown, no PDF artifacts
Result: A parallel /library/markdown/ tree ready to embed.
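The enumerate, convert, and spot-check steps can also be scripted. The following is a minimal Python sketch, not the tool's own implementation: `plan_batch`, `convert_all`, and `preview` are hypothetical helper names, and `convert` is a placeholder callable standing in for whatever actually invokes pdf-to-markdown, since its programmatic interface is not described here. The sketch assumes the search is recursive and that output files mirror the source layout with a `.md` suffix.

```python
from pathlib import Path
from typing import Callable, Iterator


def plan_batch(src_root: Path, dst_root: Path) -> Iterator[tuple[Path, Path]]:
    """Yield (pdf_path, markdown_path) pairs, mirroring the source tree."""
    # rglob is recursive; the "*.pdf" pattern is case-sensitive on most
    # Linux filesystems, so "PAPER.PDF" would be skipped there.
    for pdf in sorted(src_root.rglob("*.pdf")):
        rel = pdf.relative_to(src_root)
        yield pdf, dst_root / rel.with_suffix(".md")


def convert_all(
    src_root: Path,
    dst_root: Path,
    convert: Callable[[Path], str],  # hypothetical: wraps the real converter
) -> list[Path]:
    """Convert every PDF under src_root and return the markdown paths written."""
    written = []
    for pdf, md in plan_batch(src_root, dst_root):
        md.parent.mkdir(parents=True, exist_ok=True)
        md.write_text(convert(pdf), encoding="utf-8")
        written.append(md)
    return written


def preview(md_path: Path, n: int = 500) -> str:
    """Return the first n characters of a markdown file for a spot check."""
    return md_path.read_text(encoding="utf-8")[:n]
```

With a real `convert` function in hand, `convert_all(Path("/library/papers"), Path("/library/markdown"), convert)` would produce the parallel tree, and `preview(...)` reproduces the 500-character spot check.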
Pitfalls
- Scanned PDFs (image-only) produce empty markdown: run OCR on scanned PDFs before conversion
- Math-heavy PDFs lose equations: for math, try running Mathpix or the Nougat OCR model as a separate pass
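Because scanned PDFs yield empty markdown, a quick pass over the output tree can surface the files that need OCR first. This is a rough sketch under stated assumptions: `find_suspect_outputs` is a hypothetical helper, and the 200-character threshold is an arbitrary starting point rather than a recommended value.

```python
from pathlib import Path


def find_suspect_outputs(md_root: Path, min_chars: int = 200) -> list[Path]:
    """Return markdown files whose non-whitespace text is shorter than min_chars."""
    suspects = []
    for md in sorted(md_root.rglob("*.md")):
        text = md.read_text(encoding="utf-8", errors="replace")
        # Ignore whitespace so a file of blank lines still counts as empty.
        if len("".join(text.split())) < min_chars:
            suspects.append(md)
    return suspects
```

Anything this returns is a candidate for OCR and a second conversion run. It will not catch lost equations, which still need a manual look.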