This project contains iText 9.5 examples and two main entrypoints:
- com.itextpdf.demo.ComplexTaggedPdfMain (creates a complex, tagged PDF 2.0 report)
- com.itextpdf.demo.TaggedPdfToMarkdownMain (converts a PDF to Markdown using the structure tree)
TaggedPdfToMarkdownMain reads a PDF and tries to build Markdown from the PDF structure tree (tagged content). If tags are missing or empty, it falls back to plain text extraction.
The converter prepends YAML front matter for dataset pipelines:
- source_file
- page_count
- extraction_mode (tagged_structure_tree or plain_text_fallback)
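
As a rough illustration of how a dataset pipeline might consume these fields, the sketch below splits the front matter from the Markdown body using only the standard library. It assumes the block is delimited by `---` lines and holds flat `key: value` pairs. The converter's exact layout is not shown here, so treat `split_front_matter` and the sample values as hypothetical.

```python
from typing import Dict, Tuple


def split_front_matter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a Markdown string into (front-matter fields, body).

    Assumption: the front matter is a leading block delimited by '---'
    lines and contains only flat 'key: value' pairs.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no front matter present

    fields: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the document body.
            return fields, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return {}, markdown  # opening delimiter without a closing one


# Illustrative sample only; the values are made up for this sketch.
sample = """---
source_file: complex-tagged-report.pdf
page_count: 3
extraction_mode: tagged_structure_tree
---
# Report title
"""

fields, body = split_front_matter(sample)
# A pipeline could use this flag to filter or down-weight fallback output.
used_fallback = fields.get("extraction_mode") == "plain_text_fallback"
```

All values come back as strings, so a pipeline that needs page_count as a number would convert it explicitly. A full YAML parser would be the safer choice if the front matter ever grows beyond flat pairs.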
Build the project and run the converter (PowerShell syntax):

    mvn clean compile
    mvn -q dependency:build-classpath "-Dmdep.outputFile=target\cp.txt"
    $cp = Get-Content "target\cp.txt"
    java -cp "target\classes;$cp" com.itextpdf.demo.TaggedPdfToMarkdownMain "input.pdf" "output.md"

If output.md is omitted, the converter writes next to the input PDF using the same file name with the .md extension.
Generate the sample tagged PDF:

    mvn -q dependency:build-classpath "-Dmdep.outputFile=target\cp.txt"
    $cp = Get-Content "target\cp.txt"
    java -cp "target\classes;$cp" com.itextpdf.demo.ComplexTaggedPdfMain

This creates complex-tagged-report.pdf in the project root.
Install Python dependencies:

    python -m pip install -r requirements.txt

Run the Docling extractor (defaults to complex-tagged-report.pdf):

    python scripts/docling_extract.py

Run with explicit paths:

    python scripts/docling_extract.py --input "complex-tagged-report.pdf" --output "complex-tagged-report.docling.md"

By default, the script writes <input>.docling.md next to the input PDF.
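
The default output naming can be sketched as follows. This is not the code from scripts/docling_extract.py; `default_docling_output` is a hypothetical helper. It assumes that `<input>` means the input file name without its .pdf extension, which matches the explicit-paths example above.

```python
from pathlib import Path


def default_docling_output(input_pdf: str) -> Path:
    """Return the assumed default output path for a given input PDF.

    Assumption: the .pdf extension is dropped and '.docling.md' is
    appended, keeping the output in the same directory as the input.
    """
    pdf_path = Path(input_pdf)
    return pdf_path.with_name(pdf_path.stem + ".docling.md")


output_path = default_docling_output("complex-tagged-report.pdf")
print(output_path)
```

Using `with_name` rather than string concatenation keeps the parent directory intact, so the result stays next to the input PDF even when the input is given with a directory prefix.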