Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown, structured JSON IR, and Docling output.
- Native Rust core - fast, no runtime dependencies
- Three output modes: markdown, structured JSON IR, Docling JSON
- CLI and SDK for Python, Node/Bun, and Rust
- Sheet, slide, and page selection
- Document property extraction
No install needed - run directly:
uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

brew tap thomaub/officemd git@github.com:ThomAub/officemd.git
brew install thomaub/officemd/officemd_cli

Installers generated by cargo-dist:
# macOS / Linux
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.ps1 | iex"

uv tool install officemd

Or add as a dependency:
uv add officemd

npm install office-md
# or
bun add office-md

cargo install officemd_cli

All three surfaces expose a CLI named officemd (Python, Rust) or office-md (Node/Bun).
officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx

The Rust CLI has additional subcommands:
officemd stream report.docx # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md # write to file
officemd inspect report.pdf --output-format json --pretty

| Flag | Description |
|---|---|
| --format | Force document format (docx, xlsx, csv, pptx, pdf) |
| --pages | Select pages/slides/sheets by index (e.g. "1,3-5") |
| --sheets | Select sheets by name or index (e.g. "Sales,1-2") |
| --include-document-properties | Include document metadata in output |
| --markdown-style | Output style: compact (default) or human |
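
The index syntax accepted by --pages (for example "1,3-5") can be illustrated with a small parser. This is a minimal sketch, not officemd's own implementation: the function name parse_page_selection is hypothetical, and it assumes 1-based indices with inclusive ranges, which the flag description does not state explicitly.

```python
def parse_page_selection(spec: str) -> list[int]:
    """Expand a selection such as "1,3-5" into sorted, de-duplicated indices.

    Assumption (not confirmed by the README): indices are 1-based and
    ranges are inclusive on both ends.
    """
    selected: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    if any(index < 1 for index in selected):
        raise ValueError("indices are assumed to start at 1")
    return sorted(selected)


print(parse_page_selection("1,3-5"))  # [1, 3, 4, 5]
```

Name-based selection, as in --sheets "Sales,1-2", would need an extra branch for non-numeric parts and is left out of this sketch.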
from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes
content = Path("report.docx").read_bytes()
print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))

from pathlib import Path
import officemd
content = Path("report.docx").read_bytes()
patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.ALL_TEXT,
            officemd.TextReplace("word", "term"),
        )
    ]
)
# ALL_TEXT includes document content plus free-text metadata/app/custom fields.
single = officemd.patch_docx_with_report(content, patch)
print(single.report.replacements_applied)
batch = officemd.patch_docx_batch_with_report([content, content], patch, workers=4)
for item in batch:
    print(item.report.parts_scanned, item.report.parts_modified, item.report.replacements_applied)

use officemd_core::{
    patch_docx_batch_with_report, DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace,
};
let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::AllText,
        replace: TextReplace::all("word", "term"),
    }],
};
// AllText includes document content plus free-text metadata/app/custom fields.
let results = patch_docx_batch_with_report(vec![doc1_bytes, doc2_bytes], &patch, Some(4))?;
for item in results {
    println!(
        "parts_scanned={} parts_modified={} replacements_applied={}",
        item.report.parts_scanned,
        item.report.parts_modified,
        item.report.replacements_applied
    );
}
# Ok::<(), officemd_core::PatchError>(())

Patch scopes also support free-text metadata/comment fields:
- DOCX: MetadataCore, MetadataApp, MetadataCustom, MetadataAll
- PPTX: CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll
- XLSX: Comments, CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll
AllText now means all free-text fields: content + metadata/comment-author text.
Formatting-preserving replacement is available for OOXML content text:
use officemd_core::{DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace};
let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::Body,
        replace: TextReplace::all("Confidential", "")
            .with_preserve_formatting(true),
    }],
};

Semantics:
- replacement can span multiple runs
- the first matched run keeps the replacement text and therefore its formatting wins
- later consumed runs are left empty in v1
- metadata/comment-author fields still use simple text replacement
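
The run-spanning semantics listed above can be sketched with a toy model. This is an illustration under stated assumptions, not the officemd_core implementation: Run, its style field, and replace_across_runs are hypothetical names, and a run here is just a text fragment plus a label standing in for its formatting.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    style: str  # hypothetical stand-in for run formatting, e.g. "bold"


def replace_across_runs(runs: list[Run], old: str, new: str) -> int:
    """Replace each occurrence of `old` in the concatenated run text, in place.

    The first run touched by a match receives the replacement text (so its
    formatting wins); the matched portions of later runs are removed, which
    leaves fully consumed runs empty. Returns the number of replacements.
    """
    if not old:
        raise ValueError("search text must be non-empty")
    count = 0
    search_from = 0
    while True:
        joined = "".join(run.text for run in runs)
        start = joined.find(old, search_from)
        if start == -1:
            return count
        end = start + len(old)
        offset = 0
        first = True
        for run in runs:
            run_start, run_end = offset, offset + len(run.text)
            offset = run_end  # advance using the pre-edit length
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run does not overlap the match
            before = run.text[: lo - run_start]
            after = run.text[hi - run_start:]
            run.text = before + (new if first else "") + after
            first = False
        count += 1
        search_from = start + len(new)  # skip past the inserted text


runs = [Run("Con", "bold"), Run("fiden", "italic"), Run("tial report", "plain")]
replace_across_runs(runs, "Confidential", "Internal")
print([(run.text, run.style) for run in runs])
# [('Internal', 'bold'), ('', 'italic'), (' report', 'plain')]
```

In this model the middle run survives as an empty run rather than being deleted, mirroring the "left empty in v1" behaviour described above.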
import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";
const content = readFileSync("report.docx");
console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));

There is also a browser demo for the WASM bindings in crates/officemd_wasm/README.md. It serves a small page at http://localhost:8080/crates/officemd_wasm/www/ for drag-and-drop and sample-fixture testing.

| Format | Extension | Markdown | JSON IR | Docling |
|---|---|---|---|---|
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |
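
The support matrix above can be encoded as a lookup so a caller can check an output mode before converting. This is a minimal sketch: SUPPORTED_OUTPUTS, detect_format, supports_output, and the mode labels "markdown", "json_ir", and "docling" are hypothetical names chosen for illustration and are not part of the officemd API. Only the format strings (docx, xlsx, csv, pptx, pdf) come from the --format flag.

```python
from pathlib import Path

# Output modes per format, transcribed from the table above.
# The mode labels are illustrative, not officemd identifiers.
SUPPORTED_OUTPUTS: dict[str, set[str]] = {
    "docx": {"markdown", "json_ir", "docling"},
    "xlsx": {"markdown", "json_ir", "docling"},
    "csv": {"markdown", "json_ir"},
    "pptx": {"markdown", "json_ir", "docling"},
    "pdf": {"markdown", "json_ir"},
}


def detect_format(path: str) -> str:
    """Map a file name to a format string based on its extension."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_OUTPUTS:
        raise ValueError(f"unsupported extension: {path!r}")
    return suffix


def supports_output(path: str, mode: str) -> bool:
    """Return True if the table lists `mode` for the file's format."""
    return mode in SUPPORTED_OUTPUTS[detect_format(path)]


print(detect_format("Report.DOCX"))              # docx
print(supports_output("data.csv", "docling"))    # False
print(supports_output("deck.pptx", "docling"))   # True
```

The string returned by detect_format has the same shape as the format argument used in the Python SDK example (format="docx"), so it could plausibly be passed through as-is.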
cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings

For JS and Python tests, see examples/README.md.

PDF extraction vendors pdf-inspector by Firecrawl (MIT).
PDF primitives use lopdf by J-F-Liu (MIT).

Apache 2.0 - see LICENSE.