OfficeMD

Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown, structured JSON IR, and Docling output.

  • Native Rust core - fast, no runtime dependencies
  • Three output modes: markdown, structured JSON IR, Docling JSON
  • CLI and SDK for Python, Node/Bun, and Rust
  • Sheet, slide, and page selection
  • Document property extraction

Quick Start

No install needed - run directly:

uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

Install

Homebrew (macOS / Linux)

brew tap thomaub/officemd git@github.com:ThomAub/officemd.git
brew install thomaub/officemd/officemd_cli

Shell / PowerShell (from GitHub Release)

Installers generated by cargo-dist:

# macOS / Linux
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.ps1 | iex"

Python

uv tool install officemd

Or add as a dependency:

uv add officemd

Node / Bun

npm install office-md
# or
bun add office-md

Rust

cargo install officemd_cli

CLI

All three distributions (Python, Node/Bun, Rust) expose a CLI, named officemd for Python and Rust and office-md for Node/Bun.

officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx
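
When the CLI is driven from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the officemd SDK: build_markdown_command and run_markdown are hypothetical helper names, the code uses only the markdown subcommand with the --sheets and --pages flags shown here, and it assumes the converted markdown is written to stdout.

```python
from __future__ import annotations

import subprocess


def build_markdown_command(
    path: str,
    sheets: str | None = None,
    pages: str | None = None,
    executable: str = "officemd",
) -> list[str]:
    """Build the argument list for `officemd markdown` (hypothetical helper)."""
    command = [executable, "markdown", path]
    if sheets:
        command += ["--sheets", sheets]
    if pages:
        command += ["--pages", pages]
    return command


def run_markdown(path: str, **options: str) -> str:
    """Run the CLI and return its stdout.

    Assumption: the markdown subcommand prints its result to stdout.
    """
    result = subprocess.run(
        build_markdown_command(path, **options),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with sheet names that contain spaces. For the Node/Bun package, pass executable="office-md".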

The Rust CLI has additional subcommands:

officemd stream report.docx                    # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md   # write to file
officemd inspect report.pdf --output-format json --pretty

Common options

Flag                           Description
--format                       Force document format (docx, xlsx, csv, pptx, pdf)
--pages                        Select pages/slides/sheets by index (e.g. "1,3-5")
--sheets                       Select sheets by name or index (e.g. "Sales,1-2")
--include-document-properties  Include document metadata in output
--markdown-style               Output style: compact (default) or human
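
To make the index syntax of --pages concrete, here is a small sketch of how a selection such as "1,3-5" could be expanded into individual indices. This is an illustration only, not officemd's actual parser: parse_index_selection is a hypothetical name, and the sketch assumes ranges are inclusive and that duplicates are dropped. It does not cover the sheet-name form accepted by --sheets.

```python
from __future__ import annotations


def parse_index_selection(spec: str) -> list[int]:
    """Expand a selection like "1,3-5" into a list of indices.

    Assumptions (not confirmed by the README): ranges are inclusive on both
    ends, and an index that appears twice is kept only once.
    """
    selected: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            indices = list(range(first, last + 1))
        else:
            indices = [int(part)]
        for index in indices:
            if index not in selected:
                selected.append(index)
    return selected


print(parse_index_selection("1,3-5"))
```

Non-numeric parts raise ValueError from int(), which keeps malformed input from being silently ignored.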

SDK

Python

from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes

content = Path("report.docx").read_bytes()

print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))

Python patch reports and batch patching

from pathlib import Path
import officemd

content = Path("report.docx").read_bytes()
patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.ALL_TEXT,
            officemd.TextReplace("word", "term"),
        )
    ]
)
# ALL_TEXT includes document content plus free-text metadata/app/custom fields.

single = officemd.patch_docx_with_report(content, patch)
print(single.report.replacements_applied)

batch = officemd.patch_docx_batch_with_report([content, content], patch, workers=4)
for item in batch:
    print(item.report.parts_scanned, item.report.parts_modified, item.report.replacements_applied)

Rust batch patching with reports

use officemd_core::{
    patch_docx_batch_with_report, DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace,
};

let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::AllText,
        replace: TextReplace::all("word", "term"),
    }],
};
// AllText includes document content plus free-text metadata/app/custom fields.

let results = patch_docx_batch_with_report(vec![doc1_bytes, doc2_bytes], &patch, Some(4))?;
for item in results {
    println!(
        "parts_scanned={} parts_modified={} replacements_applied={}",
        item.report.parts_scanned,
        item.report.parts_modified,
        item.report.replacements_applied
    );
}
# Ok::<(), officemd_core::PatchError>(())

Patch scopes also support free-text metadata/comment fields:

  • DOCX: MetadataCore, MetadataApp, MetadataCustom, MetadataAll
  • PPTX: CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll
  • XLSX: Comments, CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll

AllText covers all free-text fields: document content plus metadata and comment-author text.

Formatting-preserving replacement is available for OOXML content text:

use officemd_core::{DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace};

let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::Body,
        replace: TextReplace::all("Confidential", "")
            .with_preserve_formatting(true),
    }],
};

Semantics:

  • replacement can span multiple runs
  • the first matched run keeps the replacement text and therefore its formatting wins
  • later consumed runs are left empty in v1
  • metadata/comment-author fields still use simple text replacement
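
The run-level semantics listed above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the library's implementation: Run and replace_across_runs are hypothetical names, a run is modeled as a text fragment plus a style label standing in for its formatting, and matching is done on the concatenated text of all runs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Run:
    text: str
    style: str  # stand-in for the run's formatting, e.g. "bold"


def replace_across_runs(runs: list[Run], needle: str, replacement: str) -> int:
    """Replace every match of needle in place and return the match count.

    The first run touched by a match receives the replacement text, so its
    style applies to it; later runs touched by the same match only lose the
    matched characters (fully consumed runs end up empty).
    """
    if not needle:
        return 0
    count = 0
    search_from = 0
    while True:
        full_text = "".join(run.text for run in runs)
        start = full_text.find(needle, search_from)
        if start == -1:
            return count
        end = start + len(needle)
        offset = 0
        first = True
        for run in runs:
            run_start, run_end = offset, offset + len(run.text)
            offset = run_end  # computed from the text before it is modified
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run does not overlap the match
            before = run.text[: lo - run_start]
            after = run.text[hi - run_start:]
            run.text = before + (replacement if first else "") + after
            first = False
        count += 1
        search_from = start + len(replacement)  # skip past the inserted text


runs = [Run("Confi", "bold"), Run("dential", "italic"), Run(" report", "plain")]
replace_across_runs(runs, "Confidential", "Internal")
print([(run.text, run.style) for run in runs])
```

In this model the bold run ends up holding "Internal", the italic run is left empty, and the trailing plain run is untouched, which mirrors the "first matched run wins" rule.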

JavaScript

import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";

const content = readFileSync("report.docx");

console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));

WebAssembly demo

There is also a browser demo for the WASM bindings in crates/officemd_wasm/README.md. It serves a small page at http://localhost:8080/crates/officemd_wasm/www/ for drag-and-drop and sample-fixture testing.

Supported Formats

Format      Extension  Markdown  JSON IR  Docling
Word        .docx      yes       yes      yes
Excel       .xlsx      yes       yes      yes
CSV         .csv       yes       yes      -
PowerPoint  .pptx      yes       yes      yes
PDF         .pdf       yes       yes      -
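
A caller that handles mixed inputs may want to check this matrix before requesting Docling output. The lookup below is a minimal sketch that simply transcribes the table; OUTPUTS_BY_EXTENSION, supported_outputs, and the output labels "markdown", "json_ir", and "docling" are names chosen for this example and are not part of the officemd API.

```python
from __future__ import annotations

from pathlib import Path

# Transcribed from the table above; the labels are illustrative only.
ALL_OUTPUTS = frozenset({"markdown", "json_ir", "docling"})
NO_DOCLING = frozenset({"markdown", "json_ir"})

OUTPUTS_BY_EXTENSION: dict[str, frozenset[str]] = {
    ".docx": ALL_OUTPUTS,
    ".xlsx": ALL_OUTPUTS,
    ".csv": NO_DOCLING,
    ".pptx": ALL_OUTPUTS,
    ".pdf": NO_DOCLING,
}


def supported_outputs(filename: str) -> frozenset[str]:
    """Return the output modes listed for a file, based on its extension."""
    extension = Path(filename).suffix.lower()
    return OUTPUTS_BY_EXTENSION.get(extension, frozenset())


for name in ("report.docx", "data.csv", "notes.txt"):
    print(name, sorted(supported_outputs(name)))
```

Unknown extensions map to an empty set, so a caller can distinguish "unsupported file" from "supported file without Docling output".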

Development

cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings

For JS and Python tests, see examples/README.md.

Acknowledgements

PDF extraction vendors pdf-inspector by Firecrawl (MIT).

PDF primitives come from lopdf by J-F-Liu (MIT).

License

Apache 2.0 - see LICENSE.
