A Dockerized pipeline that converts entire books and large PDF documents into clean, organized Markdown files using DeepSeek-OCR for high-accuracy text extraction.
Pair it with Claude Code to get detailed visual descriptions of every figure, diagram, and clinical image in the document.
Everything runs inside Docker containers — nothing is installed on the host machine except Docker itself.
PDF ──► Split pages ──► Extract as images ──► DeepSeek-OCR (GPU) ──► Clean Markdown
│
Claude Code ──► Image Descriptions
│
Final Markdown
- Split — Extract a page range (or whole book) from the source PDF using
pypdf - Extract — Convert each page to a high-resolution PNG image using
poppler - OCR — Run DeepSeek-OCR on each page image to extract structured Markdown (text, tables, figure captions, math notation)
- Cleanup — Strip OCR annotation tags and combine all pages into a single organized
.mdfile - Image Descriptions (optional, via Claude Code) — Visually read every figure and generate detailed
[IMAGE DESCRIPTION]blocks
- Convert entire textbooks (500-2000+ pages) into searchable, editable Markdown
- Extract specific chapters or page ranges from large reference books
- Digitize medical, scientific, or technical texts with complex layouts
- Create accessible versions of image-heavy documents with figure descriptions
- Build study materials or knowledge bases from PDF sources
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA GPU, 8GB VRAM | NVIDIA RTX 3090/4090, 24GB VRAM |
| RAM | 16GB | 32GB+ |
| Disk | 10GB free | 50GB+ (for large PDFs) |
| CPU | 4 cores | 8+ cores |
| Dependency | Version | Notes |
|---|---|---|
| Docker | 20.10+ | With NVIDIA Container Toolkit |
| NVIDIA Driver | 525+ | CUDA 12.x compatible |
| nvidia-docker | 2.x | nvidia-container-toolkit package |
| Python 3 | 3.8+ | Only needed for cleanup scripts (runs on host) |
# Ubuntu/Debian
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Verify
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi# Clone the repo
git clone https://github.com/0aub/pdf2md.git
cd pdf2md
# Convert an entire book
./run.sh "/path/to/book.pdf" 1 500 --title "My Book"
# Or a specific page range
./run.sh "/mnt/d/Books/Campbell_Urology.pdf" 622 1402 --title "Campbell-Walsh-Wein Urology"
# Output will be at ./output/result.mdpdf2md/
├── run.sh # Main pipeline script
├── CLAUDE.md # Claude Code project instructions
├── docker/
│ ├── Dockerfile.split # PDF splitting (pypdf)
│ ├── Dockerfile.extract # Page-to-image extraction (poppler)
│ └── Dockerfile.ocr # DeepSeek-OCR (CUDA + flash-attn)
├── scripts/
│ ├── split_pdf.py # Split PDF into page range
│ ├── extract_pages.py # Convert pages to PNG images
│ ├── ocr_batch.py # Run DeepSeek-OCR on page images
│ ├── cleanup_ocr.py # Clean and combine OCR output
│ └── merge_descriptions.py # Merge image descriptions into MD
├── pages/ # (generated) Page images
├── ocr_results/ # (generated) Per-page OCR output
├── image_descriptions/ # (generated) Figure descriptions
└── output/ # (generated) Final markdown files
If you prefer running each step individually:
docker build -t pdf2md-split -f docker/Dockerfile.split .
docker run --rm \
-v "/path/to/pdfs:/data:ro" \
-v "$(pwd)/output:/output" \
pdf2md-split /data/book.pdf /output/subset.pdf 1 500docker build -t pdf2md-extract -f docker/Dockerfile.extract .
# Process in batches of 50
for start in $(seq 1 50 500); do
end=$((start + 49))
docker run --rm \
-v "$(pwd)/output:/data:ro" \
-v "$(pwd)/pages:/output/pages" \
-e PDF_INPUT="/data/subset.pdf" \
pdf2md-extract $start $end
donedocker build -t pdf2md-ocr -f docker/Dockerfile.ocr .
docker run --rm --gpus all --init \
-v "$(pwd)/pages:/data/pages:ro" \
-v "$(pwd)/ocr_results:/output/ocr_results" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
pdf2md-ocr 1 500python3 scripts/cleanup_ocr.py ./ocr_results ./output/result.md --page-offset 0 --title "My Book"The OCR pipeline captures all text, tables, and figure captions — but for figures, diagrams, and clinical images, you need a vision model to describe what the image actually shows. Claude Code can do this by reading the page images and writing detailed descriptions.
- Claude Code installed (
npm install -g @anthropic-ai/claude-code) - OCR pipeline completed (Steps 1-4 above)
- Page images available in the
pages/directory
- Claude Code reads each page image that contains a figure
- It writes a detailed visual description (colors, panels, labeled structures, staining, imaging modality, etc.)
- Descriptions are saved as batch files in
image_descriptions/ - A merge script inserts them into the final markdown before each figure caption
cd /path/to/pdf2md
claudeCopy and paste this prompt into Claude Code:
I have a PDF that was converted to markdown using DeepSeek-OCR. The text is in
output/result.md and the page images are in pages/page_NNNN.png.
I need you to add detailed visual descriptions for every figure in the document.
For each page that has a figure:
1. Read the page image
2. Write a detailed [IMAGE DESCRIPTION] block describing what you see
3. Save descriptions in batches to image_descriptions/batch_NN.md
Use this format for each figure:
<!-- Page image: page_NNNN.png -->
### Figure [number]
**[IMAGE DESCRIPTION]:** [Detailed description - colors, panels A/B/C, labeled
anatomical structures, arrows, scale bars, staining techniques, imaging modality]
Process all figure pages in parallel batches of 10 using agents for speed.
After all batches are done, run:
python3 scripts/merge_descriptions.py output/result.md image_descriptions output/result_final.md
- Scan the markdown for figure references (
Fig. X.Y.) - Identify which page images contain figures
- Launch parallel agents to read images and write descriptions
- Merge everything into the final output
Your final file at output/result_final.md will contain:
- Full text from every page (via DeepSeek-OCR)
- Detailed
**[IMAGE DESCRIPTION]**blocks before every figure caption - Tables in markdown format
- Mathematical notation in LaTeX
- All citations and references preserved
<!-- Page 623 (file page 2) -->
### Figure 32.1
**[IMAGE DESCRIPTION]:** A composite photograph showing whole mount images of
developing human embryos from the Carnegie collection, arranged against a black
background. The specimens are numbered sequentially from 10 to 23, corresponding
to Carnegie Stages with approximate postovulatory days noted beneath each specimen.
The embryos increase dramatically in size and morphologic complexity across the
stages, progressing from small, curved, translucent forms at Stage 10 to
well-defined fetal forms with visible limbs and head structures at Stage 23.
A scale bar of 10 mm is provided at the bottom left.
Fig. 32.1. Whole mount photos of developing human embryos from the Carnegie
collection. Note increasing size and morphologic complexity with developmental
stage. (Image from Dr. Brad Smith, University of Michigan...)For a 781-page medical textbook with 627 figures:
| Step | Time | Method |
|---|---|---|
| Identify figure pages | ~5 seconds | grep + Python |
| Generate descriptions | ~30 minutes | ~40 parallel agents, 10 pages each |
| Merge into final MD | ~2 seconds | Python script |
| Property | Value |
|---|---|
| Model | deepseek-ai/DeepSeek-OCR |
| Parameters | 3B |
| License | MIT |
| Precision | BF16 |
| VRAM Usage | ~8GB (inference) |
| Speed | ~2 pages/minute (RTX 4090) |
| Attention | Flash Attention 2 |
The pipeline uses the "Gundam" configuration for the best balance of quality and speed:
| Setting | Value |
|---|---|
base_size |
1024 |
image_size |
640 |
crop_mode |
True |
save_results |
True |
- Body text with full paragraph structure
- Section headings and subheadings
- Tables (converted to HTML
<table>format) - Figure/image captions with source citations
- Mathematical notation (LaTeX format)
- References and bibliography
- Bulleted and numbered lists
- Multi-column layouts
- Key points boxes and clinical summaries
Benchmarked on an NVIDIA RTX 4090 (24GB VRAM):
| Stage | Speed | Notes |
|---|---|---|
| PDF Split | Instant | ~1 second for any range |
| Image Extraction | ~50 pages/min | 200 DPI PNG output |
| DeepSeek-OCR | ~2 pages/min | GPU-bound, Gundam config |
| Cleanup | Instant | Python text processing |
| Pages | OCR Time | With Image Descriptions |
|---|---|---|
| 100 | ~50 min | +15 min (Claude Code) |
| 500 | ~4 hours | +30 min (Claude Code) |
| 1000 | ~8 hours | +45 min (Claude Code) |
| 2000 | ~17 hours | +1 hour (Claude Code) |
The OCR is the bottleneck. Image extraction and cleanup are near-instant. Claude Code image descriptions are fast because they run in parallel.
- Whole books: Just pass
1as start page and the total page count as end page. The pipeline handles everything. - First run downloads the model (~6GB). Subsequent runs use the cached version at
~/.cache/huggingface/. - Crash recovery: The OCR skips existing results (files >100 bytes). If it crashes, just re-run and it picks up where it left off.
- Reduce DPI to 150 if you're running low on disk space (the pages directory can get large).
- Monitor GPU usage with
nvidia-smiorwatch -n1 nvidia-smiin a separate terminal. - For WSL2 users: PDF files on Windows drives are accessible at
/mnt/c/...or/mnt/d/.... - Large books: For 1000+ page books, consider running the OCR overnight. It's fully resumable.
- Requires NVIDIA GPU — no CPU-only mode (DeepSeek-OCR needs CUDA + Flash Attention)
- Handwritten text is not well supported (OCR is optimized for printed text)
- Very low-resolution scans (< 150 DPI source) may produce lower quality results
- The model downloads ~6GB on first run and needs ~8GB VRAM during inference
MIT