Skip to content

vlwkaos/ir

Repository files navigation

ir

ENG | 한국어

Local semantic search engine for markdown knowledge bases. Rust port of qmd with three key differences:

  • Per-collection SQLite — each collection is an independent file; no shared global index
  • Persistent daemon — models stay loaded between queries; first search auto-starts it
  • Dual LLM cache — expander outputs and reranker scores are persisted; repeated queries are instant

Search quality benchmarked on 4 BEIR datasets; reranking adds up to +14.5% nDCG@10 over pure vector.

Features
  • Hybrid search — BM25 probe → score fusion (0.80·vec + 0.20·bm25) → LLM reranking
  • Query expansion — typed sub-queries (lex/vec/hyde) when expander model is present
  • Strong-signal shortcut — skips expansion when top BM25 score ≥ 0.75 with gap ≥ 0.10
  • Daemon mode — keeps models warm between queries; auto-starts on first search
  • Dual LLM cache — expander outputs cached globally; reranker scores cached per-collection
  • Per-collection SQLite — independent WAL journals, isolated backup, zero cross-collection contention
  • Content-addressed storage — identical files deduplicated by SHA-256 within a collection
  • FTS5 injection-safe — all user input escaped before FTS5 query construction
  • Metal GPU — all layers offloaded to Metal on macOS by default; IR_GPU_LAYERS=N to override
  • Auto-download — models fetched from HuggingFace Hub on first use; HF_HUB_OFFLINE=1 to disable

Installation

Homebrew (macOS):

brew install vlwkaos/tap/ir

From source:

cargo install --path .

Requires Rust 1.80+. On macOS, links llama.cpp with Metal automatically.

Quick start

ir collection add notes ~/notes   # register a collection
ir update notes                   # scan files → extract text → populate FTS5 index (BM25)
ir embed notes                    # chunk text → run embedding model → store vectors (enables vector + hybrid search)
ir search "memory safety in rust" # search (daemon auto-starts)

ir update is fast (no models, pure text processing). ir embed is slow on first run (model inference per chunk) but only re-embeds changed content on subsequent runs. BM25 search works after update alone; vector and hybrid search require embed.

Models

Models are downloaded automatically from HuggingFace Hub on first use and cached in ~/.cache/huggingface/. No manual setup required.

Model HF Repo Required for
EmbeddingGemma 300M ggml-org/embeddinggemma-300M-GGUF ir embed, vector search, hybrid
Qwen3.5-0.8B unsloth/Qwen3.5-0.8B-GGUF unified expand + rerank (optional)
Qwen3.5-2B unsloth/Qwen3.5-2B-GGUF unified expand + rerank (optional)
Qwen3-Reranker 0.6B ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF reranking only (optional)
qmd-query-expansion 1.7B tobil/qmd-query-expansion-1.7B expansion only (optional)
BGE-M3 568M ggml-org/bge-m3-Q8_0-GGUF Korean embedding alternative (optional)

BM25 search works without any models. When IR_COMBINED_MODEL is set (or a Qwen3.5 GGUF is found in ~/local-models/), it replaces both the expander and reranker.

Local models:

export IR_MODEL_DIRS="$HOME/my-models"
export IR_COMBINED_MODEL="$HOME/local-models/Qwen3.5-2B-Q4_K_M.gguf"   # unified
export IR_EMBEDDING_MODEL="$HOME/my-models/embeddinggemma-300M-Q8_0.gguf"
export IR_RERANKER_MODEL="$HOME/my-models/qwen3-reranker-0.6b-q8_0.gguf"
export IR_EXPANDER_MODEL="$HOME/my-models/qmd-query-expansion-1.7B-q4_k_m.gguf"

Search order: env → IR_MODEL_DIRS~/local-models/~/.cache/ir/models/~/.cache/qmd/models/ → HF Hub auto-download.

IR_*_MODEL env vars accept a path to a .gguf file, a directory containing a known model file, or a HuggingFace repo ID (owner/name). Unrecognized values error immediately instead of silently loading the default.

Known HF repo IDs: ggml-org/embeddinggemma-300M-GGUF, ggml-org/bge-m3-Q8_0-GGUF, ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF, tobil/qmd-query-expansion-1.7B, unsloth/Qwen3.5-0.8B-GGUF, unsloth/Qwen3.5-2B-GGUF.

Compatibility aliases: QMD_EMBEDDING_MODEL, QMD_RERANKER_MODEL, QMD_EXPANDER_MODEL, QMD_MODEL_DIRS.

Config directory:

export IR_CONFIG_DIR="~/vault/.config/ir"   # portable across machines

IR_CONFIG_DIR sets the directory for config, collection DBs, and daemon files. Supports ~ and $VAR expansion, so the value is safe to use in MCP configs synced across machines. Precedence: IR_CONFIG_DIRXDG_CONFIG_HOME/ir (deprecated) → ~/.config/ir.

GPU:

IR_GPU_LAYERS=0 ir search "query"   # force CPU
IR_GPU_LAYERS=32 ir search "query"  # partial offload
Usage

Collections:

ir collection add notes ~/notes
ir collection add code  ~/code
ir collection ls
ir collection rm notes
ir status                    # index health per collection

Index and embed:

ir update                    # index all collections
ir update notes              # one collection
ir update notes --force      # full re-index from scratch

ir embed                     # embed all unembedded documents
ir embed notes --force       # re-embed everything

Search:

ir search "memory safety in rust"
ir search "sqlite architecture" --mode bm25
ir search "async patterns"     --mode vector
ir search "error handling"     --mode hybrid -c notes --min-score 0.4

# Output formats
ir search "ownership" --json
ir search "ownership" --md
ir search "ownership" --files       # paths only
ir search "ownership" --full        # include full document content in results
ir search "ownership" --chunk       # include best-matching chunk text (vector results)
ir search "ownership" --quiet       # suppress stderr (progress, logs) — for scripting

# Filter by field (-f/--filter, repeatable; all clauses ANDed)
ir search "design" -f "modified_at>=2026-01-01"
ir search "design" -f "meta.tags=rust"
ir search "design" -f "path~notes/"
ir search "design" -f "modified_at>=2025-01-01" -f "meta.author=vlwkaos"

Retrieve documents:

ir get "2026/Daily/04/2026-04-07.md"           # collection-relative path
ir get "Notes/2026/Daily/04/2026-04-07.md"     # vault-root path (strips collection dir prefix)
ir get "2026-04-07" -c periodic                # substring match, scoped to collection
ir get "some/path.md" --json                   # full metadata as JSON
ir get "some/path.md" --section "Installation" # extract named heading section only
ir get "some/path.md" --max-chars 3000         # first 3000 chars
ir get "some/path.md" --offset 1000 --max-chars 2000  # chars 1000–3000

ir multi-get "file1.md" "file2.md" "file3.md"  # batch fetch
ir multi-get "file1.md" "file2.md" --json       # {found: [...], not_found: [...]}
ir multi-get "file1.md" "file2.md" --files      # paths only (found ones)
ir multi-get "file1.md" "file2.md" --max-chars 2000  # truncate each doc

Path matching order: exact → suffix (%/path) → substring. Vault-root paths (where the first component matches the collection's directory name) are resolved before the normal match.

Filter syntax (-f/--filter):

Each clause is a string FIELD OP VALUE. Multiple -f flags are ANDed together.

Field Description
path Document path (relative to collection root)
modified_at File modification time (UTC RFC3339)
created_at File creation time (UTC RFC3339)
meta.<name> Frontmatter field (e.g. meta.tags, meta.author)
Op Meaning
= / != Equal / not equal (case-sensitive)
> / >= / < / <= Lexicographic compare (dates normalize to UTC RFC3339)
~ / !~ Contains / not-contains (case-insensitive)

Date values for modified_at, created_at, and meta.date are normalized to UTC RFC3339 (YYYY-MM-DD becomes YYYY-MM-DDT00:00:00Z). Multi-valued frontmatter fields (e.g. tag arrays) match if any element satisfies the clause — including !=. A doc tagged ["rust", "go"] passes meta.tags!=rust because "go" satisfies the condition. Documents with no metadata rows always fail meta.* clauses.

Note: Collection DBs are upgraded to schema version 2 on first use after this release. The one-time backfill (populating document_metadata from existing frontmatter) is fast (<1s for <10k docs).

Daemon:

ir daemon start              # start (auto-started on first search)
ir daemon stop
ir daemon status

The daemon keeps models warm in memory. Subsequent queries over the Unix socket skip model loading entirely (~30ms round-trip vs 3s cold).

Incremental Indexing

IR efficiently handles updates by only processing changed files through content-addressed storage with SHA-256 hashing.

How it works:

  • Change detection: Files are hashed (SHA-256) and compared against stored hashes
  • Smart updates: Only modified or new files are re-processed
  • Deletion handling: Removed files are marked as inactive (soft delete)
  • Deduplication: Identical content within a collection shares storage

Index operations:

# Regular incremental update (default)
ir update                    # all collections
ir update notes              # specific collection

# Force full re-index from scratch
ir update notes --force      # rebuilds entire index

# Check what changed (see the summary)
ir update notes
# Output: "2 added, 1 updated, 0 deactivated"

Embedding operations:

# Incremental embedding (only new/changed documents)
ir embed                     # embeds unembedded content
ir embed notes               # specific collection

# Force re-embedding everything
ir embed notes --force       # re-computes all vectors

Performance characteristics:

  • Initial indexing: fast (no models, pure text extraction)
  • Incremental updates: only processes changed files
  • Hash comparison: instant even for thousands of files
  • Embedding: slow first time, fast incremental updates

Example workflow:

# Monday: initial setup
ir collection add notes ~/notes
ir update notes              # indexes 500 files
ir embed notes               # computes 500 embeddings (slow)

# Tuesday: added 3 files, modified 2
ir update notes              # Output: "3 added, 2 updated, 0 deactivated"
ir embed notes               # only embeds 5 documents (fast)

# Wednesday: deleted 1 file
ir update notes              # Output: "0 added, 0 updated, 1 deactivated"
# No embedding needed for deletions

The incremental approach means you can run ir update frequently without performance penalty — only changed content is processed.

MCP server — Claude Desktop / Claude Code

ir mcp runs a Model Context Protocol server so Claude can search your indexed documents directly.

Claude Desktop (~/.config/claude/claude_desktop_config.json):

{
  "mcpServers": {
    "ir": {
      "command": "ir",
      "args": ["mcp"]
    }
  }
}

Claude Code (.mcp.json in project root or ~/.claude/mcp.json):

{
  "mcpServers": {
    "ir": {
      "command": "ir",
      "args": ["mcp"]
    }
  }
}

Five tools are exposed:

Tool Description
search Hybrid BM25+vector search. Returns path, title, score, snippet. Params: mode, limit, min_score, collections, full (include full doc text), include_chunk (include best-matching chunk text), filter (array of {field, op, value} objects, ANDed).
get Retrieve document text by path (exact → suffix → substring match). Params: collections, section (heading text, case-insensitive), offset (char offset), max_chars (truncate).
multi_get Batch document retrieval. Params: paths[], collections, max_chars (truncate each doc). Returns found and not_found.
status Index health — collection names, doc counts, DB sizes, daemon status.
update Re-index collections after file changes. Accepts collection and force params.

The filter array accepts structured clauses: {"field": "modified_at", "op": ">=", "value": "2024-01-01"}. Fields: path, modified_at, created_at, meta.<name>. Ops: =, !=, >, >=, <, <=, ~ (contains), !~ (not-contains).

HTTP mode (for remote access or multi-client setups):

ir mcp --http 3620    # serve on all interfaces, port 3620

Configure clients to point at http://<host>:3620/mcp. The daemon starts automatically on first search tool call.

Security note: HTTP mode is unauthenticated and binds to all interfaces. Only expose it on trusted networks. The update tool can trigger re-indexing, so treat it like any other local write-access service.

Preprocessors — Korean / Japanese / Chinese

Preprocessors tokenize text before BM25 indexing. Without one, agglutinated words ("이스탄불의", "東京都") are treated as single FTS tokens and never match morpheme-level queries. The same preprocessor runs at index time and query time.

Korean (lindera, Mode::Decompose):

ir preprocessor install ko          # downloads official lindera CLI + ko-dic, registers as "ko"
                                    # shows collection picker to bind immediately
ir collection add wiki ~/wiki       # add collection (if not yet added)
ir preprocessor bind ko wiki        # wire "ko" to collection and re-index
ir search "서울 지하철" -c wiki

ir preprocessor install ko downloads the official lindera CLI binary and ko-dic dictionary from lindera's GitHub releases. Supported platforms: macOS (arm64, x86_64) and Linux (x86_64, aarch64). No system deps, no Rust toolchain required. The install step shows an interactive picker so you can bind to collections right away.

Other languages:

ir preprocessor install ja    # Japanese (Lindera + ipadic)
ir preprocessor install zh    # Chinese (Lindera + jieba)

Manage:

ir preprocessor list          # shows registered + available bundled preprocessors
ir preprocessor remove ko     # unregister (keeps binary)
ir preprocessor remove ko -d  # unregister and delete binary

The protocol is stdin/stdout line-by-line: one UTF-8 line in, zero or one tokenized line out (zero if all tokens are filtered), process stays alive between lines. The subprocess must pass ASCII-only single-word lines through unchanged — ir uses an internal sentinel token to detect when a line produces no output. Any executable following this protocol can be registered.

Lindera throughput: ~5,600 Korean docs/s · 1.8 MB/s on M-series Mac. Near-zero cold start (Rust binary, embedded dictionary).

Korean BM25 benchmark (MIRACL-Korean, 213 queries):

preprocessor nDCG@10 note
none 0.0009 agglutinated tokens never match
lindera 0.0460 50× gain from morphological tokenization
lindera hybrid+rerank 0.8411 near-ceiling on 2,835 passages

Compound decompounding benchmark (50 queries targeting compound sub-components):

preprocessor nDCG@10 note
none 0.0000 sub-parts absent from FTS index
lindera 0.6326 Mode::Decompose splits compounds

See research/experiment.md for full results and rationale.

Korean embedding models: For Korean-optimized dense retrieval, BGE-M3 can replace the default embedding model via IR_EMBEDDING_MODEL. Filename auto-detection handles pooling and formatting. See README.ko.md for setup. Switching models requires ir embed --force (vector dimensions auto-adapt).

Search Pipeline
Query → BM25 probe → score fusion (0.80·vec + 0.20·bm25) → reranking

Strong-signal shortcut (BM25 score ≥ 0.75, gap ≥ 0.10) skips all LLM work. With expander: expand → lex/vec/hyde sub-queries → RRF → rerank top-20. All LLM outputs cached in SQLite — repeated queries skip inference entirely.

See research/pipeline.md for staged async daemon design.

Benchmark — BEIR (4 datasets, nDCG@10)

EmbeddingGemma 300M embeddings + qmd-expander-1.7B + Qwen3-Reranker-0.6B.

Dataset BM25 Vector Hybrid +Reranker LLM gain
NFCorpus (323q) 0.2046 0.3898 0.3954 0.4001 +1.2%
SciFact (300q) 0.0500 0.7847 0.7873 0.7797 −1.0%
FiQA (648q) 0.0298 0.4324 0.4266 0.4567 +7.1%
ArguAna (1406q) 0.0012 0.4264 0.4263 0.4879 +14.5%

BM25 fusion provides no statistically significant lift over pure vector (paired t-test). Reranker gains are largest on conversational/argument retrieval.

See research/experiment.md for reproduction steps.

vs qmd

ir is a Rust port of qmd with a different storage model and a persistent daemon.

qmd ir
Storage Single SQLite for all collections Per-collection SQLite — rm name.sqlite to delete
Concurrent writes Shared WAL journal Independent WAL per collection
sqlite-vec Dynamically loaded .so Statically compiled in
Process model Spawns per query Daemon keeps models warm
LLM cache Reranker scores (per-collection) Reranker scores + expander outputs (global)
Quality (NFCorpus nDCG@10) No published numbers 0.4001

Performance (macOS M4 Max, same models and query):

ir qmd Ratio
Cold (no cache) 3.0s 9.5s
Warm (daemon + caches hot) 30ms 840ms 28×

Cold difference: ir caps reranking at 20 candidates vs qmd's 40. Warm difference: qmd pays ~800ms process spawn + JS runtime per invocation; ir's daemon round-trip is 30ms (embed + kNN only).

Development
cargo build                  # debug build
cargo build --release        # release build
cargo test                   # unit tests (no models required)
cargo test -- --ignored      # model-dependent tests (requires models)
cargo run --bin eval -- --data test-data/nfcorpus --mode all
Schema

Each collection database (~/.config/ir/collections/<name>.sqlite):

content          — hash → full text (content-addressed)
documents        — path, title, hash, active flag
documents_fts    — FTS5 virtual table (porter tokenizer)
vectors_vec      — sqlite-vec kNN (768d cosine, EmbeddingGemma format)
content_vectors  — chunk metadata (hash, seq, pos, model)
llm_cache        — reranker score cache (sha256(model+query+doc) → score)
meta             — collection metadata (name, schema version)

Global cache (~/.config/ir/expander_cache.sqlite):

expander_cache   — sha256(model+query) → JSON Vec<SubQuery>

Triggers keep documents_fts in sync with documents on insert/update/delete.

About

Fast local markdown search engine (with possible CJK support)

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors