AiDinho/ruosh

ruosh

ruosh is a full-text search library for Python. It exposes a Whoosh-like API, but the search engine underneath is Tantivy, which is written in Rust. The goal is to give Python code a familiar interface while getting the speed of a compiled search engine.

Install

pip install ruosh

Python 3.10 or newer is required. No Rust toolchain is needed: wheels are pre-built for Linux, macOS, and Windows.

Quick start

from ruosh import Schema, TEXT, ID, create_in

schema = (
    Schema()
    .add("doc_id", ID(stored=True, unique=True))
    .add("title",  TEXT(stored=True))
    .add("body",   TEXT(stored=True))
)

idx = create_in("my-index", schema)

w = idx.writer()
w.add_document(doc_id="1", title="Getting started", body="A short introduction to full-text search.")
w.add_document(doc_id="2", title="Advanced queries", body="Boolean, phrase, and field-specific queries.")
w.commit()

for hit in idx.search("introduction"):
    print(hit["doc_id"], hit["title"], hit.score)
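When loading more than a handful of documents, it usually pays to add them all through one writer and commit once, since a commit is comparatively expensive in Tantivy-based indexes. The helper below is a minimal sketch of that pattern. `index_documents` is a hypothetical name, not part of ruosh, and it relies only on the `writer()`, `add_document()`, and `commit()` calls shown above.

```python
from typing import Any, Iterable, Mapping


def index_documents(idx: Any, docs: Iterable[Mapping[str, Any]]) -> int:
    """Add every mapping in `docs` through one writer and commit once.

    `idx` is assumed to be an index object as returned by create_in().
    Returns the number of documents added.
    """
    writer = idx.writer()
    count = 0
    for doc in docs:
        # Keys of each mapping must match field names in the schema.
        writer.add_document(**doc)
        count += 1
    writer.commit()
    return count
```

Usage would look like `index_documents(idx, [{"doc_id": "3", "title": "...", "body": "..."}])`.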

Queries

String queries search all text fields:

results = idx.search("full-text search", limit=10)

Structured queries let you be specific:

from ruosh.query import Term, And, Or, Phrase

# must contain both terms in the body field
results = idx.search(And(Term("body", "search"), Term("body", "fast")), limit=10)

# exact phrase
results = idx.search(Phrase("body", ["full-text", "search"]), limit=10)

Pagination

page2 = idx.search("search", limit=10, offset=10)
print(f"{page2.total} total hits, showing {len(page2)}")
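To walk through every hit rather than a single page, the `limit` and `offset` arguments can be wrapped in a generator. This is a sketch under the assumption that a result object is iterable and exposes `total`, as in the snippet above. `iter_hits` is a hypothetical helper name, not a ruosh function.

```python
from typing import Any, Iterator


def iter_hits(idx: Any, query: Any, page_size: int = 10) -> Iterator[Any]:
    """Yield every hit for `query`, fetching `page_size` hits per search call."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = idx.search(query, limit=page_size, offset=offset)
        fetched = 0
        for hit in page:
            fetched += 1
            yield hit
        offset += fetched
        # Stop on an empty page or once the reported total has been reached.
        if fetched == 0 or offset >= page.total:
            break
```

Because it is a generator, a caller that stops iterating early never triggers the remaining search calls.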

Snippets

Pass snippet_fields to get highlighted excerpts back with each hit:

results = idx.search("search", limit=10, snippet_fields=["body"])
for hit in results:
    print(hit.snippet("body"))  # returns HTML with <b> tags around matches
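The HTML form is convenient for web pages but noisy in a terminal or a log file. The sketch below converts a snippet string to ANSI bold or to plain text using only the standard library. It assumes the snippet contains no markup other than `<b>` tags and HTML entities. Both function names are hypothetical.

```python
import html
import re

_BOLD_ON, _BOLD_OFF = "\033[1m", "\033[0m"


def snippet_to_ansi(snippet_html: str) -> str:
    """Replace <b>...</b> with ANSI bold codes and unescape HTML entities."""
    text = re.sub(r"<b>", _BOLD_ON, snippet_html, flags=re.IGNORECASE)
    text = re.sub(r"</b>", _BOLD_OFF, text, flags=re.IGNORECASE)
    return html.unescape(text)


def snippet_to_text(snippet_html: str) -> str:
    """Drop the <b> markers entirely, leaving plain text."""
    stripped = re.sub(r"</?b>", "", snippet_html, flags=re.IGNORECASE)
    return html.unescape(stripped)
```

For example, `print(snippet_to_ansi(hit.snippet("body")))` would show matches in bold on most terminals.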

Sorting

Add a numeric field with sortable=True and pass sort_by:

from ruosh import NUMERIC

schema = (
    Schema()
    .add("doc_id", ID(stored=True))
    .add("body",   TEXT(stored=True))
    .add("rank",   NUMERIC(stored=True, sortable=True))
)

# ...index documents...

results = idx.search("search", sort_by="rank", sort_desc=False)

Corpus intelligence

ruosh maintains a lightweight sidecar that tracks approximate term frequencies across the corpus. It is updated at write time and loaded once, so repeated lookups are cheap.

# how often does "search" appear across documents?
stats = idx.term_stats("body", "search")
# {"term": "search", "estimated_doc_freq": 1840, "very_common": True}

# what are the most common terms in this field?
terms = idx.frequent_terms("body", limit=20)
# [{"term": "search", "estimated_doc_freq": 1840}, ...]

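These are estimates, not exact counts. They are useful for things like query planning, stopword detection, and building tag clouds. Comparable data is available in Whoosh through the reader's most_frequent_terms method and in SQLite FTS5 through the fts5vocab virtual table; ruosh exposes it directly on the index object.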
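As an illustration of the stopword-detection use case, the sketch below filters the list returned by `frequent_terms` down to terms that appear in at least a given share of documents. It assumes the dictionary shape shown in the comments above. The total document count is supplied by the caller, because this README does not show how ruosh reports it. `stopword_candidates` and `min_ratio` are hypothetical names.

```python
from typing import Any, Iterable, Mapping


def stopword_candidates(
    terms: Iterable[Mapping[str, Any]],
    total_docs: int,
    min_ratio: float = 0.5,
) -> list[str]:
    """Return terms whose estimated document frequency is at least
    `min_ratio` of `total_docs`. Input order is preserved."""
    if total_docs <= 0:
        return []
    return [
        entry["term"]
        for entry in terms
        if entry["estimated_doc_freq"] / total_docs >= min_ratio
    ]
```

Since the frequencies are estimates, the result is best treated as a list of candidates to review, not a final stopword list.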

Updating and deleting documents

# replace a document by its unique field
w = idx.writer()
w.update_document(doc_id="1", title="Updated title", body="New body text.")
w.commit()

# delete by field value
w = idx.writer()
w.delete_by_term("doc_id", "1")
w.commit()
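Each change above follows the same shape: open a writer, apply changes, commit. A small context manager can make that harder to get wrong by committing only when the block finishes without an exception. This is a sketch built on `idx.writer()` and `commit()` alone. `committing` is a hypothetical helper, and it does not attempt any rollback, since this README does not document a cancel call.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def committing(idx: Any) -> Iterator[Any]:
    """Yield a writer and commit it only if the with-block raises no error."""
    writer = idx.writer()
    yield writer
    # Reached only when the with-block completed normally; an exception
    # raised inside the block propagates from the yield and skips this line.
    writer.commit()
```

With it, an update reads as `with committing(idx) as w: w.update_document(doc_id="1", title="Updated title", body="New body text.")`.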

Opening an existing index

from ruosh import open_dir

idx = open_dir("my-index")

Performance

ruosh trades raw single-query speed for richer features. Each search call crosses the Python-Rust boundary, which adds a few milliseconds of fixed overhead. For workloads that need very fast single-keyword lookups over small corpora, SQLite FTS5 will be faster. ruosh is a better fit when you need structured queries, pagination with correct totals, highlighted snippets, or corpus statistics.

Compared to Whoosh, ruosh is faster across the board. On a 100,000-document corpus:

| Scenario            | ruosh  | Whoosh |
|---------------------|--------|--------|
| Keyword search      | 8.9 ms | 43 ms  |
| Boolean AND         | 13 ms  | 201 ms |
| Phrase query        | 48 ms  | 371 ms |
| Paginated results   | 30 ms  | 193 ms |
| Snippet extraction  | 10 ms  | 41 ms  |
| Corpus intelligence | 0.7 ms | 5.7 ms |
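Latency depends heavily on corpus and hardware, so it can be worth measuring on your own data. The harness below is a generic sketch using only the standard library. It is not the method used to produce the figures above, and `median_latency_ms` is a hypothetical name.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    run_query: Callable[[], object],
    repeats: int = 50,
    warmup: int = 5,
) -> float:
    """Call `run_query` repeatedly and return the median wall-clock time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

A call such as `median_latency_ms(lambda: idx.search("search", limit=10))` would time the keyword-search scenario against an existing index.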

Development

Requirements: Python 3.11, Rust toolchain, uv.

git clone https://github.com/biswarupghosh/ruosh
cd ruosh
uv venv .venv --python 3.11
uv sync --extra dev
uv run maturin develop
uv run pytest

To run the full feature benchmark:

uv sync --extra dev --extra bench
uv run python scripts/benchmark_features.py

License

MIT
