Zero-Error Long-Horizon LLM Execution via SPRT Voting, Red-Flagging, and Microagent Orchestration
This is a research experiment, not production software.
- Purpose: Validate and explore claims from the research paper Solving a Million-Step LLM Task with Zero Errors (Meyerson et al., 2025)
- Goals: Implement the MAKER algorithm, discover alternative approaches, test boundaries of zero-error LLM execution
- Status: Active research - may never reach production readiness
- Use at your own risk: APIs will change, results are experimental, not suitable for critical applications
MAKER (Massively decomposed Agentic processes with K-margin Error Reduction) is a Rust implementation exploring mathematically grounded error correction in LLM agents. It tests zero-error execution through SPRT-based voting, red-flag validation, and m=1 microagent decomposition.
Based on: Solving a Million-Step LLM Task with Zero Errors (Meyerson et al., 2025)
Even with 99% per-step accuracy, a 1,000-step task has only a 0.004% success rate. The MAKER algorithm aims to transform this into 95%+ reliability with logarithmic cost scaling (a worked example follows the table below).
This implementation explores whether that claim holds in practice.
| Task Length | Naive Success Rate | MAKER Success Rate | MAKER Cost Scaling |
|---|---|---|---|
| 7 steps | 93% | 99%+ | 21 samples |
| 1,023 steps | 0% | 95%+ | ~6,138 samples |
| 1M steps | 0% | 95%+ | Θ(s ln s) |
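The sample counts above appear to be s · k_min, the best case in which every vote closes exactly at the margin. A worked check, assuming the quick-start parameters below (p = 0.85, t = 0.95, m = 1):

```math
k_{\min} = \left\lceil \frac{\ln\left(1 - t^{m/s}\right)}{\ln\left(\frac{1-p}{p}\right)} \right\rceil,
\qquad
k_{\min}(s{=}7) = 3 \;\Rightarrow\; 7 \cdot 3 = 21,
\qquad
k_{\min}(s{=}1023) = 6 \;\Rightarrow\; 1023 \cdot 6 = 6{,}138.
```

This k_min formula is the one implemented in `src/core/kmin.rs`.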
```rust
use maker::core::{calculate_kmin, MockLlmClient, VoteConfig, vote_with_margin};

// Calculate the required k-margin for your task
let k = calculate_kmin(
    0.85,   // p: per-step success probability
    0.95,   // t: target task reliability
    1_023,  // s: total steps (10-disk Hanoi)
    1,      // m: steps per agent (must be 1)
).unwrap();

// Run error-corrected voting
let client = MockLlmClient::constant("correct_answer");
let config = VoteConfig::default();
let result = vote_with_margin("What is 2+2?", k, &client, config).unwrap();
println!("Winner: {} ({} samples)", result.winner, result.total_samples);
```

```bash
# Build and run
cargo build --release
cargo run --bin maker-mcp
```

Add to your Claude Code MCP configuration:
```json
{
  "mcpServers": {
    "maker": {
      "command": "/path/to/maker-mcp",
      "args": [],
      "env": {
        "OPENAI_API_KEY": "your-key",
        "ANTHROPIC_API_KEY": "your-key"
      }
    }
  }
}
```

See Validation Demos below for detailed documentation and results.
These demos validate MAKER's error correction capabilities against different task types.
Purpose: Validates MAKER on multi-step reasoning tasks requiring algorithmic thinking.
```bash
# 3-disk Hanoi (7 steps) with OpenAI
cargo run --example hanoi_demo -- --disks 3 --provider openai

# 5-disk Hanoi (31 steps) with strict mode (halt on first error)
cargo run --example hanoi_demo -- --disks 5 --provider openai --strict

# With ensemble (multiple providers)
cargo run --example hanoi_demo -- --disks 3 --provider openai --ensemble

# Mock mode for CI/testing
MAKER_USE_MOCK=1 cargo run --example hanoi_demo -- --disks 10 --accuracy 0.85
```

Key flags:
| Flag | Description |
|---|---|
| `--disks N` | Number of disks (1-20) |
| `--provider` | LLM provider: ollama, openai, anthropic |
| `--model` | Model name override |
| `--strict` | Halt on first error (true zero-error mode) |
| `--ensemble` | Enable multi-provider ensemble |
Results (gpt-5-mini):
- 31/31 steps (5-disk) with 0 errors using few-shot + chain-of-thought prompt
- ~2.7 samples per step average
- p_hat converges to 0.950 (target reliability)
Key Finding: Raw prompts fail on step 2 (systematic reasoning errors). Few-shot examples + chain-of-thought prompting converts systematic errors into random errors that voting can correct.
Purpose: Validates MAKER on tasks with truly random errors (calculation mistakes).
```bash
# 20 problems with OpenAI
cargo run --example arithmetic_demo -- --problems 20 --provider openai

# 50 problems with higher difficulty
cargo run --example arithmetic_demo -- --problems 50 --provider openai --difficulty 3

# Strict mode with reproducible seed
cargo run --example arithmetic_demo -- --problems 50 --provider openai --strict --seed 42

# Mock mode for CI/testing
MAKER_USE_MOCK=1 cargo run --example arithmetic_demo -- --problems 100 --accuracy 0.90
```

Key flags:
| Flag | Description |
|---|---|
| `--problems N` | Number of problems to solve |
| `--difficulty 1-5` | Number magnitude (10^difficulty) |
| `--provider` | LLM provider |
| `--strict` | Halt on first error |
| `--seed N` | Reproducible problem generation |
Results (gpt-5-mini):
- 50/50 problems correct with 0 errors
- 2.7 samples per problem average
- Handles addition, subtraction, multiplication
Key Finding: Random calculation errors are effectively corrected by voting. This validates MAKER's core premise - voting corrects random errors.
Our validation experiments reveal important insights about MAKER's applicability:
| Error Type | Example | Correctable? | Why |
|---|---|---|---|
| Random errors | LLM occasionally miscalculates 73-38 | ✅ Yes | Independent errors cancel out through voting |
| Prompted reasoning | Hanoi with few-shot examples | ✅ Yes | Few-shot + CoT converts systematic errors into random ones |
| Systematic errors | LLM can't reason about Hanoi | ❌ No | All samples fail the same way |
| Knowledge gaps | LLM doesn't know an algorithm | ❌ No | Voting achieves consensus on the wrong answer |
MAKER corrects random errors, not systematic reasoning failures.
If an LLM consistently fails at a task, voting will achieve consensus on the wrong answer. Prompt engineering (few-shot examples, chain-of-thought) is critical to convert systematic failures into random errors that voting can correct.
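To see concretely why voting cancels random errors, here is a standalone Monte Carlo sketch of a first-to-ahead-by-k race, simplified to two candidates (the worst case for the correct answer). It assumes the `rand` crate (0.8-style API) and is independent of maker's own `VoteRace`:

```rust
use rand::Rng; // assumes the `rand` crate (0.8-style API)

/// First-to-ahead-by-k race between the correct answer and one wrong
/// alternative, with per-sample accuracy `p`. Returns true if correct wins.
fn vote_wins(p: f64, k: i32, rng: &mut impl Rng) -> bool {
    let mut margin = 0i32; // (correct votes) - (wrong votes)
    loop {
        margin += if rng.gen_bool(p) { 1 } else { -1 };
        if margin >= k {
            return true;
        }
        if margin <= -k {
            return false;
        }
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    let trials = 100_000;
    let wins = (0..trials).filter(|_| vote_wins(0.85, 6, &mut rng)).count();
    // Gambler's Ruin predicts 1 / (1 + (0.15/0.85)^6) ≈ 0.99997 per step,
    // enough for ~95% reliability over 1,023 independent steps.
    println!("per-step success: {:.5}", wins as f64 / trials as f64);
}
```

The same walk never converges to the right answer if errors are systematic: when most samples agree on the same wrong move, the margin races to -k instead.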
| Task | Model | Per-Step Accuracy | Samples/Step |
|---|---|---|---|
| Arithmetic (difficulty 3) | gpt-5-mini | ~95% | 2.7 |
| Hanoi (few-shot+CoT) | gpt-5-mini | ~95% | 2.7 |
| Hanoi (raw prompt) | gpt-5-mini | <50% | N/A (fails) |
maker-rs explores the boundary between academic research and practical implementation. Here's how this experiment compares to the original paper:
| Criterion | Score | Assessment |
|---|---|---|
| Algorithm Fidelity | A+ | Exact k_min formula, SPRT voting, strict m=1 enforcement |
| Experimental Coverage | A | Tokio concurrency, event sourcing, MCP integration |
| Completeness | A- | LLM-driven decomposition, domain decomposers, full recursive orchestration |
| Validation Status | B+ | Demos validate core claims; edge cases need more testing |
- Strict m=1 Decomposition: Micro-agents execute exactly one step with minimal context, matching the paper's core assertion that "smallest possible subtasks" enable scaling
- SPRT-Based Voting: Uses actual Sequential Probability Ratio Test and Gambler's Ruin logic—not heuristic "best of 3" (see the derivation sketched after this list)
- Dynamic k_min Calculation: Computes margin from the paper's logarithmic formula based on target reliability (t) and task length (s)
- Red-Flagging as Primitive: Treats validation as statistical necessity to decorrelate errors, discarding (never repairing) malformed outputs
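The statistical core behind these bullets is compact. A first-to-ahead-by-k race between the correct answer and a wrong one is a biased random walk absorbed at ±k, so Gambler's Ruin gives the per-step success probability:

```math
P_{\text{win}}(k) = \frac{1}{1 + \left(\frac{1-p}{p}\right)^{k}}
```

Requiring each of the s/m steps to succeed with probability at least t^(m/s) and solving for k (with t^(m/s) ≈ 1) yields the k_min formula implemented in `src/core/kmin.rs`.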
Beyond the paper's theoretical model (experimental additions):
| Enhancement | Purpose |
|---|---|
| Tokio Runtime | Massive I/O concurrency for parallel vote sampling across thousands of steps |
| Event Sourcing | Real-time observability bridging theoretical probability with practical debugging |
| MCP Integration | Transforms abstract state machine into consumable tools for Claude Code and other clients |
| Exponential Backoff | Handles API rate limits (429s) with jitter and circuit breakers (see the sketch after this table) |
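As a rough illustration of the backoff row above — a sketch using the `rand` crate, with hypothetical names that are not the actual API of `src/llm/retry.rs`:

```rust
use rand::Rng; // assumes the `rand` crate (0.8-style API)
use std::time::Duration;

/// Exponential backoff with full jitter (hypothetical helper; not the
/// actual API of src/llm/retry.rs).
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    // Exponentially growing ceiling, clamped to avoid overflow and the cap.
    let ceiling = base_ms.saturating_mul(1u64 << attempt.min(16)).min(cap_ms);
    // Full jitter: sample uniformly in [0, ceiling] so concurrent retries
    // don't hit the provider in lockstep.
    Duration::from_millis(rand::thread_rng().gen_range(0..=ceiling))
}

fn main() {
    for attempt in 0..5 {
        println!("retry {attempt}: wait {:?}", backoff_delay(attempt, 100, 30_000));
    }
}
```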
Acknowledged gaps vs. paper's full vision:
- Semantic Matching: MVP defaults to exact string matching; embedding-based and AST-based matchers are available but less battle-tested
- Decomposition Testing: LLM-driven decomposition is implemented but needs more real-world validation across diverse task types
| Paper Assumption | Implementation Reality |
|---|---|
| Idealized sampling | Backoff, circuit breakers, retry budgets for real API constraints |
| Red-flagging for decorrelation | Dual-purposed as security guardrail against prompt injection |
| Unlimited parallelism | Configurable concurrency limits to respect provider quotas |
Run `cargo test --test monte_carlo` to validate the statistical guarantees empirically.
```text
src/
├── core/                   # Core MAKER algorithms
│   ├── kmin.rs             # k_min = ⌈ln(1 - t^(m/s)) / ln((1-p)/p)⌉
│   ├── voting.rs           # VoteRace: first-to-ahead-by-k (thread-safe)
│   ├── redflag.rs          # RedFlagValidator: discard-don't-repair
│   ├── executor.rs         # vote_with_margin() + vote_with_margin_adaptive()
│   ├── adaptive.rs         # KEstimator: EMA-based dynamic k adjustment
│   ├── matcher.rs          # CandidateMatcher trait + ExactMatcher
│   ├── matchers/           # Pluggable matcher implementations
│   │   ├── embedding.rs        # EmbeddingMatcher (cosine similarity)
│   │   ├── ollama_embedding.rs # Ollama embedding client
│   │   ├── openai_embedding.rs # OpenAI embedding client
│   │   └── code.rs             # CodeMatcher (tree-sitter AST, optional)
│   ├── decomposition/      # Recursive task decomposition (Section 7)
│   │   ├── llm_agent.rs        # LLM-driven "Insight Agent" for task discovery
│   │   ├── discriminator.rs    # Vote on decomposition proposals
│   │   ├── orchestrator.rs     # Recursive decomposition coordinator
│   │   ├── solver.rs           # Leaf node executor with voting
│   │   ├── aggregator.rs       # Solution composition
│   │   └── domains/            # Domain-specific decomposers (coding, data, ML)
│   └── orchestration.rs    # TaskOrchestrator with m=1 constraint
├── llm/                    # Multi-provider LLM abstraction
│   ├── ollama.rs           # Local inference
│   ├── openai.rs           # OpenAI API
│   ├── anthropic.rs        # Anthropic API
│   ├── ensemble.rs         # Multi-model ensemble configuration
│   ├── retry.rs            # Exponential backoff with jitter
│   └── sampler.rs          # Temperature-diverse parallel sampling
├── mcp/                    # MCP server (rmcp v0.13)
│   ├── server.rs           # MakerServer with #[tool_router]
│   └── tools/              # maker/vote, validate, calibrate, configure, decompose
└── events/                 # Event-driven observability
    ├── bus.rs              # Tokio broadcast channel
    └── observers/          # Logging (tracing) + Metrics
```
The MCP server exposes the following tools:

`maker/vote`

Request:

```json
{ "prompt": "...", "k_margin": 3, "max_samples": 20, "matcher": "embedding" }
```

Response:

```json
{
  "winner": "answer", "vote_counts": {"answer": 5}, "total_samples": 7,
  "k_used": 3, "p_hat": 0.87, "matcher_type": "exact", "candidate_groups": 2
}
```

`maker/validate`

Request:

```json
{ "response": "...", "token_limit": 700, "schema": {"required": ["move"]} }
```

Response:

```json
{ "valid": false, "red_flags": [{"flag_type": "TokenLengthExceeded", "details": "..."}] }
```

`maker/calibrate`

Request:

```json
{ "samples": [{"prompt": "...", "ground_truth": "..."}] }
```

Response:

```json
{ "p_estimate": 0.87, "confidence_interval": [0.82, 0.92], "recommended_k": 4 }
```

`maker/configure`

Request:

```json
{
  "k_default": 3, "temperature_diversity": 0.1, "token_limit": 700,
  "adaptive_k": true, "ema_alpha": 0.1, "k_bounds": [2, 10],
  "matcher": { "type": "embedding", "threshold": 0.92, "provider": "ollama" }
}
```

Response:

```json
{ "applied": true, "current_config": {} }
```

`maker/decompose`

Request:

```json
{
  "task": "Build a web scraper that extracts product data",
  "depth_limit": 10,
  "provider": "openai"
}
```

Response:

```json
{
  "task_id": "root-task",
  "subtasks": [
    {"id": "1", "description": "Parse HTML structure", "is_leaf": true},
    {"id": "2", "description": "Extract product fields", "is_leaf": true}
  ],
  "composition": "sequential",
  "depth": 1
}
```

MAKER's cost scales as Θ(s ln s) vs. exponential for naive retry:
| Approach | 7 steps | 1,023 steps | 1M steps |
|---|---|---|---|
| MAKER | 21 samples | ~6K samples | ~20M samples |
| Naive retry | 7 attempts | Infeasible | Impossible |
| Steps (s) | MAKER Cost | Naive Retry Cost | Savings |
|---|---|---|---|
| 20 | 80 | 1,520 | 94.7% |
| 50 | 200 | 506,400 | 99.96% |
| 100 | 500 | 3.4 billion | ~100% |
Naive retry must rerun the entire task on any step failure. With p=0.85, the probability of completing 100 steps without error is 0.85^100 ≈ 8.7×10⁻⁸, so on the order of ten million full restarts are needed and total step executions climb into the billions, as the table shows. MAKER's per-step voting keeps costs logarithmic.
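In equation form (standard retry arithmetic; the benchmark's exact cost accounting may differ):

```math
\Pr[\text{all } s \text{ steps correct}] = p^{s} = 0.85^{100} \approx 8.7 \times 10^{-8},
\qquad
\mathbb{E}[\text{full-task restarts}] = p^{-s} \approx 1.1 \times 10^{7}
```

MAKER instead pays roughly s · k_min samples, and because 1 - t^(m/s) shrinks like 1/s, k_min grows only logarithmically in s — hence the Θ(s ln s) total.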
Run `cargo test --test monte_carlo -- --nocapture` to reproduce these results.
MAKER dynamically adjusts the k-margin based on observed accuracy, reducing API calls when the model is performing well:
```rust
use maker::core::{KEstimator, vote_with_margin_adaptive, MockLlmClient, VoteConfig};

let client = MockLlmClient::constant("answer");
let config = VoteConfig::default();
let mut estimator = KEstimator::new(0.85, 0.95, 100);

// k starts high, decreases as accuracy is confirmed
let result = vote_with_margin_adaptive("prompt", &client, config, &mut estimator).unwrap();
println!("Used k={}, estimated p={:.2}", result.k_used, estimator.p_hat());
```

Configure via MCP: `{"adaptive_k": true, "ema_alpha": 0.1, "k_bounds": [2, 10]}`
For non-deterministic tasks (code generation, natural language), MAKER supports pluggable matchers that group equivalent responses:
| Matcher | Use Case | Method |
|---|---|---|
| `ExactMatcher` (default) | Deterministic tasks | Whitespace-normalized string equality |
| `EmbeddingMatcher` | Natural language | Cosine similarity of embeddings (threshold: 0.92) |
| `CodeMatcher` | Code generation | Tree-sitter AST comparison with alpha-renaming |
```rust
use maker::core::matchers::embedding::{EmbeddingMatcher, MockEmbeddingClient};

// The embedding matcher groups semantically similar responses
let matcher = EmbeddingMatcher::new(Box::new(MockEmbeddingClient::default()), 0.92);
assert!(matcher.are_equivalent("The answer is 42", "The answer is 42."));
```

The `CodeMatcher` requires the `code-matcher` feature flag:

```bash
cargo build --features code-matcher
cargo test --features code-matcher
```

MAKER supports voting across heterogeneous LLM models to decorrelate errors by model architecture, not just sampling temperature:
| Strategy | Behavior | Use Case |
|---|---|---|
| `RoundRobin` | Distribute samples evenly across models | Maximize diversity |
| `CostAware` | Cheap models first, escalate on disagreement | Minimize cost |
| `ReliabilityWeighted` | More samples from higher-reliability models | Optimize accuracy |
Configure via MCP:
```json
{
  "ensemble": {
    "models": [
      { "provider": "ollama", "model": "llama3", "cost_tier": "cheap" },
      { "provider": "anthropic", "model": "claude-haiku", "cost_tier": "medium" }
    ],
    "strategy": "cost_aware"
  }
}
```

The cost-aware ensemble saves 87.5%+ vs. a single expensive model. See BENCHMARKS.md for full results.
MAKER includes domain-specific benchmarks covering coding tasks, math/logic, and data analysis:
```bash
cargo bench --bench coding_tasks        # 10 coding tasks (trivial to complex)
cargo bench --bench math_logic          # Arithmetic, symbolic, logic, Hanoi
cargo bench --bench data_analysis       # CSV, statistics, SQL, data cleaning
cargo bench --bench cost_scaling        # Θ(s ln s) cost validation
cargo bench --bench ensemble_comparison # Single-model vs ensemble comparison
```

See BENCHMARKS.md for detailed results and acceptance criteria.
```bash
cargo build                              # Build
cargo test                               # All tests (unit + integration + property)
cargo test --features code-matcher       # Include tree-sitter code matcher tests
cargo test --example hanoi               # Hanoi example tests (21 tests)
cargo test --test properties             # Property-based tests (proptest, 21 tests)
cargo test --test mcp_integration        # MCP integration tests (35 tests)
cargo test --test semantic_matching      # Semantic matching tests (16/25 tests)
cargo test --test monte_carlo            # Monte Carlo cost validation
cargo bench --bench cost_scaling         # Cost scaling benchmark
cargo bench --bench coding_tasks         # Coding task benchmark
cargo bench --bench math_logic           # Math & logic benchmark
cargo bench --bench data_analysis        # Data analysis benchmark
cargo bench --bench ensemble_comparison  # Ensemble comparison
cargo clippy                             # Lint
cargo fmt --check                        # Format check
cargo doc --no-deps --open               # API documentation
```

MAKER implements defense-in-depth for MCP tool security:
- Schema enforcement: `#[serde(deny_unknown_fields)]` on all inputs (see the sketch after this list)
- Red-flag filtering: Malformed LLM outputs discarded, never repaired
- Prompt limits: 10,000 character maximum
- Microagent isolation: No history leakage between steps (m=1)
- State hash validation: Corruption detected before state transfer
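A minimal sketch of the schema-enforcement pattern (the struct and fields are illustrative, echoing the maker/vote request shown earlier; they are not the server's actual types):

```rust
use serde::Deserialize;

/// Illustrative request type. With `deny_unknown_fields`, any unexpected
/// key makes deserialization fail instead of being silently ignored.
#[allow(dead_code)]
#[derive(Deserialize)]
#[serde(deny_unknown_fields)]
struct VoteRequest {
    prompt: String,
    k_margin: u32,
    max_samples: Option<u32>,
}

fn main() {
    // The extra "injected" key is rejected outright.
    let bad = r#"{"prompt": "2+2?", "k_margin": 3, "injected": "payload"}"#;
    assert!(serde_json::from_str::<VoteRequest>(bad).is_err());
}
```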
See SECURITY.md for vulnerability reporting.
If you use MAKER in your research or projects, please cite it using GitHub's "Cite this repository" button, or use the following BibTeX:
```bibtex
@software{allen_maker_2026,
  author  = {Allen, Robert},
  title   = {{MAKER Framework}},
  version = {0.3.0},
  date    = {2026-02-27},
  url     = {https://github.com/zircote/maker-rs},
  license = {MIT}
}
```

See CITATION.cff for full citation metadata.
MAKER is a Rust implementation of the algorithms and theoretical framework presented in:
Meyerson, E., Qiu, X., & Lehman, J. (2025). Solving a Million-Step LLM Task with Zero Errors. arXiv:2511.09030
- Meyerson, E., Qiu, X., & Lehman, J. (2025). Solving a Million-Step LLM Task with Zero Errors. arXiv:2511.09030.
- Anthropic. (2024). Introducing the Model Context Protocol. anthropic.com.
- Wald, A. (1945). Sequential Tests of Statistical Hypotheses. Annals of Mathematical Statistics, 16(2), 117-186. (Foundational SPRT work.)
MIT - see LICENSE