dbt-claude

Can an AI coding agent build a production-quality dbt pipeline from scratch? This repo is a reproducible eval harness to find out.

It gives Claude Code an empty directory, a prompt pointing at the UK Environment Agency flood monitoring API, and bash access inside a Docker container. Claude must set up an entire dbt project from scratch -- explore the API, build models, run dbt build, fix errors, iterate -- with no human intervention.

The harness then scores what Claude built, both deterministically (does it compile? does it run?) and qualitatively (is the data model good? did it handle edge cases?).

Each eval run typically takes 15-60+ minutes and costs a few dollars in API usage, depending on the model, context length, and how many iterations Claude needs. Check Anthropic's pricing page for current rates.

(Diagram: eval pipeline overview)

Quick start

Prerequisites

  • Docker
  • Google Cloud SDK with Application Default Credentials
  • Access to Claude models in Vertex AI Model Garden

Setup

# 1. Authenticate
gcloud auth application-default login

# 2. Build the Docker image
docker build -t dbt-claude .

# 3. Set your GCP project ID
export ANTHROPIC_VERTEX_PROJECT_ID=$(gcloud config get-value project)
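
Optionally, sanity-check that ADC and the project ID are wired up before kicking off a run (standard gcloud commands only):

# 4. (Optional) Verify credentials resolve and the project variable is set
gcloud auth application-default print-access-token > /dev/null \
  && echo "ADC OK, project: ${ANTHROPIC_VERTEX_PROJECT_ID}"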

Run a scenario

./eval.sh A                    # Run one scenario
./eval.sh all                  # Run all six sequentially
./eval.sh --model claude-opus-4-6 A   # Specify model

Score the results

./validate.sh runs/A-minimal-no-skills/claude-sonnet-4-6/project   # Deterministic checks
./judge.sh A                                                        # LLM-as-judge scoring

View in the dashboard

./dashboard.sh                 # Opens http://localhost:8080/eval-dashboard-v2/

How the eval works

Scenarios

The eval varies two factors to create a 2x3 matrix of six scenarios:

  1. Context level: Minimal (just the API URLs) vs Rich (links to blog posts with full data analysis)
  2. dbt skills: None, a single skill file, or the full dbt-agent-skills plugin
| Scenario | Context | dbt Skills             | Command     |
|----------|---------|------------------------|-------------|
| A        | Minimal | None                   | ./eval.sh A |
| B        | Rich    | None                   | ./eval.sh B |
| C        | Minimal | Single skill file      | ./eval.sh C |
| D        | Rich    | Single skill file      | ./eval.sh D |
| E        | Minimal | Full plugin (8 skills) | ./eval.sh E |
| F        | Rich    | Full plugin (8 skills) | ./eval.sh F |
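
Scenario letters can also be chained with a plain shell loop; for example, to run only the minimal-context column of the matrix:

# Run the three minimal-context scenarios back to back
for s in A C E; do ./eval.sh "$s"; done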

What happens during a run

  1. eval.sh creates an empty runs/<scenario>/<model>/project/ directory
  2. For skills scenarios: copies skill files or mounts the plugin directory
  3. Launches Claude in Docker via run.sh with the appropriate prompt
  4. Claude autonomously explores the API, sets up dbt, builds models, runs dbt build, fixes errors, iterates
  5. Output is streamed to output.jsonl (full NDJSON transcript)
  6. Summary metrics (cost, tokens, duration) are extracted to summary.json
  7. validate.sh runs automated checks against the generated project

Validation vs judging

The eval has two complementary scoring layers:

|                 | validate.sh                                     | judge.sh                                                      |
|-----------------|-------------------------------------------------|---------------------------------------------------------------|
| What it is      | Deterministic shell script                      | LLM-as-judge                                                  |
| How it works    | grep/find checks + dbt build in Docker          | Sends project files + rubric to an LLM, gets scored JSON      |
| Reproducible?   | Yes                                             | No -- LLM scoring has variance between runs                   |
| What it checks  | Structure exists (files, directories, keywords) | Quality of what was built (design, correctness, completeness) |
| Can it run dbt? | Yes -- runs dbt build in Docker                 | No -- reads source files only                                 |
| Output          | validation.log (PASS/FAIL/WARN per check)       | judge.json (0-3 scores on 9 criteria)                         |
| Cost            | Free (local Docker)                             | A few cents per run                                           |

Validation answers "does it work?" -- can dbt parse the project, do the models build, are the expected files present?

Judging answers "is it good?" -- are the joins correct, is the data model well-designed, did the agent discover and handle messy data?
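
The validation checks are plain shell; a sketch of the flavour of assertion validate.sh makes (illustrative only -- the real script runs many more checks, including dbt build):

# Illustrative structure checks in the style of validate.sh (not the actual script)
PROJECT=runs/A-minimal-no-skills/claude-sonnet-4-6/project
[ -f "$PROJECT/dbt_project.yml" ] && echo "PASS: dbt_project.yml present" || echo "FAIL: dbt_project.yml missing"
grep -rq "tests:" "$PROJECT/models" && echo "PASS: schema tests defined" || echo "WARN: no schema tests found"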

The scoring rubric is in rubric.md -- 9 criteria, each scored 0-3, grounded in the reference implementation. Multiple judge providers are supported:

./judge.sh A                                                        # Gemini (default)
./judge.sh --provider claude --judge-model claude-opus-4-6 B        # Claude
./judge.sh --provider ollama --judge-model qwen2.5-coder:32b all    # Local Ollama
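
Judge scores land in judge.json per run. A hypothetical one-liner to total them -- this assumes the nine 0-3 scores sit under a scores object, which you should verify against judge.sh's actual output schema:

# Hypothetical: sum nine 0-3 criterion scores into a /27 total (schema assumed)
jq '[.scores[]] | add' runs/A-minimal-no-skills/claude-sonnet-4-6/judge.json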

Use variance.sh to run multiple judge trials and measure scoring consistency.

Data ingestion

The eval expects Claude to build data ingestion into the dbt project itself (e.g. using DuckDB's httpfs extension, Python pre-load scripts, or dbt-duckdb plugins). The validate.sh script runs dbt build against the generated project as-is -- if source tables don't exist because the project doesn't handle its own ingestion, that's a legitimate eval failure.
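
As a minimal sketch of what self-contained ingestion could look like with DuckDB's httpfs extension (table and database names here are illustrative, and the query may need tweaking for the API's exact response shape):

# Sketch: pull the stations endpoint straight into DuckDB (names illustrative)
duckdb flood.duckdb -c "
INSTALL httpfs; LOAD httpfs;
CREATE OR REPLACE TABLE raw_stations AS
SELECT unnest(items, recursive := true)
FROM read_json_auto('http://environment.data.gov.uk/flood-monitoring/id/stations');
"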

Known data quality gotchas

These are issues in the Environment Agency data that a good pipeline should handle. They were discovered while hand-building the reference pipeline and during initial data exploration, and serve as litmus tests when scoring:

  • stationReference vs station: 0.4% of records have different values -- which key is canonical?
  • URL-prefixed keys: API returns keys like http://environment.data.gov.uk/flood-monitoring/id/stations/1029TH -- these should be stripped to just 1029TH (see the sketch after this list)
  • Pipe-separated values: Some CSV fields contain multiple values separated by | where only one is expected
  • Disappearing measures: Measures appear and disappear between API calls
  • Duplicate notation values: Some stations have duplicate notation fields
  • Snapshot column selection: Snapshotting all columns captures noise (e.g. measure list changes); should track only meaningful metadata
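
For instance, the URL-prefix stripping above is a one-line transform in a staging model; a sketch assuming a raw_stations table with the full URL in a station column (names illustrative):

# Sketch: reduce the URL-prefixed key to the bare station id (names illustrative)
duckdb flood.duckdb -c "
SELECT regexp_replace(station, '^https?://environment\.data\.gov\.uk/flood-monitoring/id/stations/', '') AS station_id
FROM raw_stations;
"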

Viewing results

Dashboard

./dashboard.sh           # Starts at http://localhost:8080/eval-dashboard-v2/
./dashboard.sh 9090      # Custom port

The dashboard provides:

  • Eval run monitoring: Live polling of in-progress runs (turn counts, token usage, cost, duration) in a grid layout with minimise/maximise
  • Session browser: Browse Claude Code sessions from ~/.claude/projects/, or load any .jsonl file from disk
  • Content filtering: Toggle visibility of agent text, user messages, tool calls, tool results, and thinking blocks
  • Event detail: Click any event to expand -- tool inputs, side-by-side file diffs for edits, syntax-highlighted code blocks

Keyboard shortcuts: j/k scroll, G/gg jump to bottom/top, / search, n/N next/prev match.

Timeline viewer

claude-code-timeline.html is a standalone viewer for individual session transcripts, based on claude-code-timeline by Simon Willison. Supports file picker, drag-and-drop, paste, and URL fetch. The dashboard links to it for each eval run.

Using run.sh standalone

Beyond the eval harness, run.sh is a general-purpose wrapper for running Claude Code in Docker with Vertex AI auth. Run ./run.sh --help for full options.

./run.sh                                          # Interactive mode
./run.sh --print -p "explain the dbt models"      # One-shot prompt
./run.sh --print -p "fix the test" --max-turns 10 # Pass any Claude Code flags
./run.sh --in-place                               # Skip worktree isolation
./run.sh --no-sessions                            # Skip session capture

Worktree isolation: Inside a git repo, run.sh auto-creates an isolated worktree. Changes are auto-committed to the branch; if nothing changed, the worktree is cleaned up. Use --in-place to modify the working tree directly.
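
In plain git terms the isolation amounts to something like this (a sketch of the idea, not run.sh's actual commands or branch-naming scheme):

# Sketch only: throwaway branch + worktree, kept only if the run changed anything
branch="claude-run-$(date +%Y%m%d-%H%M%S)"
git worktree add -b "$branch" "../$branch"
# ... Claude Code runs inside ../$branch ...
if git -C "../$branch" status --porcelain | grep -q .; then
  git -C "../$branch" add -A
  git -C "../$branch" commit -m "Changes from Claude run"
else
  git worktree remove "../$branch"
fi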

How it works: Your host's ADC credentials are mounted read-only. The Docker image includes Python 3 and dbt-duckdb pre-installed. A non-root claude user satisfies Claude Code's permission requirements. Session JSONL is persisted to the host under ~/.claude/projects/docker-<encoded-path>/.
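
That boils down to a docker run of roughly this shape (a sketch -- the exact mount points, env vars, and entrypoint are run.sh's business; the paths here are assumptions based on the description above):

# Sketch of the container invocation (paths and env vars assumed; see run.sh)
docker run --rm -it \
  -v "$HOME/.config/gcloud/application_default_credentials.json:/home/claude/.config/gcloud/application_default_credentials.json:ro" \
  -e ANTHROPIC_VERTEX_PROJECT_ID \
  -e CLAUDE_CODE_USE_VERTEX=1 \
  -v "$PWD:/workspace" -w /workspace \
  dbt-claude claude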

Glossary

  • Eval: The whole project -- the apparatus for testing how well an LLM builds dbt pipelines.
  • Scenario: What we're testing -- a specific Prompt + Skill combination (e.g. "rich context with dbt skills").
  • Configuration: Scenario + Model. One cell in the test matrix (e.g. "rich context with dbt skills, using Claude Opus").
  • Run: One execution of a configuration. Multiple runs of the same configuration test repeatability.
  • Validation: Deterministic checking -- does the project build? Are the expected files present?
  • Judging: LLM-based quality assessment, scoring 9 criteria (0-3 each, max 27) against a rubric.
  • Trial: One execution of a judge against a run. Multiple trials measure scoring variance.

Project structure

dbt-claude/
├── eval.sh                     # Eval orchestrator (runs scenarios A-F)
├── run.sh                      # Docker wrapper with Vertex AI auth
├── validate.sh                 # Post-run deterministic checks
├── judge.sh                    # LLM-as-judge scoring (Gemini/Claude/Ollama)
├── variance.sh                 # Multi-trial judge variance analysis
├── prepare.sh                  # Master script: regenerate -> judge -> dashboard
├── dashboard.sh                # Starts dashboard server + builds manifest
├── Dockerfile                  # Docker image: node + claude CLI + python + dbt-duckdb
├── rubric.md                   # Scoring rubric (9 criteria, 0-3 each)
├── rubric-tiered.md            # Tiered rubric variant
├── scorecard.md                # Manual scoring template
├── generate-heatmap.py         # Cross-judge heatmap visualisation
├── generate-judge-comparison.py # Judge agreement analysis
├── eval-dashboard-v2/          # Live monitoring dashboard (modular JS/CSS)
│   ├── index.html
│   ├── dashboard.css
│   └── js/                     # Components, state, routing, utilities
├── claude-code-timeline.html   # Session transcript timeline viewer
├── prompts/
│   ├── minimal.md              # Terse prompt: just the API URLs
│   └── rich.md                 # Detailed prompt: blog post links + requirements
└── runs/                       # Output from each run (gitignored)
    └── <scenario>/<model>/
        ├── project/            # The dbt project Claude built
        ├── output.jsonl        # Full NDJSON transcript
        ├── summary.json        # Cost, tokens, duration, model
        ├── validation.log      # Deterministic check results
        └── judge.json          # LLM-as-judge scores (9 criteria)

Background

This eval is grounded in a hand-built dbt pipeline using the same data:

  1. Exploring UK Environment Agency Data in DuckDB and Rill -- initial data exploration
  2. Building a Data Pipeline with DuckDB -- raw SQL pipeline
  3. Ten Years Late to the dbt Party -- hand-built dbt pipeline (the reference implementation)
  4. Claude the Instructor -- using Claude as a tutor to learn dbt
