Background:
- Claude Code isn't going to replace data engineers (yet) -- results and analysis
- Claude Code in action with dbt -- watching Claude work
- Evaluating Claude's dbt Skills: Building an Eval from Scratch -- designing and building the eval (for which this repo is the code)
Can an AI coding agent build a production-quality dbt pipeline from scratch? This repo is a reproducible eval harness to find out.
It gives Claude Code an empty directory, a prompt pointing at the UK Environment Agency flood monitoring API, and bash access inside a Docker container. Claude must set up an entire dbt project from scratch -- explore the API, build models, run dbt build, fix errors, iterate -- with no human intervention.
The harness then scores what Claude built, both deterministically (does it compile? does it run?) and qualitatively (is the data model good? did it handle edge cases?).
Each eval run typically takes 15-60+ minutes and costs a few dollars in API usage, depending on the model, context length, and how many iterations Claude needs. Check Anthropic's pricing page for current rates.
- Docker
- Google Cloud SDK with Application Default Credentials
- Access to Claude models in Vertex AI Model Garden
```sh
# 1. Authenticate
gcloud auth application-default login

# 2. Build the Docker image
docker build -t dbt-claude .

# 3. Set your GCP project ID
export ANTHROPIC_VERTEX_PROJECT_ID=$(gcloud config get-value project)
```

```sh
./eval.sh A                          # Run one scenario
./eval.sh all                        # Run all six sequentially
./eval.sh --model claude-opus-4-6 A  # Specify model
```

```sh
./validate.sh runs/A-minimal-no-skills/claude-sonnet-4-6/project  # Deterministic checks
./judge.sh A                         # LLM-as-judge scoring
./dashboard.sh                       # Opens http://localhost:8080/eval-dashboard-v2/
```

The eval varies two factors to create a 2x3 matrix of six scenarios:
- Context level: Minimal (just the API URLs) vs Rich (links to blog posts with full data analysis)
- dbt skills: None, a single skill file, or the full dbt-agent-skills plugin
| Scenario | Context | dbt Skills | Command |
|---|---|---|---|
| A | Minimal | None | ./eval.sh A |
| B | Rich | None | ./eval.sh B |
| C | Minimal | Single skill file | ./eval.sh C |
| D | Rich | Single skill file | ./eval.sh D |
| E | Minimal | Full plugin (8 skills) | ./eval.sh E |
| F | Rich | Full plugin (8 skills) | ./eval.sh F |
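The matrix is small enough to enumerate by hand, but the cross product is easy to sketch. This is a hypothetical illustration only; the real scenario definitions live in eval.sh:

```python
from itertools import product

CONTEXTS = ["minimal", "rich"]
SKILLS = ["none", "single-skill", "full-plugin"]

# A-F follow the table above: the skills factor varies slowest,
# the context factor fastest.
SCENARIOS = {
    chr(ord("A") + i): {"context": ctx, "skills": sk}
    for i, (sk, ctx) in enumerate(product(SKILLS, CONTEXTS))
}
```

Enumerating the dict reproduces the table row for row, e.g. scenario A is minimal context with no skills and F is rich context with the full plugin.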
- `eval.sh` creates an empty `runs/<scenario>/<model>/project/` directory
- For skills scenarios: copies skill files or mounts the plugin directory
- Launches Claude in Docker via `run.sh` with the appropriate prompt
- Claude autonomously explores the API, sets up dbt, builds models, runs `dbt build`, fixes errors, iterates
- Output is streamed to `output.jsonl` (full NDJSON transcript)
- Summary metrics (cost, tokens, duration) are extracted to `summary.json`
- `validate.sh` runs automated checks against the generated project
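The flow ends with metric extraction from the NDJSON transcript, which amounts to a fold over its events. A minimal sketch, assuming hypothetical field names (`type`, `cost_usd`) since the actual transcript schema depends on the Claude Code version:

```python
import json

def summarise_transcript(path):
    """Fold an NDJSON transcript into summary metrics.

    The event field names used here (type, cost_usd) are illustrative,
    not the repo's actual schema.
    """
    turns, cost = 0, 0.0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate trailing blank lines
            event = json.loads(line)
            if event.get("type") == "assistant":
                turns += 1
            cost += event.get("cost_usd", 0.0)
    return {"turns": turns, "total_cost_usd": round(cost, 4)}
```

One event per line means the transcript can be summarised incrementally while a run is still in progress, which is what makes live dashboard polling cheap.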
The eval has two complementary scoring layers:
| | `validate.sh` | `judge.sh` |
|---|---|---|
| What it is | Deterministic shell script | LLM-as-judge |
| How it works | grep/find checks + `dbt build` in Docker | Sends project files + rubric to an LLM, gets scored JSON |
| Reproducible? | Yes | No -- LLM scoring has variance between runs |
| What it checks | Structure exists (files, directories, keywords) | Quality of what was built (design, correctness, completeness) |
| Can it run dbt? | Yes -- runs `dbt build` in Docker | No -- reads source files only |
| Output | `validation.log` (PASS/FAIL/WARN per check) | `judge.json` (0-3 scores on 9 criteria) |
| Cost | Free (local Docker) | A few cents per run |
Validation answers "does it work?" -- can dbt parse the project, do the models build, are the expected files present?
Judging answers "is it good?" -- are the joins correct, is the data model well-designed, did the agent discover and handle messy data?
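A few of the deterministic structural checks can be sketched like this. The file names follow a standard dbt project layout; the real validate.sh also runs `dbt build` in Docker, which this sketch omits:

```python
from pathlib import Path

def check_structure(project: Path) -> dict:
    """Illustrative 'does it work?'-style checks: each entry maps a
    check name to a boolean, mirroring PASS/FAIL lines in a log."""
    return {
        "dbt_project.yml exists": (project / "dbt_project.yml").is_file(),
        "models/ directory exists": (project / "models").is_dir(),
        "has schema/test YAML": any(project.glob("models/**/*.yml")),
    }
```

Checks like these are cheap and perfectly reproducible, which is exactly why they complement rather than replace the LLM judge.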
The scoring rubric is in rubric.md -- 9 criteria, each scored 0-3, grounded in the reference implementation. Multiple judge providers are supported:
```sh
./judge.sh A                                                      # Gemini (default)
./judge.sh --provider claude --judge-model claude-opus-4-6 B      # Claude
./judge.sh --provider ollama --judge-model qwen2.5-coder:32b all  # Local Ollama
```

Use `variance.sh` to run multiple judge trials and measure scoring consistency.
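Totalling a judge's output is simple arithmetic over the nine 0-3 criterion scores. A sketch that assumes a hypothetical `scores` mapping in judge.json (the real file layout may differ):

```python
import json

def total_score(judge_json: str) -> int:
    """Sum nine 0-3 criterion scores into a 0-27 total.

    The top-level 'scores' key is an assumption about judge.json's
    shape, used here only for illustration.
    """
    scores = json.loads(judge_json)["scores"]
    if any(not 0 <= v <= 3 for v in scores.values()):
        raise ValueError("each criterion score must be between 0 and 3")
    return sum(scores.values())
```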
The eval expects Claude to build data ingestion into the dbt project itself (e.g. using DuckDB's httpfs extension, Python pre-load scripts, or dbt-duckdb plugins). The validate.sh script runs dbt build against the generated project as-is -- if source tables don't exist because the project doesn't handle its own ingestion, that's a legitimate eval failure.
These are issues in the Environment Agency data that a good pipeline should handle, discovered during the hand-built pipeline and initial data exploration. They serve as litmus tests when scoring:
- `stationReference` vs `station`: 0.4% of records have different values -- which key is canonical?
- URL-prefixed keys: API returns keys like `http://environment.data.gov.uk/flood-monitoring/id/stations/1029TH` -- should be stripped to just `1029TH`
- Pipe-separated values: Some CSV fields contain multiple values separated by `|` where only one is expected
- Disappearing measures: Measures appear and disappear between API calls
- Duplicate notation values: Some stations have duplicate `notation` fields
- Snapshot column selection: Snapshotting all columns captures noise (e.g. measure list changes); should track only meaningful metadata
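The key-cleaning quirks reduce to small string transforms. A sketch of the two simplest ones (function names are illustrative, not taken from the repo):

```python
def strip_id_prefix(value: str) -> str:
    """Reduce a URL-prefixed key to its trailing identifier,
    e.g. '.../id/stations/1029TH' -> '1029TH'. Already-bare
    keys pass through unchanged."""
    return value.rstrip("/").rsplit("/", 1)[-1]

def first_pipe_value(value: str) -> str:
    """Keep only the first entry of a pipe-separated field
    where a single value is expected."""
    return value.split("|", 1)[0]
```

The harder quirks (disappearing measures, snapshot column selection) are modelling decisions rather than string fixes, which is why they show up in the qualitative rubric rather than the deterministic checks.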
```sh
./dashboard.sh       # Starts at http://localhost:8080/eval-dashboard-v2/
./dashboard.sh 9090  # Custom port
```

The dashboard provides:

- Eval run monitoring: Live polling of in-progress runs (turn counts, token usage, cost, duration) in a grid layout with minimise/maximise
- Session browser: Browse Claude Code sessions from `~/.claude/projects/`, or load any `.jsonl` file from disk
- Content filtering: Toggle visibility of agent text, user messages, tool calls, tool results, and thinking blocks
- Event detail: Click any event to expand -- tool inputs, side-by-side file diffs for edits, syntax-highlighted code blocks
Keyboard shortcuts: j/k scroll, G/gg jump to bottom/top, / search, n/N next/prev match.
claude-code-timeline.html is a standalone viewer for individual session transcripts, based on claude-code-timeline by Simon Willison. Supports file picker, drag-and-drop, paste, and URL fetch. The dashboard links to it for each eval run.
Beyond the eval harness, run.sh is a general-purpose wrapper for running Claude Code in Docker with Vertex AI auth. Run ./run.sh --help for full options.
```sh
./run.sh                                           # Interactive mode
./run.sh --print -p "explain the dbt models"       # One-shot prompt
./run.sh --print -p "fix the test" --max-turns 10  # Pass any Claude Code flags
./run.sh --in-place                                # Skip worktree isolation
./run.sh --no-sessions                             # Skip session capture
```

Worktree isolation: Inside a git repo, `run.sh` auto-creates an isolated worktree. Changes are auto-committed to the branch; no changes means the worktree is cleaned up. Use `--in-place` to modify the working tree directly.
How it works: Your host's ADC credentials are mounted read-only. The Docker image includes Python 3 and dbt-duckdb pre-installed. A non-root claude user satisfies Claude Code's permission requirements. Session JSONL is persisted to the host under ~/.claude/projects/docker-<encoded-path>/.
- Eval: The whole project -- the apparatus for testing how well an LLM builds dbt pipelines.
- Scenario: What we're testing -- a specific Prompt + Skill combination (e.g. "rich context with dbt skills").
- Configuration: Scenario + Model. One cell in the test matrix (e.g. "rich context with dbt skills, using Claude Opus").
- Run: One execution of a configuration. Multiple runs of the same configuration test repeatability.
- Validation: Deterministic checking -- does the project build? Are the expected files present?
- Judging: LLM-based quality assessment, scoring 9 criteria (0-3 each, max 27) against a rubric.
- Trial: One execution of a judge against a run. Multiple trials measure scoring variance.
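Trial-to-trial spread is just descriptive statistics over repeated judge totals. A sketch of the arithmetic (variance.sh handles the actual multi-trial orchestration):

```python
from statistics import mean, pstdev

def trial_stats(totals):
    """Summarise total scores (0-27 each) from repeated judge
    trials of one run: central tendency plus two spread measures."""
    return {
        "mean": mean(totals),
        "stdev": round(pstdev(totals), 2),
        "range": max(totals) - min(totals),
    }
```

A tight range across trials suggests the rubric is pinning the judge down; a wide one suggests the criteria leave too much room for interpretation.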
```
dbt-claude/
├── eval.sh                       # Eval orchestrator (runs scenarios A-F)
├── run.sh                        # Docker wrapper with Vertex AI auth
├── validate.sh                   # Post-run deterministic checks
├── judge.sh                      # LLM-as-judge scoring (Gemini/Claude/Ollama)
├── variance.sh                   # Multi-trial judge variance analysis
├── prepare.sh                    # Master script: regenerate -> judge -> dashboard
├── dashboard.sh                  # Starts dashboard server + builds manifest
├── Dockerfile                    # Docker image: node + claude CLI + python + dbt-duckdb
├── rubric.md                     # Scoring rubric (9 criteria, 0-3 each)
├── rubric-tiered.md              # Tiered rubric variant
├── scorecard.md                  # Manual scoring template
├── generate-heatmap.py           # Cross-judge heatmap visualisation
├── generate-judge-comparison.py  # Judge agreement analysis
├── eval-dashboard-v2/            # Live monitoring dashboard (modular JS/CSS)
│   ├── index.html
│   ├── dashboard.css
│   └── js/                       # Components, state, routing, utilities
├── claude-code-timeline.html     # Session transcript timeline viewer
├── prompts/
│   ├── minimal.md                # Terse prompt: just the API URLs
│   └── rich.md                   # Detailed prompt: blog post links + requirements
└── runs/                         # Output from each run (gitignored)
    └── <scenario>/<model>/
        ├── project/              # The dbt project Claude built
        ├── output.jsonl          # Full NDJSON transcript
        ├── summary.json          # Cost, tokens, duration, model
        ├── validation.log        # Deterministic check results
        └── judge.json            # LLM-as-judge scores (9 criteria)
```
This eval is grounded in a hand-built dbt pipeline using the same data:
- Exploring UK Environment Agency Data in DuckDB and Rill -- initial data exploration
- Building a Data Pipeline with DuckDB -- raw SQL pipeline
- Ten Years Late to the dbt Party -- hand-built dbt pipeline (the reference implementation)
- Claude the Instructor -- using Claude as a tutor to learn dbt