Background:
- Claude Code isn't going to replace data engineers (yet) -- results and analysis
- Claude Code in action with dbt -- watching Claude work
- Evaluating Claude's dbt Skills: Building an Eval from Scratch -- designing and building the eval (for which this repo is the code)
Can an AI coding agent build a production-quality dbt pipeline from scratch? This repo is a reproducible eval harness to find out.
It gives Claude Code an empty directory, a prompt pointing at the UK Environment Agency flood monitoring API, and bash access inside a Docker container. Claude must set up an entire dbt project from scratch -- explore the API, build models, run dbt build, fix errors, iterate -- with no human intervention.
The harness then scores what Claude built, both deterministically (does it compile? does it run?) and qualitatively (is the data model good? did it handle edge cases?).
Each eval run typically takes 15-60+ minutes and costs a few dollars in API usage, depending on the model, context length, and how many iterations Claude needs. Check Anthropic's pricing page for current rates.
- Docker
- Google Cloud SDK with Application Default Credentials
- Access to Claude models in Vertex AI Model Garden
```sh
# 1. Authenticate
gcloud auth application-default login

# 2. Build the Docker image
docker build -t dbt-claude .

# 3. Set your GCP project ID
export ANTHROPIC_VERTEX_PROJECT_ID=$(gcloud config get-value project)
```

```sh
./eval.sh A                          # Run one scenario
./eval.sh all                        # Run all six sequentially
./eval.sh --model claude-opus-4-6 A  # Specify model
```

```sh
./validate.sh runs/A-minimal-no-skills/claude-sonnet-4-6/project  # Deterministic checks
./judge.sh A                         # LLM-as-judge scoring
./dashboard.sh                       # Opens http://localhost:8080/eval-dashboard-v2/
```

The eval varies two factors to create a 2x3 matrix of six scenarios:
- Context level: Minimal (just the API URLs) vs Rich (links to blog posts with full data analysis)
- dbt skills: None, a single skill file, or the full dbt-agent-skills plugin
| Scenario | Context | dbt Skills | Command |
|---|---|---|---|
| A | Minimal | None | ./eval.sh A |
| B | Rich | None | ./eval.sh B |
| C | Minimal | Single skill file | ./eval.sh C |
| D | Rich | Single skill file | ./eval.sh D |
| E | Minimal | Full plugin (8 skills) | ./eval.sh E |
| F | Rich | Full plugin (8 skills) | ./eval.sh F |
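The matrix is small enough to enumerate by hand, but the cross product is easy to sketch. This is a hypothetical illustration only; the real scenario definitions live in eval.sh:

```python
from itertools import product

CONTEXTS = ["minimal", "rich"]
SKILLS = ["none", "single-skill", "full-plugin"]

# A-F follow the table above: the skills factor varies slowest,
# the context factor fastest.
SCENARIOS = {
    chr(ord("A") + i): {"context": ctx, "skills": sk}
    for i, (sk, ctx) in enumerate(product(SKILLS, CONTEXTS))
}
```

Enumerating the dict reproduces the table row for row, e.g. scenario A is minimal context with no skills and F is rich context with the full plugin.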
- `eval.sh` creates an empty `runs/<scenario>/<model>/project/` directory
- For skills scenarios: copies skill files or mounts the plugin directory
- Launches Claude in Docker via `run.sh` with the appropriate prompt
- Claude autonomously explores the API, sets up dbt, builds models, runs `dbt build`, fixes errors, iterates
- Output is streamed to `output.jsonl` (full NDJSON transcript)
- Summary metrics (cost, tokens, duration) are extracted to `summary.json`
- `validate.sh` runs automated checks against the generated project
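The flow ends with metric extraction from the NDJSON transcript, which amounts to a fold over its events. A minimal sketch, assuming hypothetical field names (`type`, `cost_usd`) since the actual transcript schema depends on the Claude Code version:

```python
import json

def summarise_transcript(path):
    """Fold an NDJSON transcript into summary metrics.

    The event field names used here (type, cost_usd) are illustrative,
    not the repo's actual schema.
    """
    turns, cost = 0, 0.0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate trailing blank lines
            event = json.loads(line)
            if event.get("type") == "assistant":
                turns += 1
            cost += event.get("cost_usd", 0.0)
    return {"turns": turns, "total_cost_usd": round(cost, 4)}
```

One event per line means the transcript can be summarised incrementally while a run is still in progress, which is what makes live dashboard polling cheap.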
The eval has two complementary scoring layers:
| | `validate.sh` | `judge.sh` |
|---|---|---|
| What it is | Deterministic shell script | LLM-as-judge |
| How it works | grep/find checks + `dbt build` in Docker | Sends project files + rubric to an LLM, gets scored JSON |
| Reproducible? | Yes | No -- LLM scoring has variance between runs |
| What it checks | Structure exists (files, directories, keywords) | Quality of what was built (design, correctness, completeness) |
| Can it run dbt? | Yes -- runs `dbt build` in Docker | No -- reads source files only |
| Output | `validation.log` (PASS/FAIL/WARN per check) | `judge.json` (0-3 scores on 9 criteria) |
| Cost | Free (local Docker) | A few cents per run |
Validation answers "does it work?" -- can dbt parse the project, do the models build, are the expected files present?
Judging answers "is it good?" -- are the joins correct, is the data model well-designed, did the agent discover and handle messy data?
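A few of the deterministic structural checks can be sketched like this. The file names follow a standard dbt project layout; the real validate.sh also runs `dbt build` in Docker, which this sketch omits:

```python
from pathlib import Path

def check_structure(project: Path) -> dict:
    """Illustrative 'does it work?'-style checks: each entry maps a
    check name to a boolean, mirroring PASS/FAIL lines in a log."""
    return {
        "dbt_project.yml exists": (project / "dbt_project.yml").is_file(),
        "models/ directory exists": (project / "models").is_dir(),
        "has schema/test YAML": any(project.glob("models/**/*.yml")),
    }
```

Checks like these are cheap and perfectly reproducible, which is exactly why they complement rather than replace the LLM judge.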
The scoring rubric is in rubric.md -- 9 criteria, each scored 0-3, grounded in the reference implementation. Multiple judge providers are supported:
```sh
./judge.sh A                                                      # Gemini (default)
./judge.sh --provider claude --judge-model claude-opus-4-6 B      # Claude
./judge.sh --provider ollama --judge-model qwen2.5-coder:32b all  # Local Ollama
```

Use `variance.sh` to run multiple judge trials and measure scoring consistency.
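Totalling a judge's output is simple arithmetic over the nine 0-3 criterion scores. A sketch that assumes a hypothetical `scores` mapping in judge.json (the real file layout may differ):

```python
import json

def total_score(judge_json: str) -> int:
    """Sum nine 0-3 criterion scores into a 0-27 total.

    The top-level 'scores' key is an assumption about judge.json's
    shape, used here only for illustration.
    """
    scores = json.loads(judge_json)["scores"]
    if any(not 0 <= v <= 3 for v in scores.values()):
        raise ValueError("each criterion score must be between 0 and 3")
    return sum(scores.values())
```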
The eval expects Claude to build data ingestion into the dbt project itself (e.g. using DuckDB's httpfs extension, Python pre-load scripts, or dbt-duckdb plugins). The validate.sh script runs dbt build against the generated project as-is -- if source tables don't exist because the project doesn't handle its own ingestion, that's a legitimate eval failure.
These are issues in the Environment Agency data that a good pipeline should handle, discovered during the hand-built pipeline and initial data exploration. They serve as litmus tests when scoring:
- `stationReference` vs `station`: 0.4% of records have different values -- which key is canonical?
- URL-prefixed keys: API returns keys like `http://environment.data.gov.uk/flood-monitoring/id/stations/1029TH` -- should be stripped to just `1029TH`
- Pipe-separated values: Some CSV fields contain multiple values separated by `|` where only one is expected
- Disappearing measures: Measures appear and disappear between API calls
- Duplicate notation values: Some stations have duplicate `notation` fields
- Snapshot column selection: Snapshotting all columns captures noise (e.g. measure list changes); should track only meaningful metadata
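The key-cleaning quirks reduce to small string transforms. A sketch of the two simplest ones (function names are illustrative, not taken from the repo):

```python
def strip_id_prefix(value: str) -> str:
    """Reduce a URL-prefixed key to its trailing identifier,
    e.g. '.../id/stations/1029TH' -> '1029TH'. Already-bare
    keys pass through unchanged."""
    return value.rstrip("/").rsplit("/", 1)[-1]

def first_pipe_value(value: str) -> str:
    """Keep only the first entry of a pipe-separated field
    where a single value is expected."""
    return value.split("|", 1)[0]
```

The harder quirks (disappearing measures, snapshot column selection) are modelling decisions rather than string fixes, which is why they show up in the qualitative rubric rather than the deterministic checks.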
```sh
./dashboard.sh       # Starts at http://localhost:8080/eval-dashboard-v2/
./dashboard.sh 9090  # Custom port
```

The dashboard provides:

- Eval run monitoring: Live polling of in-progress runs (turn counts, token usage, cost, duration) in a grid layout with minimise/maximise
- Session browser: Browse Claude Code sessions from `~/.claude/projects/`, or load any `.jsonl` file from disk
- Content filtering: Toggle visibility of agent text, user messages, tool calls, tool results, and thinking blocks
- Event detail: Click any event to expand -- tool inputs, side-by-side file diffs for edits, syntax-highlighted code blocks
Keyboard shortcuts: j/k scroll, G/gg jump to bottom/top, / search, n/N next/prev match.
claude-code-timeline.html is a standalone viewer for individual session transcripts, based on claude-code-timeline by Simon Willison. Supports file picker, drag-and-drop, paste, and URL fetch. The dashboard links to it for each eval run.
Beyond the eval harness, run.sh is a general-purpose wrapper for running Claude Code in Docker with Vertex AI auth. Run ./run.sh --help for full options.
```sh
./run.sh                                           # Interactive mode
./run.sh --print -p "explain the dbt models"       # One-shot prompt
./run.sh --print -p "fix the test" --max-turns 10  # Pass any Claude Code flags
./run.sh --in-place                                # Skip worktree isolation
./run.sh --no-sessions                             # Skip session capture
```

Worktree isolation: Inside a git repo, `run.sh` auto-creates an isolated worktree. Changes are auto-committed to the branch; no changes means the worktree is cleaned up. Use `--in-place` to modify the working tree directly.
How it works: Your host's ADC credentials are mounted read-only. The Docker image includes Python 3 and dbt-duckdb pre-installed. A non-root claude user satisfies Claude Code's permission requirements. Session JSONL is persisted to the host under ~/.claude/projects/docker-<encoded-path>/.
- Eval: The whole project -- the apparatus for testing how well an LLM builds dbt pipelines.
- Scenario: What we're testing -- a specific Prompt + Skill combination (e.g. "rich context with dbt skills").
- Configuration: Scenario + Model. One cell in the test matrix (e.g. "rich context with dbt skills, using Claude Opus").
- Run: One execution of a configuration. Multiple runs of the same configuration test repeatability.
- Validation: Deterministic checking -- does the project build? Are the expected files present?
- Judging: LLM-based quality assessment, scoring 9 criteria (0-3 each, max 27) against a rubric.
- Trial: One execution of a judge against a run. Multiple trials measure scoring variance.
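Trial-to-trial spread is just descriptive statistics over repeated judge totals. A sketch of the arithmetic (variance.sh handles the actual multi-trial orchestration):

```python
from statistics import mean, pstdev

def trial_stats(totals):
    """Summarise total scores (0-27 each) from repeated judge
    trials of one run: central tendency plus two spread measures."""
    return {
        "mean": mean(totals),
        "stdev": round(pstdev(totals), 2),
        "range": max(totals) - min(totals),
    }
```

A tight range across trials suggests the rubric is pinning the judge down; a wide one suggests the criteria leave too much room for interpretation.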
```
dbt-claude/
├── eval.sh                       # Eval orchestrator (runs scenarios A-F)
├── run.sh                        # Docker wrapper with Vertex AI auth
├── validate.sh                   # Post-run deterministic checks
├── judge.sh                      # LLM-as-judge scoring (Gemini/Claude/Ollama)
├── variance.sh                   # Multi-trial judge variance analysis
├── prepare.sh                    # Master script: regenerate -> judge -> dashboard
├── dashboard.sh                  # Starts dashboard server + builds manifest
├── Dockerfile                    # Docker image: node + claude CLI + python + dbt-duckdb
├── rubric.md                     # Scoring rubric (9 criteria, 0-3 each)
├── rubric-tiered.md              # Tiered rubric variant
├── scorecard.md                  # Manual scoring template
├── generate-heatmap.py           # Cross-judge heatmap visualisation
├── generate-judge-comparison.py  # Judge agreement analysis
├── eval-dashboard-v2/            # Live monitoring dashboard (modular JS/CSS)
│   ├── index.html
│   ├── dashboard.css
│   └── js/                       # Components, state, routing, utilities
├── claude-code-timeline.html     # Session transcript timeline viewer
├── prompts/
│   ├── minimal.md                # Terse prompt: just the API URLs
│   └── rich.md                   # Detailed prompt: blog post links + requirements
└── runs/                         # Output from each run (gitignored)
    └── <scenario>/<model>/
        ├── project/              # The dbt project Claude built
        ├── output.jsonl          # Full NDJSON transcript
        ├── summary.json          # Cost, tokens, duration, model
        ├── validation.log        # Deterministic check results
        └── judge.json            # LLM-as-judge scores (9 criteria)
```
This eval is grounded in a hand-built dbt pipeline using the same data:
- Exploring UK Environment Agency Data in DuckDB and Rill -- initial data exploration
- Building a Data Pipeline with DuckDB -- raw SQL pipeline
- Ten Years Late to the dbt Party -- hand-built dbt pipeline (the reference implementation)
- Claude the Instructor -- using Claude as a tutor to learn dbt