
feat: add CEL-based data query with SQLite catalog #1029

Merged
stack72 merged 1 commit into main from data-query on Apr 1, 2026

Conversation

@adamhjk (Contributor) commented Apr 1, 2026

Summary

Implements #1014 — adds swamp data query for finding data artifacts across all models using CEL predicates, backed by a SQLite metadata catalog.

  • Three entry points with full parity: CLI (--select projection), CEL expressions (data.query(predicate, select?)), and extension methods (context.queryData(predicate, select?))
  • SQLite catalog (.swamp/data/_catalog.db) with write-through updates on all data mutations, backfill-on-first-query, self-healing, and invalidation after remote datastore sync
  • Lazy content loading — attributes (parsed JSON) and content (raw text) only read from disk when the expression references them
  • DataRecord extended with modelName, modelType, specName, dataType, contentType, lifetime, ownerType, streaming, size, content — additive, backward-compatible
  • New swamp-data-query skill bundled with repo init/upgrade, replacing data search references across existing skills

Design doc: design/data-query.md

Test Plan

  • 14 DataQueryService unit tests (predicates, tags, attributes, projection, backfill, field validation)
  • 10 CatalogStore unit tests (upsert, remove, iterate, invalidation, reopen)
  • All 3937 existing tests pass
  • deno check / deno lint / deno fmt clean
  • Manual testing against live repos with --select projections

🤖 Generated with Claude Code

Closes #1014

Add `swamp data query '<predicate>'` for finding data artifacts across all
models using CEL predicates. Queries filter on artifact metadata and
optionally on JSON content or raw text.

Three entry points with full parity:
- CLI: `swamp data query '<predicate>' --select '<projection>'`
- CEL: `data.query('<predicate>', '<select>')`
- Extension methods: `context.queryData('<predicate>', '<select>')`
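
For example, the same query through each entry point (model name and size threshold are illustrative, borrowed from the review examples below):

    # CLI
    swamp data query 'modelName == "scanner" && size > 1048576' --select 'data.name'
    # CEL
    data.query('modelName == "scanner" && size > 1048576', 'data.name')
    # Extension method
    context.queryData('modelName == "scanner" && size > 1048576', 'data.name')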

Key features:
- SQLite metadata catalog (.swamp/data/_catalog.db) with write-through
  updates from all UnifiedDataRepository mutations
- Lazy content loading — attributes and content only read from disk when
  the expression references them
- CEL projection via --select (CLI) or second argument (CEL/methods) for
  extracting specific fields, custom tables, or formatted output
- Field validation on predicates with clear error messages
- Backfill-on-first-query with self-healing (deleted catalog auto-rebuilds)
- Catalog invalidation after remote datastore sync
- Markdown table rendering via renderMarkdownToTerminal
- New `content` field on DataRecord for raw text access (text/*, JSON, YAML)
- New `swamp-data-query` skill with field reference documentation

DataRecord extended with: modelName, modelType, specName, dataType,
contentType, lifetime, ownerType, streaming, size, content. All existing
DataRecord producers updated. Backward-compatible (additive fields).
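
As a rough sketch of the additive shape (field types here are guesses from the names; the authoritative definition lives in the repo):

    // Sketch only: field types are assumptions, not the repo's definition.
    interface DataRecord {
      // ...existing fields...
      modelName: string;
      modelType: string;
      specName: string;
      dataType: string;
      contentType: string;
      lifetime: string;
      ownerType: string;
      streaming: boolean;
      size: number;
      content?: string; // raw text, populated for text/*, JSON, and YAML
    }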

Closes #1014

@github-actions bot left a comment


CLI UX Review

Blocking

None.

Suggestions

  1. Missing .example() calls (src/cli/commands/data_query.ts) — Every other data subcommand (list, search) has multiple .example() entries in its help text. CEL predicates are the least-intuitive part of this feature; without examples, swamp data query --help gives users no starting point. Suggested additions:

    .example("Filter by model", "swamp data query 'modelName == \"scanner\"'" )
    .example("Filter with size threshold", "swamp data query 'size > 1048576'")
    .example("Project a single field", "swamp data query 'dataType == \"resource\"' --select data.name")
    
  2. --select description is too terse (data_query.ts:42) — "CEL projection expression" doesn't hint at what to write. Something like "CEL expression to extract fields from matching records (e.g. data.name)" gives users a concrete starting point without reading docs.

  3. JSON mode drops limited/total when --select is used (src/presentation/renderers/data_query.ts:178-194) — Without --select, the JSON envelope includes total and limited, so scripts can detect truncation. With --select, the output is just a bare array with no metadata, so a script hitting the default 100-record limit gets silently truncated data. Consider wrapping projected JSON output in { results: [...], total, limited } as well, or at least emitting limited: true when applicable (see the sketch after this list).

  4. Projected log-mode output omits count when not at limit (data_query.ts renderer, lines 169-172) — The default table always shows "N result(s)" at the bottom. The projected renderer only appends "Showing N results (limit reached)" when the limit is hit; otherwise the output ends with no count, so users get no feedback on how many records matched. Related to suggestion 3 above.

  5. --limit default inconsistency — data query defaults to 100; data search defaults to 50. Minor, but worth aligning since both commands are in the same data group.
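
A sketch of the envelope proposed in suggestion 3 (function and variable names are illustrative, not the renderer's actual API):

    // Sketch: wrap projected JSON output in the same envelope as the default mode.
    function renderProjectedJson(projectedRows: unknown[], total: number): string {
      return JSON.stringify(
        {
          results: projectedRows,                // the --select projections
          total,                                 // matches before the limit
          limited: projectedRows.length < total, // true when truncated
        },
        null,
        2,
      );
    }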

Verdict

PASS — no blocking issues. The command works correctly with both log and JSON output modes, flags are consistently named, and error messages follow the UserError pattern. The suggestions above are primarily around discoverability (missing examples) and completeness of JSON scripting metadata.


@github-actions bot left a comment


Adversarial Review

Critical / High

No critical or high severity findings.

Medium

  1. src/infrastructure/persistence/catalog_store.ts:145-149 — iterate() loads all rows into memory despite generator signature.
    The method is documented as "Uses a generator so only one row is in memory at a time", but stmt.all() on line 147 materializes the entire result set into a JS array before yielding. For a large catalog (tens of thousands of artifacts), this defeats the memory-efficiency purpose of the generator pattern. The node:sqlite DatabaseSync API doesn't support streaming cursors, so this is a limitation of the underlying library rather than a bug, but the comment is misleading. For catalogs with 100K+ rows, this could cause memory pressure.
    Suggested fix: Update the comment to accurately reflect that all rows are loaded eagerly, or if node:sqlite adds iterator support, switch to it. Alternatively, consider paged queries (LIMIT/OFFSET) if memory becomes an issue.
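
    A paged variant, should memory become an issue (table name and batch size are assumptions):

    import { DatabaseSync } from "node:sqlite";

    // Sketch: generator that pages through the catalog instead of stmt.all().
    function* iteratePaged(db: DatabaseSync): Generator<Record<string, unknown>> {
      const pageSize = 1000;
      let offset = 0;
      while (true) {
        // ORDER BY rowid keeps paging stable across batches.
        const rows = db
          .prepare("SELECT * FROM artifacts ORDER BY rowid LIMIT ? OFFSET ?")
          .all(pageSize, offset);
        for (const row of rows) yield row as Record<string, unknown>;
        if (rows.length < pageSize) break;
        offset += pageSize;
      }
    }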

  2. src/infrastructure/persistence/catalog_store.ts — close() is never called in production code paths.
    The CatalogStore is created in createRepositoryContext() (repository_factory.ts:283) but there is no cleanup lifecycle that calls close(). For CLI commands that exit quickly this is harmless (process exit cleans up), but for long-running processes like the serve/WebSocket server (src/serve/connection.ts), the SQLite connection stays open for the lifetime of the process. With WAL mode this is generally fine, but if multiple repo contexts are created (e.g., per-connection), this could leak database handles.
    Suggested fix: This is acceptable for CLIs but should be tracked for the serve path. Consider adding close() to RepositoryContext lifecycle or using a shared singleton.
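
    For the serve path, a shutdown hook along these lines would suffice (all names here are hypothetical):

    // Sketch: close the catalog during the serve command's graceful shutdown.
    interface RepositoryContextLike {
      catalogStore?: { close(): void };
    }

    function shutdownRepositoryContext(ctx: RepositoryContextLike): void {
      ctx.catalogStore?.close(); // release the SQLite handle explicitly
    }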

  3. src/domain/data/query_predicate.ts:38,48 — size appears in both QUERY_FIELDS and CEL_BUILTINS.
    The identifier size is in both sets. This doesn't cause a validation bug (the check !QUERY_FIELDS.has(id) && !CEL_BUILTINS.has(id) only flags identifiers that appear in neither set), but it means both the CEL built-in size() function (e.g., size(name) > 5) and the size data field (e.g., size > 1024) are accepted. If a user writes size > 1024, CEL resolves size as the DataRecord's size field (a number), which is the intended behavior; size(tags) is also accepted, which is correct CEL behavior. No actual bug here, but the overlap is worth noting.

  4. src/domain/data/data_query_service.ts:127-135 — Silent catch swallows all evaluation errors including programming errors.
    The bare catch on line 133 swallows any exception during CEL evaluation, including things like TypeError from unexpected null access or internal cel-js bugs. While skipping rows that fail type matching is reasonable, silently swallowing all errors can make debugging very difficult when queries return fewer results than expected and there's no indication why.
    Suggested fix: At minimum, log the error at debug/trace level so users can diagnose with --verbose.
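
    Something like the following keeps the skip-on-error behavior while leaving a trace (the matcher and debug callback are stand-ins for the real code):

    // Sketch of the evaluation loop with a debug trace (names are stand-ins).
    function evaluateRows(
      records: unknown[],
      matches: (record: unknown) => boolean,
      debug: (msg: string) => void,
    ): unknown[] {
      const results: unknown[] = [];
      for (const record of records) {
        try {
          if (matches(record)) results.push(record);
        } catch (err) {
          // Still skip the row, but leave a trace that --verbose can surface.
          debug(`data query: skipped record during evaluation: ${err}`);
        }
      }
      return results;
    }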

  5. src/domain/data/data_query_service.ts:99-100 — parsed.ast accessed without type guard.
    The code casts parsed.ast as ASTNode on line 100. If cel-js changes its internal structure or parse() returns a different shape, this will fail at runtime with an unhelpful error. Similarly, line 118 casts (selectParsed as unknown as { ast: ASTNode }).ast which is fragile.
    Not blocking: This couples to cel-js internals but is a pragmatic choice for AST inspection.
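
    A narrow runtime guard would at least fail with a useful message (sketch; the ASTNode alias stands in for the project's type):

    import { parse } from "cel-js";

    type ASTNode = unknown; // stand-in for the project's ASTNode type

    function hasAst(value: unknown): value is { ast: ASTNode } {
      return typeof value === "object" && value !== null && "ast" in value;
    }

    const parsed: unknown = parse("size > 1024");
    if (!hasAst(parsed)) {
      throw new Error("cel-js parse result has no 'ast' property; library shape may have changed");
    }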

Low

  1. src/domain/data/data_query_service.ts:82-90 — querySync called from CEL can trigger a full disk scan.
    When data.query() is used in a CEL expression during model evaluation, querySync is called. If the catalog is not populated, backfillSync() performs a synchronous scan of all data on disk (findAllGlobalSync()). For large repos, this could cause a noticeable hang during expression evaluation. This is by design (documented as "backfill-on-first-query") but worth noting as a latency risk.

  2. src/domain/data/data_query_service.ts — Nested CEL evaluation (CEL → data.query → CEL parse/eval).
    When data.query(predicate) is called from a CEL expression, it internally creates a new Environment and parse()s the predicate. Recursive nesting is not a risk: the field validation (validateFieldReferences) prevents data from appearing as a root identifier in a predicate, so a predicate cannot itself call data.query. Good design.

  3. src/infrastructure/persistence/catalog_store.ts:104-128 — Prepared statements are created per-call, not cached.
    Each call to upsert(), remove(), count(), etc. calls this.db.prepare() which parses the SQL again. For write-through updates on every data mutation, this adds overhead. SQLite does have a statement cache, but relying on node:sqlite's internal caching is implicit.
    Not blocking: Perf optimization opportunity, not a correctness issue.
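
    An explicit cache keyed by SQL text would make the reuse deliberate (sketch; StatementSync is the node:sqlite statement type):

    import { DatabaseSync, StatementSync } from "node:sqlite";

    const stmtCache = new Map<string, StatementSync>();

    // Sketch: prepare each distinct SQL string once, reuse on later calls.
    function prepareCached(db: DatabaseSync, sql: string): StatementSync {
      let stmt = stmtCache.get(sql);
      if (stmt === undefined) {
        stmt = db.prepare(sql); // parsed once, reused afterwards
        stmtCache.set(sql, stmt);
      }
      return stmt;
    }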

Verdict

PASS — The implementation is solid. The SQLite catalog with write-through updates is well-integrated into the existing data mutation paths. Field validation prevents injection of unknown identifiers. The lazy content loading (attributes/content) is a good optimization. The backfill self-healing mechanism is clean. The silent catch in predicate evaluation (Medium #4) is the most concerning pattern but is unlikely to cause data loss—just confusing debugging. No blocking issues found.


@github-actions bot left a comment


Code Review

Well-structured PR that adds a CEL-based data query system with a SQLite metadata catalog. The architecture follows DDD principles cleanly: DataQueryService as a domain service, CatalogStore as infrastructure, pure domain logic in query_predicate.ts, and the libswamp generator handling application-layer orchestration. The catalog write-through approach is solid, and the lazy content loading is a good performance decision.

Test coverage is good for the core pieces (14 DataQueryService tests, 10 CatalogStore tests covering round-trips, backfill, invalidation, and edge cases).

Blocking Issues

None.

Suggestions

  1. CatalogStore.iterate() uses stmt.all() which loads all rows into memory (src/infrastructure/persistence/catalog_store.ts:148). The design doc states "Iteration uses stmt.iterate() so that only one row is in memory at a time," but the implementation calls stmt.all() then yields from the array. For large catalogs this defeats the memory efficiency goal. Consider using stmt.iterate() from node:sqlite if the Deno runtime supports it, or at minimum update the design doc to reflect the actual behavior.

  2. Duplicate DataQueryService instantiation pattern: Both src/cli/commands/model_method_run.ts and src/serve/deps.ts have the same IIFE pattern for creating queryData:

    ((dqs) => (predicate: string, select?: string) =>
      dqs.query(predicate, { select }))(new DataQueryService(...))

    Additionally, model_method_run.ts creates two separate DataQueryService instances (one for expression evaluation, one for queryData). Consider creating it once and sharing, and extracting a small helper to reduce duplication.
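
    A small factory would replace both IIFEs (sketch; the service interface is reduced to the one method the snippet above uses):

    // Minimal stand-in for the real service type, for the sketch's sake.
    interface DataQueryServiceLike {
      query(predicate: string, opts: { select?: string }): unknown;
    }

    function makeQueryData(dqs: DataQueryServiceLike) {
      return (predicate: string, select?: string) => dqs.query(predicate, { select });
    }

    // One shared instance instead of two per command (instance name illustrative):
    // const queryData = makeQueryData(dataQueryService);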

  3. Missing direct unit tests for query_predicate.ts: The AST-walking logic in collectRootIdentifiers handles ~10 node types and is a prime candidate for direct unit tests. It's currently only tested indirectly through DataQueryService. A query_predicate_test.ts would make regressions easier to diagnose and cover edge cases (deeply nested expressions, unusual operator combinations).
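
    A direct test might look like this (import paths and the collectRootIdentifiers signature are assumptions):

    import { assertEquals } from "@std/assert";
    import { parse } from "cel-js";
    import { collectRootIdentifiers } from "./query_predicate.ts";

    Deno.test("collectRootIdentifiers finds roots in nested expressions", () => {
      // Assumes parse() exposes an ast property, as the service code implies.
      const { ast } = parse('size > 1024 && tags["env"] == "prod"') as { ast: unknown };
      assertEquals([...collectRootIdentifiers(ast)].sort(), ["size", "tags"]);
    });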

  4. Missing tests for content_type.ts: isTextContentType() is used by both DataAccessService and DataQueryService. A small test file would document the boundary cases (e.g., application/x-yaml is included, application/octet-stream is not).
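
    For example (import path assumed; the boundary cases are the ones named above):

    import { assertEquals } from "@std/assert";
    import { isTextContentType } from "./content_type.ts";

    Deno.test("isTextContentType boundary cases", () => {
      assertEquals(isTextContentType("text/plain"), true);
      assertEquals(isTextContentType("application/x-yaml"), true);
      assertEquals(isTextContentType("application/octet-stream"), false);
    });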

  5. Domain-infra ratchet bump is honest and correct (22→23 for DataQueryService importing CatalogStore). Long-term, extracting a CatalogReader interface in the domain layer would let you remove this violation, but that's not needed now.
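
    One possible shape for that port (method set inferred from the store's described surface; row type is a placeholder):

    // Lives in the domain layer; CatalogStore implements it in infrastructure.
    interface CatalogReader {
      count(): number;
      iterate(): Iterable<Record<string, unknown>>;
    }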

@stack72 merged commit 2daea13 into main on Apr 1, 2026
9 checks passed
@stack72 deleted the data-query branch on April 1, 2026 at 13:29
adamhjk added a commit that referenced this pull request Apr 1, 2026
Fixes three medium-severity issues from the adversarial review:

1. CatalogStore.iterate() now pages with LIMIT/OFFSET (1000 rows per
   batch) instead of stmt.all() which loaded all rows into memory.

2. serve command closes CatalogStore on graceful shutdown so the SQLite
   connection is not leaked for long-running processes.

3. DataQueryService logs skipped rows at debug level instead of silently
   swallowing evaluation errors, so users can diagnose with --verbose.

Also incorporates UX review suggestions: CLI examples, improved --select
description, and JSON envelope for projected output.
Linked issue: Swamp data search / query is insufficient (#1014)