
feat: add CEL-based data query with SQLite catalog #1029

Merged
stack72 merged 1 commit into main from data-query on Apr 1, 2026

Conversation

@adamhjk (Contributor) commented Apr 1, 2026

Summary

Implements #1014 — adds swamp data query for finding data artifacts across all models using CEL predicates, backed by a SQLite metadata catalog.

  • Three entry points with full parity: CLI (--select projection), CEL expressions (data.query(predicate, select?)), and extension methods (context.queryData(predicate, select?))
  • SQLite catalog (.swamp/data/_catalog.db) with write-through updates on all data mutations, backfill-on-first-query, self-healing, and invalidation after remote datastore sync
  • Lazy content loading — attributes (parsed JSON) and content (raw text) only read from disk when the expression references them
  • DataRecord extended with modelName, modelType, specName, dataType, contentType, lifetime, ownerType, streaming, size, content — additive, backward-compatible
  • New swamp-data-query skill bundled with repo init/upgrade, replacing data search references across existing skills

Design doc: design/data-query.md

Test Plan

  • 14 DataQueryService unit tests (predicates, tags, attributes, projection, backfill, field validation)
  • 10 CatalogStore unit tests (upsert, remove, iterate, invalidation, reopen)
  • All 3937 existing tests pass
  • deno check / deno lint / deno fmt clean
  • Manual testing against live repos with --select projections

🤖 Generated with Claude Code

Closes #1014

Add `swamp data query '<predicate>'` for finding data artifacts across all
models using CEL predicates. Queries filter on artifact metadata and
optionally on JSON content or raw text.

Three entry points with full parity:
- CLI: `swamp data query '<predicate>' --select '<projection>'`
- CEL: `data.query('<predicate>', '<select>')`
- Extension methods: `context.queryData('<predicate>', '<select>')`
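
For example, the same query through each entry point (model name and size threshold are illustrative, borrowed from the review examples below):

    # CLI
    swamp data query 'modelName == "scanner" && size > 1048576' --select 'data.name'
    # CEL
    data.query('modelName == "scanner" && size > 1048576', 'data.name')
    # Extension method
    context.queryData('modelName == "scanner" && size > 1048576', 'data.name')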

Key features:
- SQLite metadata catalog (.swamp/data/_catalog.db) with write-through
  updates from all UnifiedDataRepository mutations
- Lazy content loading — attributes and content only read from disk when
  the expression references them
- CEL projection via --select (CLI) or second argument (CEL/methods) for
  extracting specific fields, custom tables, or formatted output
- Field validation on predicates with clear error messages
- Backfill-on-first-query with self-healing (deleted catalog auto-rebuilds)
- Catalog invalidation after remote datastore sync
- Markdown table rendering via renderMarkdownToTerminal
- New `content` field on DataRecord for raw text access (text/*, JSON, YAML)
- New `swamp-data-query` skill with field reference documentation

DataRecord extended with: modelName, modelType, specName, dataType,
contentType, lifetime, ownerType, streaming, size, content. All existing
DataRecord producers updated. Backward-compatible (additive fields).
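
As a rough sketch of the additive shape (field types here are guesses from the names; the authoritative definition lives in the repo):

    // Sketch only: field types are assumptions, not the repo's definition.
    interface DataRecord {
      // ...existing fields...
      modelName: string;
      modelType: string;
      specName: string;
      dataType: string;
      contentType: string;
      lifetime: string;
      ownerType: string;
      streaming: boolean;
      size: number;
      content?: string; // raw text, populated for text/*, JSON, and YAML
    }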

Closes #1014

@github-actions bot left a comment


CLI UX Review

Blocking

None.

Suggestions

  1. Missing .example() calls (src/cli/commands/data_query.ts) — Every other data subcommand (list, search) has multiple .example() entries in its help text. CEL predicates are the least-intuitive part of this feature; without examples, swamp data query --help gives users no starting point. Suggested additions:

    .example("Filter by model", "swamp data query 'modelName == \"scanner\"'" )
    .example("Filter with size threshold", "swamp data query 'size > 1048576'")
    .example("Project a single field", "swamp data query 'dataType == \"resource\"' --select data.name")
    
  2. --select description is too terse (data_query.ts:42) — "CEL projection expression" doesn't hint at what to write. Something like "CEL expression to extract fields from matching records (e.g. data.name)" gives users a concrete starting point without reading docs.

  3. JSON mode drops limited/total when --select is used (src/presentation/renderers/data_query.ts:178-194) — Without --select, the JSON envelope includes total and limited, so scripts can detect truncation. With --select, the output is just a bare array with no metadata, so a script hitting the default 100-record limit gets silently truncated data. Consider wrapping projected JSON output in { results: [...], total, limited } as well, or at least emitting limited: true when applicable (see the sketch after this list).

  4. Projected log-mode output omits count when not at limit (data_query.ts renderer, lines 169-172) — The default table always shows "N result(s)" at the bottom. The projected renderer only appends "Showing N results (limit reached)" when the limit is hit; otherwise the output ends with no count, so users get no feedback on how many records matched. Related to suggestion 3 above.

  5. --limit default inconsistency — data query defaults to 100; data search defaults to 50. Minor, but worth aligning since both commands are in the same data group.
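
A sketch of the envelope proposed in suggestion 3 (function and variable names are illustrative, not the renderer's actual API):

    // Sketch: wrap projected JSON output in the same envelope as the default mode.
    function renderProjectedJson(projectedRows: unknown[], total: number): string {
      return JSON.stringify(
        {
          results: projectedRows,                // the --select projections
          total,                                 // matches before the limit
          limited: projectedRows.length < total, // true when truncated
        },
        null,
        2,
      );
    }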

Verdict

PASS — no blocking issues. The command works correctly with both log and JSON output modes, flags are consistently named, and error messages follow the UserError pattern. The suggestions above are primarily around discoverability (missing examples) and completeness of JSON scripting metadata.


@github-actions bot left a comment


Adversarial Review

Critical / High

No critical or high severity findings.

Medium

  1. src/infrastructure/persistence/catalog_store.ts:145-149 — iterate() loads all rows into memory despite generator signature.
    The method is documented as "Uses a generator so only one row is in memory at a time", but stmt.all() on line 147 materializes the entire result set into a JS array before yielding. For a large catalog (tens of thousands of artifacts), this defeats the memory-efficiency purpose of the generator pattern. The node:sqlite DatabaseSync API doesn't support streaming cursors, so this is a limitation of the underlying library rather than a bug, but the comment is misleading. For catalogs with 100K+ rows, this could cause memory pressure.
    Suggested fix: Update the comment to accurately reflect that all rows are loaded eagerly, or if node:sqlite adds iterator support, switch to it. Alternatively, consider paged queries (LIMIT/OFFSET) if memory becomes an issue.
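
    A paged variant, should memory become an issue (table name and batch size are assumptions):

    import { DatabaseSync } from "node:sqlite";

    // Sketch: generator that pages through the catalog instead of stmt.all().
    function* iteratePaged(db: DatabaseSync): Generator<Record<string, unknown>> {
      const pageSize = 1000;
      let offset = 0;
      while (true) {
        // ORDER BY rowid keeps paging stable across batches.
        const rows = db
          .prepare("SELECT * FROM artifacts ORDER BY rowid LIMIT ? OFFSET ?")
          .all(pageSize, offset);
        for (const row of rows) yield row as Record<string, unknown>;
        if (rows.length < pageSize) break;
        offset += pageSize;
      }
    }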

  2. src/infrastructure/persistence/catalog_store.ts — close() is never called in production code paths.
    The CatalogStore is created in createRepositoryContext() (repository_factory.ts:283) but there is no cleanup lifecycle that calls close(). For CLI commands that exit quickly this is harmless (process exit cleans up), but for long-running processes like the serve/WebSocket server (src/serve/connection.ts), the SQLite connection stays open for the lifetime of the process. With WAL mode this is generally fine, but if multiple repo contexts are created (e.g., per-connection), this could leak database handles.
    Suggested fix: This is acceptable for CLIs but should be tracked for the serve path. Consider adding close() to RepositoryContext lifecycle or using a shared singleton.
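
    For the serve path, a shutdown hook along these lines would suffice (all names here are hypothetical):

    // Sketch: close the catalog during the serve command's graceful shutdown.
    interface RepositoryContextLike {
      catalogStore?: { close(): void };
    }

    function shutdownRepositoryContext(ctx: RepositoryContextLike): void {
      ctx.catalogStore?.close(); // release the SQLite handle explicitly
    }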

  3. src/domain/data/query_predicate.ts:38,48 — size appears in both QUERY_FIELDS and CEL_BUILTINS.
    The identifier size is in both sets. This doesn't cause a validation bug (the check !QUERY_FIELDS.has(id) && !CEL_BUILTINS.has(id) only flags identifiers that appear in neither set), but it means both the CEL built-in size() function (e.g., size(name) > 5) and the size data field (e.g., size > 1024) are accepted. If a user writes size > 1024, CEL resolves size as the DataRecord's size field (a number), which is the intended behavior; size(tags) is also accepted, which is correct CEL behavior. No actual bug here, but the overlap is worth noting.

  4. src/domain/data/data_query_service.ts:127-135 — Silent catch swallows all evaluation errors including programming errors.
    The bare catch on line 133 swallows any exception during CEL evaluation, including things like TypeError from unexpected null access or internal cel-js bugs. While skipping rows that fail type matching is reasonable, silently swallowing all errors can make debugging very difficult when queries return fewer results than expected and there's no indication why.
    Suggested fix: At minimum, log the error at debug/trace level so users can diagnose with --verbose.
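
    Something like the following keeps the skip-on-error behavior while leaving a trace (the matcher and debug callback are stand-ins for the real code):

    // Sketch of the evaluation loop with a debug trace (names are stand-ins).
    function evaluateRows(
      records: unknown[],
      matches: (record: unknown) => boolean,
      debug: (msg: string) => void,
    ): unknown[] {
      const results: unknown[] = [];
      for (const record of records) {
        try {
          if (matches(record)) results.push(record);
        } catch (err) {
          // Still skip the row, but leave a trace that --verbose can surface.
          debug(`data query: skipped record during evaluation: ${err}`);
        }
      }
      return results;
    }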

  5. src/domain/data/data_query_service.ts:99-100 — parsed.ast accessed without type guard.
    The code casts parsed.ast as ASTNode on line 100. If cel-js changes its internal structure or parse() returns a different shape, this will fail at runtime with an unhelpful error. Similarly, line 118 casts (selectParsed as unknown as { ast: ASTNode }).ast which is fragile.
    Not blocking: This couples to cel-js internals but is a pragmatic choice for AST inspection.
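
    A narrow runtime guard would at least fail with a useful message (sketch; the ASTNode alias stands in for the project's type):

    import { parse } from "cel-js";

    type ASTNode = unknown; // stand-in for the project's ASTNode type

    function hasAst(value: unknown): value is { ast: ASTNode } {
      return typeof value === "object" && value !== null && "ast" in value;
    }

    const parsed: unknown = parse("size > 1024");
    if (!hasAst(parsed)) {
      throw new Error("cel-js parse result has no 'ast' property; library shape may have changed");
    }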

Low

  1. src/domain/data/data_query_service.ts:82-90 — querySync called from CEL can trigger a full disk scan.
    When data.query() is used in a CEL expression during model evaluation, querySync is called. If the catalog is not populated, backfillSync() performs a synchronous scan of all data on disk (findAllGlobalSync()). For large repos, this could cause a noticeable hang during expression evaluation. This is by design (documented as "backfill-on-first-query") but worth noting as a latency risk.

  2. src/domain/data/data_query_service.ts — Nested CEL evaluation (CEL → data.query → CEL parse/eval).
    When data.query(predicate) is called from a CEL expression, it internally creates a new Environment and parse()s the predicate. Recursive nesting is not a risk: the field validation (validateFieldReferences) prevents data from appearing as a root identifier in a predicate, so a predicate cannot itself call data.query. Good design.

  3. src/infrastructure/persistence/catalog_store.ts:104-128 — Prepared statements are created per-call, not cached.
    Each call to upsert(), remove(), count(), etc. calls this.db.prepare() which parses the SQL again. For write-through updates on every data mutation, this adds overhead. SQLite does have a statement cache, but relying on node:sqlite's internal caching is implicit.
    Not blocking: Perf optimization opportunity, not a correctness issue.
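
    An explicit cache keyed by SQL text would make the reuse deliberate (sketch; StatementSync is the node:sqlite statement type):

    import { DatabaseSync, StatementSync } from "node:sqlite";

    const stmtCache = new Map<string, StatementSync>();

    // Sketch: prepare each distinct SQL string once, reuse on later calls.
    function prepareCached(db: DatabaseSync, sql: string): StatementSync {
      let stmt = stmtCache.get(sql);
      if (stmt === undefined) {
        stmt = db.prepare(sql); // parsed once, reused afterwards
        stmtCache.set(sql, stmt);
      }
      return stmt;
    }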

Verdict

PASS — The implementation is solid. The SQLite catalog with write-through updates is well-integrated into the existing data mutation paths. Field validation prevents injection of unknown identifiers. The lazy content loading (attributes/content) is a good optimization. The backfill self-healing mechanism is clean. The silent catch in predicate evaluation (Medium #4) is the most concerning pattern but is unlikely to cause data loss—just confusing debugging. No blocking issues found.


@github-actions bot left a comment


Code Review

Well-structured PR that adds a CEL-based data query system with a SQLite metadata catalog. The architecture follows DDD principles cleanly: DataQueryService as a domain service, CatalogStore as infrastructure, pure domain logic in query_predicate.ts, and the libswamp generator handling application-layer orchestration. The catalog write-through approach is solid, and the lazy content loading is a good performance decision.

Test coverage is good for the core pieces (14 DataQueryService tests, 10 CatalogStore tests covering round-trips, backfill, invalidation, and edge cases).

Blocking Issues

None.

Suggestions

  1. CatalogStore.iterate() uses stmt.all() which loads all rows into memory (src/infrastructure/persistence/catalog_store.ts:148). The design doc states "Iteration uses stmt.iterate() so that only one row is in memory at a time," but the implementation calls stmt.all() then yields from the array. For large catalogs this defeats the memory efficiency goal. Consider using stmt.iterate() from node:sqlite if the Deno runtime supports it, or at minimum update the design doc to reflect the actual behavior.

  2. Duplicate DataQueryService instantiation pattern: Both src/cli/commands/model_method_run.ts and src/serve/deps.ts have the same IIFE pattern for creating queryData:

    ((dqs) => (predicate: string, select?: string) =>
      dqs.query(predicate, { select }))(new DataQueryService(...))

    Additionally, model_method_run.ts creates two separate DataQueryService instances (one for expression evaluation, one for queryData). Consider creating it once and sharing, and extracting a small helper to reduce duplication.
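
    A small factory would replace both IIFEs (sketch; the service interface is reduced to the one method the snippet above uses):

    // Minimal stand-in for the real service type, for the sketch's sake.
    interface DataQueryServiceLike {
      query(predicate: string, opts: { select?: string }): unknown;
    }

    function makeQueryData(dqs: DataQueryServiceLike) {
      return (predicate: string, select?: string) => dqs.query(predicate, { select });
    }

    // One shared instance instead of two per command (instance name illustrative):
    // const queryData = makeQueryData(dataQueryService);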

  3. Missing direct unit tests for query_predicate.ts: The AST-walking logic in collectRootIdentifiers handles ~10 node types and is a prime candidate for direct unit tests. It's currently only tested indirectly through DataQueryService. A query_predicate_test.ts would make regressions easier to diagnose and cover edge cases (deeply nested expressions, unusual operator combinations).
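
    A direct test might look like this (import paths and the collectRootIdentifiers signature are assumptions):

    import { assertEquals } from "@std/assert";
    import { parse } from "cel-js";
    import { collectRootIdentifiers } from "./query_predicate.ts";

    Deno.test("collectRootIdentifiers finds roots in nested expressions", () => {
      // Assumes parse() exposes an ast property, as the service code implies.
      const { ast } = parse('size > 1024 && tags["env"] == "prod"') as { ast: unknown };
      assertEquals([...collectRootIdentifiers(ast)].sort(), ["size", "tags"]);
    });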

  4. Missing tests for content_type.ts: isTextContentType() is used by both DataAccessService and DataQueryService. A small test file would document the boundary cases (e.g., application/x-yaml is included, application/octet-stream is not).
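
    For example (import path assumed; the boundary cases are the ones named above):

    import { assertEquals } from "@std/assert";
    import { isTextContentType } from "./content_type.ts";

    Deno.test("isTextContentType boundary cases", () => {
      assertEquals(isTextContentType("text/plain"), true);
      assertEquals(isTextContentType("application/x-yaml"), true);
      assertEquals(isTextContentType("application/octet-stream"), false);
    });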

  5. Domain-infra ratchet bump is honest and correct (22→23 for DataQueryService importing CatalogStore). Long-term, extracting a CatalogReader interface in the domain layer would let you remove this violation, but that's not needed now.
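
    One possible shape for that port (method set inferred from the store's described surface; row type is a placeholder):

    // Lives in the domain layer; CatalogStore implements it in infrastructure.
    interface CatalogReader {
      count(): number;
      iterate(): Iterable<Record<string, unknown>>;
    }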

@stack72 merged commit 2daea13 into main on Apr 1, 2026
9 checks passed
@stack72 deleted the data-query branch on April 1, 2026 at 13:29
adamhjk added a commit that referenced this pull request Apr 1, 2026
Fixes three medium-severity issues from the adversarial review:

1. CatalogStore.iterate() now pages with LIMIT/OFFSET (1000 rows per
   batch) instead of stmt.all() which loaded all rows into memory.

2. serve command closes CatalogStore on graceful shutdown so the SQLite
   connection is not leaked for long-running processes.

3. DataQueryService logs skipped rows at debug level instead of silently
   swallowing evaluation errors, so users can diagnose with --verbose.

Also incorporates UX review suggestions: CLI examples, improved --select
description, and JSON envelope for projected output.
Linked issue: Swamp data search / query is insufficient (#1014)