feat: add CEL-based data query with SQLite catalog (#1029)
Add `swamp data query '<predicate>'` for finding data artifacts across all
models using CEL predicates. Queries filter on artifact metadata and
optionally on JSON content or raw text.
Three entry points with full parity:
- CLI: `swamp data query '<predicate>' --select '<projection>'`
- CEL: `data.query('<predicate>', '<select>')`
- Extension methods: `context.queryData('<predicate>', '<select>')`
Key features:
- SQLite metadata catalog (.swamp/data/_catalog.db) with write-through
updates from all UnifiedDataRepository mutations
- Lazy content loading — attributes and content only read from disk when
the expression references them
- CEL projection via --select (CLI) or second argument (CEL/methods) for
extracting specific fields, custom tables, or formatted output
- Field validation on predicates with clear error messages
- Backfill-on-first-query with self-healing (deleted catalog auto-rebuilds)
- Catalog invalidation after remote datastore sync
- Markdown table rendering via renderMarkdownToTerminal
- New `content` field on DataRecord for raw text access (text/*, JSON, YAML)
- New `swamp-data-query` skill with field reference documentation
DataRecord extended with: modelName, modelType, specName, dataType,
contentType, lifetime, ownerType, streaming, size, content. All existing
DataRecord producers updated. Backward-compatible (additive fields).
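To make the entry-point parity concrete, here is a minimal sketch of the extended `DataRecord` shape and a `queryData`-style call. The field names come from the PR description; the callback-based stub evaluator is an illustration only (the real feature evaluates a CEL predicate string).

```typescript
// Hypothetical sketch — field names from the PR description, stub logic assumed.
interface DataRecord {
  name: string;
  modelName: string;
  modelType: string;
  specName: string;
  dataType: string;
  contentType: string;
  lifetime: string;
  ownerType: string;
  streaming: boolean;
  size: number;
  content?: string; // new: raw text access, loaded lazily when referenced
}

// A plain callback stands in for the CEL predicate so the sketch is self-contained.
function queryData(
  records: DataRecord[],
  predicate: (r: DataRecord) => boolean,
): DataRecord[] {
  return records.filter(predicate);
}

const records: DataRecord[] = [
  { name: "report", modelName: "scanner", modelType: "tool", specName: "s1",
    dataType: "resource", contentType: "application/json", lifetime: "run",
    ownerType: "model", streaming: false, size: 2048 },
];

// CLI equivalent: swamp data query 'modelName == "scanner" && size > 1024'
const hits = queryData(records, (r) => r.modelName === "scanner" && r.size > 1024);
console.log(hits.length);
```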
Closes #1014
CLI UX Review
Blocking
None.
Suggestions
- **Missing `.example()` calls** (`src/cli/commands/data_query.ts`) — Every other `data` subcommand (`list`, `search`) has multiple `.example()` entries in its help text. CEL predicates are the least-intuitive part of this feature; without examples, `swamp data query --help` gives users no starting point. Suggested additions: `.example("Filter by model", "swamp data query 'modelName == \"scanner\"'")`, `.example("Filter with size threshold", "swamp data query 'size > 1048576'")`, and `.example("Project a single field", "swamp data query 'dataType == \"resource\"' --select data.name")`.
- **`--select` description is too terse** (`data_query.ts:42`) — "CEL projection expression" doesn't hint at what to write. Something like "CEL expression to extract fields from matching records (e.g. `data.name`)" gives users a concrete starting point without reading docs.
- **JSON mode drops `limited`/`total` when `--select` is used** (`src/presentation/renderers/data_query.ts:178-194`) — Without `--select`, the JSON envelope includes `total` and `limited`, so scripts can detect truncation. With `--select`, the output is just the bare array with no metadata. A script hitting the default 100-record limit gets silently truncated data. Consider wrapping projected JSON output in `{ results: [...], total, limited }` as well, or at least emitting `limited: true` when applicable.
- **Projected log-mode output omits count when not at limit** (`data_query.ts` renderer, lines 169-172) — The default table always shows "N result(s)" at the bottom. The projected renderer only appends "Showing N results (limit reached)" when the limit is hit; otherwise the output ends with no count, so users get no feedback on how many records matched. Consistent with #3 above.
- **`--limit` default inconsistency** — `data query` defaults to 100; `data search` defaults to 50. Minor, but worth aligning since both commands are in the same `data` group.
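The `{ results, total, limited }` envelope suggested for projected JSON output could look like the following sketch. The names `ProjectedEnvelope` and `wrapProjected` are hypothetical, not code from the PR.

```typescript
// Hypothetical sketch of a consistent JSON envelope for --select output.
interface ProjectedEnvelope<T> {
  results: T[];   // the projected records
  total: number;  // how many records matched before the limit was applied
  limited: boolean; // true when results were truncated at --limit
}

function wrapProjected<T>(
  results: T[],
  total: number,
  limit: number,
): ProjectedEnvelope<T> {
  return { results, total, limited: total > limit };
}

// A script can now detect truncation even in projected mode.
const envelope = wrapProjected(["a", "b"], 150, 100);
console.log(JSON.stringify(envelope));
```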
Verdict
PASS — no blocking issues. The command works correctly with both log and JSON output modes, flags are consistently named, and error messages follow the UserError pattern. The suggestions above are primarily around discoverability (missing examples) and completeness of JSON scripting metadata.
Adversarial Review
Critical / High
No critical or high severity findings.
Medium
- **`src/infrastructure/persistence/catalog_store.ts:145-149` — `iterate()` loads all rows into memory despite its generator signature.** The method is documented as "Uses a generator so only one row is in memory at a time", but `stmt.all()` on line 147 materializes the entire result set into a JS array before yielding. For a large catalog (tens of thousands of artifacts), this defeats the memory-efficiency purpose of the generator pattern. The `node:sqlite` `DatabaseSync` API doesn't support streaming cursors, so this is a limitation of the underlying library rather than a bug, but the comment is misleading. For catalogs with 100K+ rows, this could cause memory pressure. Suggested fix: update the comment to accurately reflect that all rows are loaded eagerly, or switch to iterator support if `node:sqlite` adds it. Alternatively, consider paged queries (LIMIT/OFFSET) if memory becomes an issue.
- **`src/infrastructure/persistence/catalog_store.ts` — `close()` is never called in production code paths.** The `CatalogStore` is created in `createRepositoryContext()` (`repository_factory.ts:283`), but there is no cleanup lifecycle that calls `close()`. For CLI commands that exit quickly this is harmless (process exit cleans up), but for long-running processes like the serve/WebSocket server (`src/serve/connection.ts`), the SQLite connection stays open for the lifetime of the process. With WAL mode this is generally fine, but if multiple repo contexts are created (e.g., per-connection), this could leak database handles. Suggested fix: acceptable for CLIs, but should be tracked for the serve path. Consider adding `close()` to the `RepositoryContext` lifecycle or using a shared singleton.
- **`src/domain/data/query_predicate.ts:38,48` — `size` appears in both `QUERY_FIELDS` and `CEL_BUILTINS`.** The identifier `size` is in both sets. This doesn't cause a validation bug (the check is `!QUERY_FIELDS.has(id) && !CEL_BUILTINS.has(id)` — both must be false to flag it), but it means the CEL built-in `size()` function (e.g., `size(name) > 5`) and the `size` data field (e.g., `size > 1024`) are both accepted. If a user writes `size > 1024`, CEL resolves `size` as the DataRecord's `size` field (a number), which is the intended behavior; `size(tags)` is also accepted, which is correct CEL behavior. No actual bug here, but the overlap is worth noting.
- **`src/domain/data/data_query_service.ts:127-135` — Silent catch swallows all evaluation errors, including programming errors.** The bare `catch` on line 133 swallows any exception during CEL evaluation, including things like a `TypeError` from unexpected null access or internal cel-js bugs. While skipping rows that fail type matching is reasonable, silently swallowing all errors can make debugging very difficult when queries return fewer results than expected and there's no indication why. Suggested fix: at minimum, log the error at debug/trace level so users can diagnose with `--verbose`.
- **`src/domain/data/data_query_service.ts:99-100` — `parsed.ast` accessed without a type guard.** The code casts `parsed.ast as ASTNode` on line 100. If cel-js changes its internal structure or `parse()` returns a different shape, this will fail at runtime with an unhelpful error. Similarly, line 118 casts `(selectParsed as unknown as { ast: ASTNode }).ast`, which is fragile. Not blocking: this couples to cel-js internals but is a pragmatic choice for AST inspection.
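The paged-query alternative mentioned in the first Medium finding could be sketched as follows. An in-memory stub stands in for `node:sqlite` here (whose `DatabaseSync` API has no streaming cursor), and `fetchPage` is a hypothetical stand-in for a prepared `SELECT ... LIMIT ? OFFSET ?` statement.

```typescript
// Sketch of LIMIT/OFFSET paging so at most one batch is resident at a time.
const BATCH = 1000;

function* iterate<T>(
  fetchPage: (limit: number, offset: number) => T[],
): Generator<T> {
  let offset = 0;
  while (true) {
    const rows = fetchPage(BATCH, offset); // one statement execution per page
    for (const row of rows) yield row;
    if (rows.length < BATCH) break; // short page => catalog exhausted
    offset += BATCH;
  }
}

// Stub "catalog" of 2500 rows; only one 1000-row slice exists at a time.
const catalog = Array.from({ length: 2500 }, (_, i) => i);
const fetchPage = (limit: number, offset: number) =>
  catalog.slice(offset, offset + limit);

let count = 0;
for (const _row of iterate(fetchPage)) count++;
console.log(count);
```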
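The debug-level logging suggested for the silent-catch finding could be shaped like this sketch. `Logger` and `evaluateRows` are hypothetical names, not the service's actual API.

```typescript
// Sketch: skip rows that fail CEL evaluation, but leave a debug trace
// so --verbose can explain why results are missing.
type Logger = { debug: (msg: string) => void };

function evaluateRows<T>(
  rows: T[],
  evaluate: (row: T) => boolean,
  logger: Logger,
): T[] {
  const matched: T[] = [];
  for (const row of rows) {
    try {
      if (evaluate(row)) matched.push(row);
    } catch (err) {
      // Still skip the row, but no longer silently.
      logger.debug(`data query: skipped row during CEL evaluation: ${err}`);
    }
  }
  return matched;
}

const logs: string[] = [];
const result = evaluateRows(
  [1, 2, 3],
  (n) => {
    if (n === 2) throw new TypeError("unexpected null access"); // simulated bug
    return n > 1;
  },
  { debug: (m) => logs.push(m) },
);
console.log(result.length, logs.length);
```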
Low
- **`src/domain/data/data_query_service.ts:82-90` — `querySync` called from CEL can trigger a full disk scan.** When `data.query()` is used in a CEL expression during model evaluation, `querySync` is called. If the catalog is not populated, `backfillSync()` performs a synchronous scan of all data on disk (`findAllGlobalSync()`). For large repos, this could cause a noticeable hang during expression evaluation. This is by design (documented as "backfill-on-first-query") but worth noting as a latency risk.
- **`src/domain/data/data_query_service.ts` — Nested CEL evaluation (CEL → `data.query` → CEL parse/eval).** When `data.query(predicate)` is called from a CEL expression, it internally creates a new `Environment` and `parse()`s the predicate. A user cannot craft a recursive query: the field validation (`validateFieldReferences`) prevents `data` from appearing as a root identifier in the predicate, so recursive `data.query` calls are blocked. Good design.
- **`src/infrastructure/persistence/catalog_store.ts:104-128` — Prepared statements are created per call, not cached.** Each call to `upsert()`, `remove()`, `count()`, etc. calls `this.db.prepare()`, which parses the SQL again. For write-through updates on every data mutation, this adds overhead. SQLite does have a statement cache, but relying on `node:sqlite`'s internal caching is implicit. Not blocking: a perf optimization opportunity, not a correctness issue.
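The per-call `prepare()` overhead noted above could be avoided with a small cache keyed by SQL string. This is a sketch: `Statement` and `StatementCache` are hypothetical, with `prepare` standing in for `DatabaseSync.prepare` from `node:sqlite`.

```typescript
// Sketch: parse each distinct SQL string once, reuse the statement afterward.
type Statement = { sql: string };

class StatementCache {
  private cache = new Map<string, Statement>();
  constructor(private prepare: (sql: string) => Statement) {}

  get(sql: string): Statement {
    let stmt = this.cache.get(sql);
    if (!stmt) {
      stmt = this.prepare(sql); // only reached on first use of this SQL
      this.cache.set(sql, stmt);
    }
    return stmt;
  }
}

// Count how many times the underlying prepare actually runs.
let prepares = 0;
const cache = new StatementCache((sql) => {
  prepares++;
  return { sql };
});
cache.get("SELECT COUNT(*) FROM catalog");
cache.get("SELECT COUNT(*) FROM catalog"); // cache hit, no re-parse
console.log(prepares);
```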
Verdict
PASS — The implementation is solid. The SQLite catalog with write-through updates is well-integrated into the existing data mutation paths. Field validation prevents injection of unknown identifiers. The lazy content loading (attributes/content) is a good optimization. The backfill self-healing mechanism is clean. The silent catch in predicate evaluation (Medium #4) is the most concerning pattern but is unlikely to cause data loss—just confusing debugging. No blocking issues found.
Code Review
Well-structured PR that adds a CEL-based data query system with a SQLite metadata catalog. The architecture follows DDD principles cleanly: DataQueryService as a domain service, CatalogStore as infrastructure, pure domain logic in query_predicate.ts, and the libswamp generator handling application-layer orchestration. The catalog write-through approach is solid, and the lazy content loading is a good performance decision.
Test coverage is good for the core pieces (14 DataQueryService tests, 10 CatalogStore tests covering round-trips, backfill, invalidation, and edge cases).
Blocking Issues
None.
Suggestions
- **`CatalogStore.iterate()` uses `stmt.all()`, which loads all rows into memory** (`src/infrastructure/persistence/catalog_store.ts:148`). The design doc states "Iteration uses `stmt.iterate()` so that only one row is in memory at a time," but the implementation calls `stmt.all()` then yields from the array. For large catalogs this defeats the memory-efficiency goal. Consider using `stmt.iterate()` from `node:sqlite` if the Deno runtime supports it, or at minimum update the design doc to reflect the actual behavior.
- **Duplicate `DataQueryService` instantiation pattern.** Both `src/cli/commands/model_method_run.ts` and `src/serve/deps.ts` have the same IIFE pattern for creating `queryData`: `((dqs) => (predicate: string, select?: string) => dqs.query(predicate, { select }))(new DataQueryService(...))`. Additionally, `model_method_run.ts` creates two separate `DataQueryService` instances (one for expression evaluation, one for `queryData`). Consider creating it once and sharing, and extracting a small helper to reduce duplication.
- **Missing direct unit tests for `query_predicate.ts`.** The AST-walking logic in `collectRootIdentifiers` handles ~10 node types and is a prime candidate for direct unit tests; it's currently only tested indirectly through `DataQueryService`. A `query_predicate_test.ts` would make regressions easier to diagnose and cover edge cases (deeply nested expressions, unusual operator combinations).
- **Missing tests for `content_type.ts`.** `isTextContentType()` is used by both `DataAccessService` and `DataQueryService`. A small test file would document the boundary cases (e.g., `application/x-yaml` is included, `application/octet-stream` is not).
- **Domain-infra ratchet bump is honest and correct** (22→23 for `DataQueryService` importing `CatalogStore`). Long-term, extracting a `CatalogReader` interface in the domain layer would let you remove this violation, but that's not needed now.
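The helper extraction suggested for the duplicated IIFE could be as small as this sketch. `makeQueryData` is a hypothetical name, and the `DataQueryService` below is a minimal stand-in for the real class.

```typescript
// Sketch: one shared helper instead of the same IIFE in two files.
interface QueryOptions {
  select?: string;
}

// Stand-in for the real DataQueryService; returns a string so the demo runs.
class DataQueryService {
  query(predicate: string, options: QueryOptions): string {
    return `${predicate}|${options.select ?? ""}`;
  }
}

// Replaces ((dqs) => (predicate, select?) => dqs.query(predicate, { select }))(new DataQueryService(...))
function makeQueryData(dqs: DataQueryService) {
  return (predicate: string, select?: string) => dqs.query(predicate, { select });
}

const queryData = makeQueryData(new DataQueryService());
const out = queryData("size > 1024", "data.name");
console.log(out);
```

Both call sites could then share a single `DataQueryService` instance passed to `makeQueryData`.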
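A direct test for `collectRootIdentifiers` could start from a sketch like the one below. The simplified node shape and walker are stand-ins for the real AST logic in `query_predicate.ts`, which handles ~10 node types.

```typescript
// Sketch: a trivial AST walker that gathers root identifiers, plus one test case.
interface Node {
  type: string;
  name?: string;
  children?: Node[];
}

function collectRootIdentifiers(node: Node, out = new Set<string>()): Set<string> {
  if (node.type === "ident" && node.name) out.add(node.name);
  for (const child of node.children ?? []) collectRootIdentifiers(child, out);
  return out;
}

// Hand-built AST for: size > 1024 && modelName == "scanner"
const ast: Node = {
  type: "and",
  children: [
    { type: "gt", children: [{ type: "ident", name: "size" }, { type: "lit" }] },
    { type: "eq", children: [{ type: "ident", name: "modelName" }, { type: "lit" }] },
  ],
};

const found = [...collectRootIdentifiers(ast)].sort().join(",");
console.log(found);
```

Real tests would exercise deeply nested expressions and unusual operator combinations against the actual cel-js node types.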
Addressed the three medium-severity issues from the adversarial review:
1. `CatalogStore.iterate()` now pages with LIMIT/OFFSET (1000 rows per batch) instead of `stmt.all()`, which loaded all rows into memory.
2. The serve command closes the `CatalogStore` on graceful shutdown, so the SQLite connection is not leaked in long-running processes.
3. `DataQueryService` logs skipped rows at debug level instead of silently swallowing evaluation errors, so users can diagnose with `--verbose`.

Also incorporates the UX review suggestions: CLI examples, an improved `--select` description, and a JSON envelope for projected output.
Summary
Implements #1014 — adds `swamp data query` for finding data artifacts across all models using CEL predicates, backed by a SQLite metadata catalog.
- Three entry points with full parity: CLI (`--select` projection), CEL expressions (`data.query(predicate, select?)`), and extension methods (`context.queryData(predicate, select?)`)
- SQLite metadata catalog (`.swamp/data/_catalog.db`) with write-through updates on all data mutations, backfill-on-first-query, self-healing, and invalidation after remote datastore sync
- Lazy loading: `attributes` (parsed JSON) and `content` (raw text) only read from disk when the expression references them
- DataRecord extended with `modelName`, `modelType`, `specName`, `dataType`, `contentType`, `lifetime`, `ownerType`, `streaming`, `size`, `content` — additive, backward-compatible
- New `swamp-data-query` skill bundled with repo init/upgrade, replaces `data search` references across existing skills

Design doc: `design/data-query.md`

Test Plan
- `deno check` / `deno lint` / `deno fmt` clean
- `--select` projections

🤖 Generated with Claude Code
Closes #1014