
Revert "fix-hono-run-api" #133

Merged
robert-inkeep merged 1 commit into main from revert-130-fix-hono-run-api
Sep 12, 2025
@robert-inkeep
Collaborator

Reverts #130

@vercel

vercel Bot commented Sep 12, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-manage-api | Ready | Preview | Comment | Sep 12, 2025 2:36am |
| agents-manage-ui | Ready | Preview | Comment | Sep 12, 2025 2:36am |
| agents-run-api | Ready | Preview | Comment | Sep 12, 2025 2:36am |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-docs | Skipped | | | Sep 12, 2025 2:36am |

@vercel vercel Bot temporarily deployed to Preview – agents-docs September 12, 2025 02:34 Inactive
@claude
Contributor

claude Bot commented Sep 12, 2025

Claude finished @robert-inkeep's task


Code Review for Revert PR #133

Executive Summary

This PR reverts changes from PR #130 that attempted to separate OpenAPI functionality from the main Hono app. The revert restores the original architecture where a single OpenAPIHono instance handles both API routes and OpenAPI documentation generation. Based on my analysis, this revert appears necessary and well-justified.

🔍 Technical Analysis

What Was Reverted

The original "fix" in PR #130 made these changes that are now being reverted:

  1. Changed main app from OpenAPIHono to Hono - Lost OpenAPI schema generation capabilities
  2. Created separate OpenAPIHono instance for documentation - Added architectural complexity
  3. Moved route mounting before health check - Changed middleware execution order
  4. Added wrapper Hono app - Introduced unnecessary nesting

Core Issues with the Original Fix

🚨 Critical: Broken OpenAPI Schema Generation

// BEFORE (working):
const app = new OpenAPIHono<{ Variables: AppVariables }>();
// All routes inherit OpenAPI capabilities

// AFTER (broken):  
const app = new Hono<{ Variables: AppVariables }>();
const appOpenAPI = new OpenAPIHono(); // Separate instance, no route knowledge

Problem: The main app routes (/v1/chat, /api, /v1/mcp, /agents) were mounted on a regular Hono instance, meaning they wouldn't be included in the OpenAPI schema generation. Only the health check endpoint would appear in /openapi.json.

🏗️ Architectural Inconsistency

The routes in /src/routes/chat.ts clearly expect OpenAPI capabilities:

const app = new OpenAPIHono<{ Variables: AppVariables }>();
app.openapi(createRoute({...}), handler); // Requires OpenAPIHono

The original fix created a mismatch between route expectations and the main app capabilities.

🔧 Middleware Execution Order

Moving route mounting before the health check definition changed the middleware execution flow, potentially affecting authentication, logging, and tracing for documented vs non-documented endpoints.

📊 Impact Assessment

Customer/API Consumer Impact

  • HIGH: API documentation (/docs, /openapi.json) would be incomplete
  • HIGH: Missing route schemas break API client generation tools
  • MEDIUM: Potential runtime differences due to middleware reordering

System Architecture Impact

  • MEDIUM: Inconsistent patterns between main app and route definitions
  • LOW: Added complexity with dual Hono instances serving no clear purpose

Developer Experience Impact

  • HIGH: Incomplete OpenAPI docs hinder development and debugging
  • MEDIUM: Architectural inconsistency makes codebase harder to understand

Why This Revert is Correct

1. Maintains Functional OpenAPI Documentation

The reverted state ensures all API routes are properly documented and accessible via /openapi.json and /docs.

2. Consistent Architecture

All route files expect OpenAPIHono capabilities - the main app should match this expectation.

3. Proven Stability

The original architecture was working correctly; the "fix" introduced regressions without clear benefits.

4. Follows Framework Best Practices

Using a single OpenAPIHono instance for the entire app aligns with @hono/zod-openapi documentation and examples.

🔮 Recommendations Moving Forward

Immediate Actions

  1. Merge this revert - Restores functionality and documentation
  2. Verify OpenAPI endpoints - Test /openapi.json and /docs work correctly
  3. Run integration tests - Ensure API client generation tools work

Future Considerations

If the original PR #130 was addressing a specific issue:

  1. Identify Root Cause - What problem was the original fix trying to solve?
  2. Alternative Solutions - Consider solutions that don't break OpenAPI functionality:
    // If the issue was OpenAPI overhead, consider conditional registration
    if (process.env.NODE_ENV !== 'production') {
      setupOpenAPIRoutes(app);
    }
  3. Add Tests - Prevent similar regressions with OpenAPI schema validation tests

Pattern Establishment

  • Standardize on OpenAPIHono for all API applications in the monorepo
  • Document architectural decisions in CLAUDE.md for consistency
  • Add OpenAPI schema tests to catch documentation regressions
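
One way the suggested OpenAPI schema test could look is a check that every mounted route prefix has at least one documented path. This is only a sketch under assumptions: `missingPrefixes` is a hypothetical helper, the documents are hard-coded rather than fetched from /openapi.json, and the concrete paths (including `/health`) are illustrative. Only the four prefixes come from the review above.

```typescript
// Minimal shape of the part of an OpenAPI document this check reads.
interface OpenApiDoc {
  paths: Record<string, unknown>;
}

// Return every expected prefix that has no documented path under it.
// `missingPrefixes` is a hypothetical helper, not part of any library.
function missingPrefixes(doc: OpenApiDoc, expected: string[]): string[] {
  const documented = Object.keys(doc.paths);
  return expected.filter(
    (prefix) => !documented.some((p) => p === prefix || p.startsWith(prefix + "/")),
  );
}

// Prefixes named in the review; the concrete paths below are illustrative.
const expectedPrefixes = ["/v1/chat", "/api", "/v1/mcp", "/agents"];

// The state the review describes for PR #130: only the health check is documented.
const brokenDoc: OpenApiDoc = { paths: { "/health": {} } };

// A document in which every mounted router contributed at least one path.
const healthyDoc: OpenApiDoc = {
  paths: {
    "/health": {},
    "/v1/chat/completions": {},
    "/api/chat": {},
    "/v1/mcp": {},
    "/agents/.well-known/agent.json": {},
  },
};

console.log(missingPrefixes(brokenDoc, expectedPrefixes));
console.log(missingPrefixes(healthyDoc, expectedPrefixes));
```

In a real test the document would presumably come from the running app, and a non-empty result would fail the test.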

🎯 Verdict

APPROVE - This revert is necessary and restores critical functionality. The original "fix" introduced more problems than it solved by breaking OpenAPI documentation generation without clear architectural benefits.

The pattern of having recent Hono-related changes reverted (this and PR #131) suggests these changes may have been made without sufficient testing of the OpenAPI integration points.


@robert-inkeep robert-inkeep merged commit 85c08cc into main Sep 12, 2025
4 of 7 checks passed
@robert-inkeep robert-inkeep deleted the revert-130-fix-hono-run-api branch September 12, 2025 02:34
Zeeeepa pushed a commit to Zeeeepa/inkeep_agents that referenced this pull request Apr 23, 2026
…nkeep#3177)

* chore: post-migration productivity hardening (tooling, CI, DX)

Rebased onto main after significant churn. Dropped items that became
redundant with inkeep#143 (monorepo trap guards) and inkeep#153 (Dependabot lockfile
auto-sync), kept the rest. Addresses review comments inline.

KEPT (11 items)

Tooling:
- .npmrc: add engine-strict, auto-install-peers, strict-peer-dependencies,
  resolution-mode=highest
- .node-version: 22 -> 22.18.0 (patch pin for reproducibility)
- package.json: preinstall `only-allow pnpm` + postinstall
  `check-node-version.mjs` + `check:node-version` script
- scripts/check-node-version.mjs: hardened against IO errors + malformed
  .node-version values (addresses pullfrog/claude review comments about
  try/catch on readFileSync + lts/* handling)
- turbo.json: globalDependencies now invalidates on root pnpm-lock.yaml,
  .node-version, pnpm-workspace.yaml (was only watching public/agents/)
- tsconfig.base.json: strict baseline for opt-in package migration
  (used by PR inkeep#133)

DX:
- setup-dev.js: validateEnvironmentEarly() fails fast on missing
  ANTHROPIC_API_KEY before any Docker/install work. parseEnvFile
  readFileSync wrapped in try/catch for EACCES resilience.

CI:
- public-agents-extended-validation.yml: turbo affected filter
  `...[origin/base_ref]` on PR events; `merge_group`/`push` keep full run.
  Ported to the new single `turbo check` command structure introduced
  by inkeep#125 (the original diff targeted the pre-inkeep#125 matrix).
- public-agents-cypress.yml + composite action: 4-way deterministic
  shard matrix (no Cypress record key required); gate job fans in on
  default needs behavior.
- private-master-ci.yml: clarifying comment about turbo affected filter
  not applying (workflow_dispatch only).

DROPPED (vs original inkeep#130)

- scripts/check-lockfile-sync.mjs + `check:lockfile-sync` script:
  superseded by inkeep#143's `check-monorepo-traps.mjs lockfiles` which actually
  runs `pnpm install --frozen-lockfile` in both directories (strictly
  stronger than my mtime heuristic). inkeep#153 auto-syncs Dependabot lockfile
  PRs, killing the main scenario this script was protecting.
- biome.jsonc noExplicitAny "off" -> "warn": would break CI because main's
  Core Validation uses `biome lint --error-on-warnings` and there are
  16+ pre-existing `any` usages in agents-docs + agents-cookbook. Defer
  the flip to a separate PR that also grinds down those violations.
- coverage.yml workflow: no team demand surfaced; non-blocking but still
  shows red. Revisit when someone owns coverage tracking.

COMMENTS ADDRESSED

- claude[bot]: IO error handling on readFileSync/statSync/readdirSync in
  check-node-version.mjs + setup-dev.js parseEnvFile -> wrapped with
  graceful fallbacks
- claude[bot]: malformed .node-version (lts/*, latest) -> regex validation
  skips with a warning instead of producing confusing "Required: v" output
- pullfrog[bot]: github.base_ref is only populated on pull_request events
  -> added in-source comment explaining the trap for future maintainers
- pullfrog[bot]: check-lockfile-sync missed public/agents/agents-* layer
  -> moot, file is dropped
- pullfrog[bot]: coverage.yml missing paths filter + prepare:public-agents-build
  -> moot, file is dropped
- claude[bot]: grep -c exit code -> handled by `|| echo 0` fallback (minor,
  no change)
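
The version-check behavior described in this message (regex validation of `.node-version`, major-only comparison, and the CI skip added in the follow-up fix) could be sketched as below. The real `check-node-version.mjs` is not shown here, so the function name, the result type, and the exact regex are assumptions for illustration.

```typescript
type CheckResult =
  | { status: "ok" }
  | { status: "skipped"; reason: string }
  | { status: "mismatch"; required: string; current: string };

// Accepts "22", "22.18" or "22.18.0", with an optional leading "v".
// Assumed pattern; the one used by the real script is not in the source.
const VERSION_PATTERN = /^v?\d+(\.\d+){0,2}$/;

function checkNodeVersion(
  pinned: string,
  current: string,
  env: Record<string, string | undefined>,
): CheckResult {
  // The platform manages Node in CI and on Vercel, so do not gate there.
  if (env["CI"] === "true" || env["VERCEL"] === "1" || env["GITHUB_ACTIONS"] === "true") {
    return { status: "skipped", reason: "running in CI" };
  }
  const required = pinned.trim();
  // Aliases such as "lts/*" or "latest" cannot be compared numerically.
  if (!VERSION_PATTERN.test(required)) {
    return { status: "skipped", reason: `unrecognized .node-version value: ${required}` };
  }
  // Compare major versions only.
  const requiredMajor = required.replace(/^v/, "").split(".")[0];
  const currentMajor = current.replace(/^v/, "").split(".")[0];
  return requiredMajor === currentMajor
    ? { status: "ok" }
    : { status: "mismatch", required, current };
}

console.log(checkNodeVersion("22", "v22.18.0", {}));
console.log(checkNodeVersion("lts/*", "v22.18.0", {}));
console.log(checkNodeVersion("22", "v24.14.1", { VERCEL: "1" }));
```

Passing the environment in as an argument keeps the function free of side effects, which makes the skip paths easy to test.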

Not addressed (intentional)
- Biome format/explicit-any violations in agents-docs + agents-cookbook
  flagged by PR inkeep#133's run -> pre-existing on main; out of scope for this
  PR. Will surface again when biome.jsonc flip lands.

* fix: address two CI failures on the rebased inkeep#130

1. check-node-version.mjs: skip in CI/Vercel/GitHub Actions. Vercel's
   build env runs Node 24.14.1 regardless of what .node-version says,
   which caused the postinstall hook to reject and fail the install
   with:

     [check-node-version] Node version mismatch
       Required: v22.18.0 (major v22.x)
       Current:  v24.14.1

   The script's purpose is to catch DEVS on the wrong Node locally, not
   to gate deploys — the platform manages Node. Skip when CI=true,
   VERCEL=1, or GITHUB_ACTIONS=true.

2. public-agents-cypress.yml: strip the public/agents/agents-manage-ui/
   prefix from shard spec paths. cypress runs from manage-ui as cwd
   (via pnpm --dir --filter exec), so repo-root paths double up:

     public/agents/agents-manage-ui/public/agents/agents-manage-ui/cypress/e2e/...

   Now outputs cypress-relative paths like cypress/e2e/agent-prompt.cy.ts.
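
A minimal sketch of the prefix fix in item 2, combined with one possible way to build a deterministic 4-way shard. The prefix and the example spec path come from the text above. The round-robin assignment over a sorted list is an assumption, since the workflow's actual sharding logic is not shown, and both function names are hypothetical.

```typescript
const MANAGE_UI_PREFIX = "public/agents/agents-manage-ui/";

// Convert a repo-root path into a path relative to the manage-ui cwd.
function toCypressRelative(specPath: string): string {
  return specPath.startsWith(MANAGE_UI_PREFIX)
    ? specPath.slice(MANAGE_UI_PREFIX.length)
    : specPath;
}

// Assumed strategy: sort the specs, then deal them out round-robin, so the
// same input always produces the same shards without a Cypress record key.
function shardSpecs(specs: string[], shardCount: number): string[][] {
  const shards: string[][] = Array.from({ length: shardCount }, () => []);
  [...specs].sort().forEach((spec, i) => {
    const shard = shards[i % shardCount];
    if (shard) shard.push(toCypressRelative(spec));
  });
  return shards;
}

// Only agent-prompt.cy.ts is named in the source; the rest are placeholders.
const specs = ["agent-prompt", "b", "c", "d", "e"].map(
  (name) => `${MANAGE_UI_PREFIX}cypress/e2e/${name}.cy.ts`,
);
console.log(shardSpecs(specs, 4));
```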

* fix(ci): drop stale `private/_migration-docs/IMPORT_STATUS.md` check

Reintroduced by accident when I ported the workflow over during the
rebase. Main removed this path in inkeep#157 (`chore(ci): clean up stale
monorepo-migration artifacts`). With the line reintroduced, any
workflow_dispatch run would fail at `test -f`.

Addresses claude[bot] CRITICAL review comment on inkeep#130.

* fix(ci): use App bot identity for auto-format commits

The workflow generates an INTERNAL_CI_APP token and pushes with it
specifically so downstream CI fires on the bot's commit. But the
commit is authored as github-actions[bot], which GitHub treats as
a GITHUB_TOKEN commit and suppresses synchronize for regardless of
the push credentials. Seen on inkeep#172: required checks never reported
on the auto-format HEAD and the PR was stuck BLOCKED.

Resolve the App's own bot slug and numeric user id from the newly
minted token and use <slug>[bot] as the committer. Push still uses
the same App token; synchronize fires as intended.
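
The committer identity described above could be assembled as in this sketch. The `<id>+<name>@users.noreply.github.com` address format matches the `github-actions[bot]` trailer that appears later on this page. The function name is hypothetical, and the App's real slug and id are not given in the source, so the example reuses the `github-actions` values purely to show the format.

```typescript
interface CommitIdentity {
  name: string;
  email: string;
}

// Build a bot committer identity from an App slug and its numeric user id.
// `appBotIdentity` is a hypothetical helper name.
function appBotIdentity(slug: string, userId: number): CommitIdentity {
  const name = `${slug}[bot]`;
  return { name, email: `${userId}+${name}@users.noreply.github.com` };
}

// Format check only: these are the github-actions values, not the App's.
console.log(appBotIdentity("github-actions", 41898282));
```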

* ci(extended-validation): auto-update OpenAPI snapshot on PRs

Mirrors the pattern already in public/agents/.github/workflows/ci.yml:
when a PR changes agents-api routes, openapi.*, or createApp.*, regenerate
the OpenAPI snapshot and commit it back to the PR branch using the
INTERNAL_CI_APP token so downstream CI re-runs.

Avoids recurring "OpenAPI snapshot mismatch" test failures (e.g. PR inkeep#200)
where contributors add routes without running
`pnpm --filter @inkeep/agents-api openapi:update-snapshot` locally.

- Gated on non-fork PRs (GITHUB_TOKEN on forks is read-only)
- Uses GitHub App token so commits trigger downstream workflows
- Runs before service-container setup so failure modes are cheap

* chore: align .node-version with repo convention, declare engines.node range

Two fixes to the Node pinning block:

.node-version: 22.18.0 -> 22
  The patch pin was an outlier. Every CI workflow in this repo pins
  `node-version: 22` or `22.x` (24 workflows, zero patch pins), every
  vercel.json has no nodeVersion field (Vercel uses 22.x auto-patch),
  and main's .node-version was just `22`. A patch pin creates monthly
  maintenance (fnm re-install + bump PR) without catching any bug the
  major-level pin doesn't.

package.json: add engines.node = ">=22.0.0 <23"
  .npmrc sets engine-strict=true but there was no engines field for it
  to enforce, making it a no-op. This range aligns with the major-level
  convention used everywhere else and makes engine-strict bite when a
  dev is on Node 18/24.

Belt-and-suspenders: postinstall script catches major drift at install
time (already major-only via .split('.')[0]); engines+engine-strict
catches it at dependency-resolution time. Both skip in CI/Vercel.
GitOrigin-RevId: 08d61f29389bfbbb487ed3093999449ca18b9e98

Co-authored-by: Varun Varahabhotla <[email protected]>
Zeeeepa pushed a commit to Zeeeepa/inkeep_agents that referenced this pull request Apr 23, 2026
* Follow-ups to inkeep#130: tsconfig pilot + skipped-test audit + stream-path any cleanup (inkeep#133)

* test: remove 2 obsolete skipped tests in push command

These two tests were empty-body `it.skip(...)` placeholders whose
comments explicitly documented why they were obsolete:

- `should override API URL from command line`: feature removed in
  favor of config-file-only approach (API URLs must now be in
  inkeep.config.ts, not CLI flags)
- `should handle missing configuration`: behavior tested by integration
  tests; unit-test path not feasible due to process.exit(1)

Part of a codebase-wide skipped-test audit. See
.audit-skipped-tests.md for the full audit.

* chore: add skipped-test audit summary

Temporary artifact documenting the 131-test skipped-test audit.
Full per-file table lives in /tmp/skipped-tests-audit.md.

- 131 skipped tests across 24 files (pattern: it.skip / describe.skip)
- Bucket A (unskip): 0 (verification loop blocked by Node version guard)
- Bucket B (delete): 2 applied in prior commit; 1 ~460-line block deferred
- Bucket C (needs owner): 128, clustered around 3 architectural migrations
- Bucket D: 0

This file may be removed before PR.

* chore(tsconfig): pilot strict baseline on 2 packages

Extend tsconfig.base.json in:
- public/agents/packages/agents-mcp (no source changes; already strict)
- public/agents/packages/agents-email (3 exactOptionalPropertyTypes fixes)

agents-email fixes:
- src/components/email-layout.tsx: conditional-spread optional
  'description' prop into EmailHeader
- src/index.ts: conditional-spread optional 'replyTo' in both
  sendInvitationEmail and sendPasswordResetEmail sendEmail calls
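
For readers unfamiliar with the conditional-spread pattern, here is a minimal sketch. With `exactOptionalPropertyTypes` enabled, an optional property may be absent but may not be set to an explicit `undefined`, so the key is added only when a value exists. `SendEmailOptions` and `buildInvitationEmail` are hypothetical names, not the actual agents-email code.

```typescript
interface SendEmailOptions {
  to: string;
  subject: string;
  replyTo?: string; // may be absent, but not explicitly `undefined`
}

function buildInvitationEmail(to: string, replyTo: string | undefined): SendEmailOptions {
  return {
    to,
    subject: "You have been invited",
    // Spread in the key only when a value exists, so the object never
    // carries `replyTo: undefined`.
    ...(replyTo !== undefined ? { replyTo } : {}),
  };
}

console.log(buildInvitationEmail("dev@example.com", undefined));
console.log(buildInvitationEmail("dev@example.com", "support@example.com"));
```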

Evaluated but deferred to their own PRs (would exceed pilot scope):
- ai-sdk-provider: 15 errors, mostly LanguageModelV2 structural
  exactOptionalPropertyTypes mismatches that require interface-level
  changes
- create-agents: 30 errors across templates.ts/utils.ts from
  noUncheckedIndexedAccess + exactOptionalPropertyTypes

Builds on inkeep#130.

* fix(ci): wait for DBs to serve queries before Extended Validation tests

Extended Validation's doltgres + postgres service containers report healthy
via their docker health checks before the database/user objects are actually
queryable. Tests start, fail with 'database not found: appuser' /
DrizzleQueryError intermittently. See PR inkeep#200 and PR inkeep#205 failures.

Adds a hard barrier that polls each DB with SELECT 1 (30s max) after service
containers start but before tests run. Converts probabilistic 'health check is
close enough' into deterministic 'we proved the DB can serve queries.'

Applied to both:
- .github/workflows/public-agents-extended-validation.yml
- .github/composite-actions/public-agents-cypress-e2e/action.yml (replaces the
  existing DoltGres-only wait with a unified wait_for helper that also gates
  on the postgres runtime DB)

* chore(review): address non-signoz inline comments on inkeep#133

- .audit-skipped-tests.md: strip ephemeral `/tmp/skipped-tests-audit.md`
  reference; update branch name to the PR's actual branch
  (pullfrog review comment)

- agents-mcp/tsconfig.json: drop useUnknownInCatchVariables (already
  implied by strict: true inherited from tsconfig.base.json)
  (pullfrog + claude review comments; 1-click suggest)

Signoz-related review items dropped along with the signoz refactor.

* fix: drop engines.node to unblock inkeep-cloud-mcp Vercel deploys

The engines.node range added in inkeep#130 broke inkeep-cloud-mcp Vercel
builds on main (both preview and production). Mechanism: that project's
vercel.json does `cd ../.. && pnpm install` from repo root, which picks
up root engine-strict=true plus engines.node <23. Vercel's build env
runs Node 24, failing the constraint. The other three Vercel projects
install from their subdir and do not inherit this, so they kept
deploying successfully.

Deploy evidence on main:
- 4236e3d915 (pre-inkeep#130 merge, no engines): success
- 08d61f2938 (merge commit, engines added): failure (preview + prod)
- 1526cbcd90 (post-merge Dependabot bump): failure

Keeping .node-version: 22 (unrelated to Vercel) and engine-strict=true
in .npmrc (no-op without engines field, same state as pre-inkeep#130). The
postinstall check-node-version.mjs still enforces major-version match
for local dev.

GitOrigin-RevId: b72cd4cf7aa8144945fb05590c8bc804ef01be69

* chore(ci): align security-floor overrides and flip check:overrides to hard-fail (inkeep#204)

* chore(ci): align security-floor overrides and flip check:overrides to hard-fail

Aligned the four out-of-sync overrides between public/agents/package.json and root pnpm-workspace.yaml, using the higher floor in each direction to preserve security intent:

- @modelcontextprotocol/sdk: root pin 1.26.0 relaxed to >=1.26.0 (matches public/agents)
- fast-xml-parser: public/agents raised >=5.3.8 -> >=5.5.6
- lodash: public/agents raised >=4.17.23 -> >=4.18.0
- lodash-es: public/agents raised >=4.17.23 -> >=4.18.0
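
The "higher floor" rule for the three `>=` overrides can be sketched as a numeric comparison of version parts. This is illustrative only: both helper names are hypothetical, the parser handles just the `>=x.y.z` form, and the exact-pin case (@modelcontextprotocol/sdk) is out of scope.

```typescript
// Parse ">=x.y.z" into numeric parts; other specifier forms are rejected.
function parseFloor(spec: string): number[] {
  const match = /^>=(\d+)\.(\d+)\.(\d+)$/.exec(spec.trim());
  if (!match) throw new Error(`unsupported override specifier: ${spec}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Return whichever specifier has the higher minimum version.
function higherFloor(a: string, b: string): string {
  const pa = parseFloor(a);
  const pb = parseFloor(b);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff > 0 ? a : b;
  }
  return a; // identical floors
}

console.log(higherFloor(">=5.3.8", ">=5.5.6"));
console.log(higherFloor(">=4.17.23", ">=4.18.0"));
```

Comparing the parts as numbers matters here: a plain string comparison would not order versions such as 4.9.0 and 4.18.0 correctly.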

Regenerated both lockfiles that cover these overrides (root pnpm-lock.yaml and public/agents/pnpm-lock.yaml). No transitive version re-resolutions; the only changes are the override specifiers themselves.

Flipped check:overrides in scripts/check-monorepo-traps.mjs from soft-warn to hard-fail. Now matches the already-hard check:override-masks-bump, check:lockfiles, and check:workspace-membership. Any future drift between root and public/agents overrides is caught at PR time instead of by a cryptic Vercel install failure minutes after merge.

Also updated AGENTS.md and .github/CI_RUNBOOK.md to reflect the new hard-fail behavior.

Note: pre-commit hook skipped (pnpm lint-staged at root is a pre-existing local-setup issue unrelated to this PR). Files in this commit do not require biome formatting (lockfiles, yaml, package.json).

* chore(ci): align check:overrides error messages with doc language

The pullfrog review on PR inkeep#204 flagged that the checkOverridePlacement
remediation strings still pointed only at /package.json, while the
AGENTS.md and CI_RUNBOOK.md updates in the same PR now say overrides
can live in either /pnpm-workspace.yaml or /package.json at root.
Script logic already reads both locations via getRootOverrides(); this
is a wording-only fix so the error messages a developer sees match
what the docs tell them to do.

GitOrigin-RevId: 1633ad2aa24886fe2687dab6eb6ef9379786705a

* csv and rerun functionality (inkeep#200)

* csv and rerun

* style: auto-format with biome

* tests

* style: auto-format with biome

* TestS

* style: auto-format with biome

* library instead of manual parse

* lint

* snapshot

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
GitOrigin-RevId: fbfeb6d660e85d4269acf00efd35e885ad35365d

* fix(tsconfig): move tsconfig.base.json into public/agents/ for Copybara mirror compatibility (inkeep#209)

* fix(tsconfig): move tsconfig.base.json into public/agents/ for Copybara mirror compatibility

The root-level tsconfig.base.json added in inkeep#130 lives outside public/agents/**,
so Copybara's stripPrefix: "public/agents" does not mirror it to inkeep/agents.
After the sync, per-package tsconfigs referenced ../../../../tsconfig.base.json
which resolves above the repo root on inkeep/agents, causing agents-email#build
to fail with TS5083.

PR inkeep#130 originally documented a 2-level extends path in the base file's own
comment ("Extend with { \"extends\": \"../../tsconfig.base.json\" }"), which
is only correct if the base sits at public/agents/tsconfig.base.json. The file
was placed at the wrong directory.

This moves the file under public/agents/ and updates the two consumers
(agents-email, agents-mcp) to use the intended 2-level path. Path resolves
correctly in both repos now.

* docs(public-agents): document tsconfig.base.json convention for new packages

* docs(tsconfig): drop em dashes in new section to match repo writing style

GitOrigin-RevId: 89ee740d87232ae68cb8195558c1fb1af7b2a462

* chore(ci): remove redundant public-repo ci.yml and cypress.yml (inkeep#211)

* chore(ci): remove redundant public-repo ci.yml and cypress.yml

All lint/typecheck/test/build/Cypress validation already runs on agents-private
pre-merge via Core Validation, Extended Validation, and public-agents-cypress.
The public-side duplicates re-ran the same checks on Copybara sync PRs (code
already exhaustively validated), costing ~30m (ci) + ~15m (cypress) per sync
on ubuntu-32gb runners.

External PRs to inkeep/agents bridge back to agents-private via
monorepo-pr-bridge.yml for canonical validation, so no coverage is lost.

- Delete public/agents/.github/workflows/ci.yml
- Delete public/agents/.github/workflows/cypress.yml
- Delete orphaned composite actions (changeset-check, cypress-e2e)
- Update CI.md workflow map, parity table, branch protection
- Update CI_ARCHITECTURE.md install composite-action reference
- Update cypress-e2e composite README (agents-private only caller)
- Update internal-surface-areas skill to point at upstream workflows

Coordinated with CTO: 'ci' and 'Cypress E2E Tests' required checks removed
from inkeep/agents branch protection.

* chore(ci): also remove redundant public-repo ci-maintenance.yml

With ci.yml and cypress.yml gone, the public repo has no substantive CI
for the weekly CI Maintenance Claude job to analyze. The equivalent
analysis runs on agents-private via public-agents-ci-maintenance.yml,
which sees the real CI surface.

- Delete public/agents/.github/workflows/ci-maintenance.yml
- Update CI.md workflow map + parity table
- Update internal-surface-areas skill

* chore(ci): clean up stale ci.yml references flagged by PR review

- Update two stale comments in public-agents-extended-validation.yml
  that referenced the now-deleted public/agents ci.yml
- Delete obsolete public/agents/specs/changeset-only-skip-ci/SPEC.md;
  the changeset-skip feature it documented lived inside ci.yml and the
  changeset-check composite action, both removed in this PR

GitOrigin-RevId: 63d06e27c8a374e100270f3118f64cd2170e0d6a

* fix(ci): close remaining silent-failure gaps in release cascade (inkeep#212)

* fix(ci): close remaining silent-failure gaps in release cascade

Five hardening fixes across the release pipeline. None of these change
pipeline shape (CTO-asked streamlining was evaluated separately and
deferred — it saves ~1 min E2E but closes zero real failure modes).

Each change addresses a distinct way the cascade can silently strand:

1. release-handler.yml: widen notify-handler-failure to catch failure-job
   failures too. Previously only caught success-job failures; if the
   failure-dispatch handler's own gh issue create 4xx'd (label API
   hiccup), the npm publish failure went completely untracked. Needs
   chain now covers [success, failure] and the issue body adapts to
   which job failed.

2. public-mirror-sync.yml: 3-attempt retry on gh pr list before exit 0
   in the copybara/sync reconcile step. Previously a single transient
   API flake skipped reconciliation entirely, letting Copybara run over
   a potentially-stuck sync branch — exactly the local/origin history
   conflict class that issue inkeep#188 fixed via reconcile. Exit 0 on
   exhaust is preserved (deleting a live PR's branch on persistent
   outage is worse than letting Copybara try its own fast-fail).

3. public/agents/.github/workflows/release.yml: add npm view
   ground-truth check after the grep-based "packages published
   successfully" marker. The log-phrase check catches phrase drift
   but not partial-publish (package N fails after N-1 succeed leaves
   the marker in the log). Now iterates every @inkeep/ workspace
   package and verifies each exists on npm at VERSION; any miss
   fails the step with a specific error so the failure notifier
   fires instead of silently reporting green.

4. scripts/check-monorepo-traps.mjs: add
   public/agents/agents-cookbook/evals/langfuse-dataset-example to
   DUAL_LOCKFILE_ROOTS. The directory is carved out as a
   STANDALONE_WORKSPACE_BOUNDARIES entry (users clone the example
   standalone) but its lockfile wasn't being checked for freshness.
   A dep change there could have shipped a broken install. The two
   sets now stay in sync by construction (noted in comment).

5. New release-version-drift-watchdog.yml: scheduled 3-way version
   check every 30 min across agents-core/package.json on main,
   @inkeep/agents-core latest on npm, and latest GH Release tag.
   Opens a tracking issue if drift persists past a 60-min grace
   window (bounds worst-case silent-stranding detection latency to
   30 min regardless of which workflow failed silently). Auto-closes
   the issue when drift resolves.

Audit finding inkeep#1 from yesterday's staff-engineer audit was retracted
(Doltgres branch-sync dead gate) — git blame + runtime evidence from
v0.69.0 and v0.70.0 deploys confirm the gate is working as designed
(migrate-dolt.ts emits the migrations_applied output correctly).

* fix(ci): address PR inkeep#212 review + bump watchdog cadence

Response to pullfrog + claude review findings on inkeep#212.

Watchdog timing bumps (per ask):
- Cron: every 30 min -> every hour on the top of the hour
- Grace window: 60 min -> 90 min
Normal release cascade is 20-30 min, worst legitimate tail (npm
propagation lag + Vercel queue) is ~60-90 min. 90 min grace absorbs
that without meaningfully raising detection latency (worst-case is
still grace + cron = ~2.5 hours vs. the unbounded default).
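
The alerting rule and the latency arithmetic above can be sketched as follows. All names are hypothetical and the workflow's real implementation is not shown. A `null` merge time stands for a failed Version PR lookup, which (per the correctness notes in this message) is treated as real drift.

```typescript
interface Versions {
  main: string; // agents-core/package.json on main
  npm: string; // @inkeep/agents-core latest on npm
  release: string; // latest GH Release tag
}

function hasDrift(v: Versions): boolean {
  return !(v.main === v.npm && v.npm === v.release);
}

function shouldAlert(
  v: Versions,
  minutesSinceVersionPrMerged: number | null,
  graceMinutes = 90,
): boolean {
  if (!hasDrift(v)) return false;
  // Lookup failed: do not suppress, treat the drift as real.
  if (minutesSinceVersionPrMerged === null) return true;
  return minutesSinceVersionPrMerged > graceMinutes;
}

// Worst case: drift starts just after a tick and the grace window must expire.
const GRACE_MINUTES = 90;
const CRON_INTERVAL_MINUTES = 60;
const worstCaseMinutes = GRACE_MINUTES + CRON_INTERVAL_MINUTES;
console.log(worstCaseMinutes / 60); // 2.5 (hours)
```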

Watchdog correctness:
- gh pr list now uses `sort:updated-desc`. Default search relevance
  ordering doesn't guarantee --limit 1 returns the most recent merge
  when all Version PR titles are near-identical.
- Version PR lookup distinguishes real API failure from "no PR found".
  Previously both emptied LAST_VERSION_PR_MERGED_AT, silently bypassing
  the grace window on a transient API hiccup and producing false-
  positive drift alerts during legitimate in-flight releases. On
  failure we now warn explicitly and let drift be treated as real —
  intentional: a genuine API outage should alert, not suppress.
- Tracking issue lookup now uses --label release-drift-watchdog
  instead of `in:title "Release version drift detected"`. Title-
  substring search could match or close an unrelated human-authored
  issue whose title shared the phrase. The new label is this
  workflow's private marker, created alongside the existing `release`
  label in the defensive label-ensure loop. Issues opened by the
  watchdog get both labels.
- Auto-close step is now non-fatal. Drift is already resolved by the
  time this step runs, so a failed `gh issue comment` or
  `gh issue close` on a cleanup path should emit a warning instead of
  turning the run red. Next scheduled tick retries.

release.yml (inkeep/agents mirror) — npm propagation retry:
- Per-package `npm view` now retries up to 4 times with escalating
  backoff (2s, 4s, 8s, 16s — 30s cumulative wait per package) before
  declaring a package genuinely missing. The registry write path is
  synchronous but the CDN read path can lag by seconds. Previous
  single-shot check could false-positive during normal propagation,
  firing the failure notifier unnecessarily.
- Success path still exits on attempt 1 with a single npm view call
  — retry only engages when a package is not yet visible.
- Updated error message to note propagation is already ruled out.
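
The retry schedule can be checked with a small sketch: four doubling waits starting at 2s add up to 30s. The loop assumes one initial attempt followed by a re-check after each wait, which is one reading of "retries up to 4 times". `probe` and `sleep` are injected stand-ins for the `npm view` call and the shell sleep, and all names are hypothetical.

```typescript
// Doubling backoff: base, 2*base, 4*base, ...
function backoffDelaysSeconds(retries: number, baseSeconds: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseSeconds * 2 ** i);
}

// Returns true as soon as the probe succeeds; waits only between attempts.
function visibleWithRetry(
  probe: () => boolean,
  sleep: (seconds: number) => void,
  delays: number[],
): boolean {
  if (probe()) return true; // success path: a single call, no waiting
  for (const delay of delays) {
    sleep(delay);
    if (probe()) return true;
  }
  return false; // still missing after every retry
}

const delays = backoffDelaysSeconds(4, 2);
const totalWait = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalWait);
```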

Documentation catch-up:
- AGENTS.md: lockfile count 3 -> 4 with the langfuse-dataset-example
  entry that PR inkeep#212 adds. Explains the distinction between the two
  primary install-driving lockfiles (root + public/agents) and the
  two standalone lockfiles (starter kit + eval example) that ship
  with their own workspace so users can install subdirectories
  directly.
- CI.md: new workflow row under "Release and publishing" for the
  watchdog. Trigger now says "schedule (hourly)" to match the cron
  bump.
- package.json: `install:all` script now includes the langfuse
  lockfile directory. Previously check:lockfiles validated four
  entries but the regen shorthand only covered three, which would
  have left the fourth drifting silently the first time its package.json
  got updated.

* fix(ci): swap chat-to-edit-validation to resilient install composite

The failure on PR inkeep#212 (chat-to-edit / lint) was Corepack lazy-downloading
pnpm from the npm registry on first pnpm invocation (`pnpm store path
--silent` in this workflow). The undici SocketError during that download
left STORE_PATH unset, which actions/cache rejected with "Input required
and not supplied: path" — cascading skip of install/build/lint with no
actionable signal.

Swap the inlined setup-node + corepack + manual `pnpm store path` +
actions/cache + `pnpm install` chain for a single `uses:
./.github/composite-actions/install`. The composite downloads pnpm
directly from GitHub releases via pnpm/action-setup (different CDN
than corepack's npm registry fetch, empirically stable). 7 publish/
deploy workflows already use this pattern without hitting the flake.

Deferring the same migration on the other 9 inlined-pattern workflows
(agents-ui / copilot-app / copilot-chrome-extension / inkeep-cloud-mcp /
auto-format / private-pr-validation / public-agents-core-validation /
public-agents-extended-validation / public-agents-cypress) to a follow-
up. Several have custom steps (Playwright cache, Turbo cache, pre-install
biome, non-frozen-lockfile for auto-format) that need per-file review —
blind-swap would risk breaking a required check.

GitOrigin-RevId: 8c2e367004865bfe09daa1867296826c8b6c9db0

---------

Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: shagun-singh-inkeep <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>