keytar fix #188

Merged: amikofalvy merged 1 commit into main from keytar-fix on Sep 14, 2025

@omar-inkeep
Contributor

No description provided.

@vercel

vercel Bot commented Sep 13, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-manage-api | Ready | Preview | Comment | Sep 13, 2025 11:03pm |
| agents-manage-ui | Error | | | Sep 13, 2025 11:03pm |
| agents-run-api | Ready | Preview | Comment | Sep 13, 2025 11:03pm |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-docs | Skipped | | | Sep 13, 2025 11:03pm |

@claude
Contributor

claude Bot commented Sep 13, 2025

Claude finished @omar-inkeep's task.

Code Review Complete

As a TypeScript Staff Engineer and System Architect, I've completed my review of the keytar fix changes.

Todo List:

  • Read and analyze changed files to understand the keytar fix
  • Research existing keytar usage patterns in codebase
  • Evaluate changes for engineering quality and system design
  • Provide comprehensive review feedback
  • Update comment with final review

Review Findings

Summary

This PR addresses dependency management for keytar, a native Node.js module, across the monorepo. The changes ensure keytar is allowed to build in each environment where it is installed.

Changes Analysis

  1. agents-run-api/package.json: Added "keytar": "^7.9.0" dependency
  2. packages/create-agents/src/utils.ts: Added 'keytar' to onlyBuiltDependencies in pnpm workspace configuration (lines 345-347)
  3. pnpm-workspace.yaml: Already had onlyBuiltDependencies: keytar configuration
  4. pnpm-lock.yaml: Updated lockfile reflecting the new dependency

Architecture Context

The keytar library is used by the KeyChainStore implementation in /packages/agents-core/src/credential-stores/keychain-store.ts. This provides cross-platform system keychain access (macOS Keychain, Windows Credential Vault, Linux Secret Service API) for secure credential storage.


Review Feedback

Positive Aspects

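  1. Correct Dependency Strategy: Using onlyBuiltDependencies is the appropriate approach for keytar. It is a native Node.js addon, and this pnpm setting allowlists the packages whose install scripts are permitted to run, which keytar needs in order to obtain its native binary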
  2. Graceful Fallback Pattern: The existing KeyChainStore implementation handles keytar availability gracefully with proper error handling
  3. Consistent Configuration: Changes align with existing pnpm workspace configuration patterns
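
The graceful fallback pattern in item 2 can be sketched roughly as follows. This is not the actual KeyChainStore code: `loadCredentialBackend`, `MemoryBackend`, and the injectable `importer` parameter are hypothetical names chosen for illustration, and only the `getPassword` and `setPassword` calls from keytar's API are assumed.

```typescript
// Minimal subset of keytar's promise-based API used by this sketch.
interface CredentialBackend {
  getPassword(service: string, account: string): Promise<string | null>;
  setPassword(service: string, account: string, password: string): Promise<void>;
}

// Hypothetical in-memory fallback for environments where keytar cannot load.
class MemoryBackend implements CredentialBackend {
  private readonly entries = new Map<string, string>();

  async getPassword(service: string, account: string): Promise<string | null> {
    return this.entries.get(`${service}:${account}`) ?? null;
  }

  async setPassword(service: string, account: string, password: string): Promise<void> {
    this.entries.set(`${service}:${account}`, password);
  }
}

// Default importer. The module name lives in a variable so this file still
// compiles and runs when keytar is not installed.
const importKeytar = async (): Promise<unknown> => {
  const moduleName = "keytar";
  return import(moduleName);
};

// Try the native module first; fall back to memory if loading fails.
async function loadCredentialBackend(
  importer: () => Promise<unknown> = importKeytar,
): Promise<{ backend: CredentialBackend; native: boolean }> {
  try {
    const mod = (await importer()) as Record<string, unknown>;
    const candidate = (mod.default ?? mod) as Partial<CredentialBackend>;
    if (
      typeof candidate.getPassword === "function" &&
      typeof candidate.setPassword === "function"
    ) {
      return { backend: candidate as CredentialBackend, native: true };
    }
  } catch {
    // Native binary missing or failed to build: use the fallback below.
  }
  return { backend: new MemoryBackend(), native: false };
}
```

Callers can branch on `native` to warn that credentials will not persist across restarts when the fallback is in use.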

⚠️ Areas for Consideration

Critical: Inconsistent Dependency Declaration

The most significant issue is inconsistent keytar dependency declarations across the monorepo:

  • /packages/agents-core/package.json: Already has "keytar": "^7.9.0"
  • /pnpm-workspace.yaml: Already has onlyBuiltDependencies: keytar
  • /agents-run-api/package.json: NOW has "keytar": "^7.9.0" (this PR)
  • /packages/create-agents/src/utils.ts: NOW has onlyBuiltDependencies: ['keytar'] (this PR)

Issue: Multiple packages declaring the same native dependency can lead to:

  • Version conflicts during resolution
  • Multiple compilation attempts
  • Inconsistent behavior across packages
  • Bundle size increases

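Recommendation: Since agents-run-api depends on @inkeep/agents-core (which already includes keytar), the keytar dependency should be removed from agents-run-api/package.json and handled via the core package dependency. This only holds if agents-run-api never imports keytar itself: under pnpm's default layout, transitive dependencies are not linked into a package's own node_modules.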

System Design Concern: Dependency Architecture

// Current (potentially problematic)
agents-run-api/package.json: keytar ^7.9.0
packages/agents-core/package.json: keytar ^7.9.0

// Better approach
packages/agents-core/package.json: keytar ^7.9.0
agents-run-api depends on @inkeep/agents-core -> inherits keytar

This follows the single source of truth principle for native dependencies.

Build Configuration Pattern

The onlyBuiltDependencies configuration in the create-agents template is correct, but consider:

  1. Documentation: Add comments explaining why keytar needs special build handling
  2. Error Handling: The template should include guidance for users who may encounter keytar build issues on different platforms

Security & Platform Considerations

  1. macOS Codesigning: For production use, consider documenting codesigning requirements for macOS permission prompts
  2. Platform Support: Document Linux dependencies (libsecret-1-dev) for complete platform coverage
  3. Runtime Detection: The existing graceful degradation pattern is excellent - maintain this approach

Recommendations

High Priority

  1. Remove duplicate dependency: Remove keytar from agents-run-api/package.json since it's already provided by agents-core
  2. Dependency audit: Verify no other packages are unnecessarily declaring keytar as a direct dependency
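
The dependency audit in item 2 could be scripted along these lines. This is a minimal sketch, assuming the manifests have already been read and parsed from disk. `findDirectDeclarations` is a hypothetical helper, and the sample manifests are illustrative: the `workspace:*` range and the `example-package` entry are assumptions, not taken from the repository.

```typescript
// Shape of the package.json fields this sketch reads.
interface Manifest {
  name: string;
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
  optionalDependencies?: Record<string, string>;
}

interface Declaration {
  packageName: string;
  field: "dependencies" | "devDependencies" | "optionalDependencies";
  range: string;
}

// List every workspace package that declares `dependency` directly.
function findDirectDeclarations(manifests: Manifest[], dependency: string): Declaration[] {
  const fields = ["dependencies", "devDependencies", "optionalDependencies"] as const;
  const found: Declaration[] = [];
  for (const manifest of manifests) {
    for (const field of fields) {
      const range = manifest[field]?.[dependency];
      if (range !== undefined) {
        found.push({ packageName: manifest.name, field, range });
      }
    }
  }
  return found;
}

// Illustrative manifests mirroring the two keytar declarations named in the review.
const manifests: Manifest[] = [
  { name: "@inkeep/agents-core", dependencies: { keytar: "^7.9.0" } },
  {
    name: "agents-run-api",
    dependencies: { keytar: "^7.9.0", "@inkeep/agents-core": "workspace:*" },
  },
  { name: "example-package", dependencies: {} },
];

for (const d of findDirectDeclarations(manifests, "keytar")) {
  console.log(`${d.packageName} declares keytar in ${d.field} as ${d.range}`);
}
```

Any package that appears in the output and does not import keytar itself is a candidate for removal.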

Medium Priority

  1. Documentation: Add inline comments in the create-agents template explaining keytar's special build requirements
  2. Platform guidance: Include platform-specific setup instructions in generated project documentation

Code Suggestion

// packages/create-agents/src/utils.ts (lines 345-347)
    pnpm: {
      // keytar requires native compilation - handle separately from other deps
      onlyBuiltDependencies: [
        'keytar'  
      ]
    },

Conclusion

The changes correctly address keytar dependency management, but there's a critical architectural issue with duplicate dependency declarations. The native dependency should be centralized in agents-core rather than declared in multiple packages. The existing graceful fallback patterns and build configurations are well-designed.

Severity: Medium - The PR works but creates technical debt and potential inconsistency issues.

@amikofalvy amikofalvy merged commit fe79b8a into main Sep 14, 2025
5 of 7 checks passed
@amikofalvy amikofalvy deleted the keytar-fix branch September 14, 2025 00:19
amikofalvy added a commit that referenced this pull request Mar 11, 2026
* fix(security): add minimatch override >=5.1.8

Fixes multiple ReDoS vulnerabilities (CVE-2026-26996, CVE-2026-27903,
CVE-2026-27904) in transitive minimatch dependency.

Closes dependabot alerts #188, #199, #200.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add lodash/lodash-es override >=4.17.23 — prototype pollution fix (#2643)

* fix(security): add lodash/lodash-es override >=4.17.23

Fixes prototype pollution in _.unset and _.omit (CVE-2025-13465)
in transitive lodash dependencies.

Closes dependabot alerts #120, #123.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add express-rate-limit override >=8.2.2 (#2644)

Fixes IPv4-mapped IPv6 rate limit bypass (CVE-2026-30827) in transitive
express-rate-limit dependency.

Closes dependabot alert #213.

Co-authored-by: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
github-merge-queue Bot pushed a commit that referenced this pull request Mar 11, 2026
* fix(security): add dompurify override >=3.3.2

Fixes XSS bypass vulnerability (CVE-2026-0540) in transitive dompurify
dependency by adding pnpm override.

Closes dependabot alerts #210, #211.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add fast-xml-parser override >=5.3.8

Fixes stack overflow with preserveOrder (CVE-2026-27942) in transitive
fast-xml-parser dependency.

Closes dependabot alert #205.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add serialize-javascript override >=7.0.3

Fixes RCE vulnerability via RegExp.flags and Date.prototype.toISOString()
in transitive serialize-javascript dependency (build-time only).

Closes dependabot alert #203.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add svgo override >=3.3.3

Fixes DoS via entity expansion in DOCTYPE (CVE-2026-29074) in transitive
svgo dependency (build-time only).

Closes dependabot alert #212.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add minimatch override >=5.1.8 — ReDoS fix (#2642)

* fix(security): add minimatch override >=5.1.8

Fixes multiple ReDoS vulnerabilities (CVE-2026-26996, CVE-2026-27903,
CVE-2026-27904) in transitive minimatch dependency.

Closes dependabot alerts #188, #199, #200.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add lodash/lodash-es override >=4.17.23 — prototype pollution fix (#2643)

* fix(security): add lodash/lodash-es override >=4.17.23

Fixes prototype pollution in _.unset and _.omit (CVE-2025-13465)
in transitive lodash dependencies.

Closes dependabot alerts #120, #123.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): add express-rate-limit override >=8.2.2 (#2644)

Fixes IPv4-mapped IPv6 rate limit bypass (CVE-2026-30827) in transitive
express-rate-limit dependency.

Closes dependabot alert #213.

Co-authored-by: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* fix(security): add security overrides to create-agents-template

Ensures self-hosted deployments using the template also get patched
transitive dependency versions.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(security): sync overrides between root and create-agents-template

Makes pnpm.overrides identical in both package.json files so the
monorepo and self-hosted template have the same security floor.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
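
The last commit above keeps pnpm.overrides identical between the root and create-agents-template package.json files. A drift check for that invariant might look like the sketch below. `compareOverrides` is a hypothetical helper, and the sample override maps are illustrative values rather than the repository's actual entries.

```typescript
type Overrides = Record<string, string>;

interface OverrideDrift {
  missingInTemplate: string[];
  missingInRoot: string[];
  mismatched: string[];
}

// Compare two pnpm.overrides maps and report every difference by package name.
function compareOverrides(root: Overrides, template: Overrides): OverrideDrift {
  const drift: OverrideDrift = { missingInTemplate: [], missingInRoot: [], mismatched: [] };
  for (const [name, range] of Object.entries(root)) {
    const other = template[name];
    if (other === undefined) {
      drift.missingInTemplate.push(name);
    } else if (other !== range) {
      drift.mismatched.push(name);
    }
  }
  for (const name of Object.keys(template)) {
    if (root[name] === undefined) {
      drift.missingInRoot.push(name);
    }
  }
  return drift;
}

// Illustrative inputs: one stale range and one entry missing on each side.
const rootOverrides: Overrides = {
  minimatch: ">=5.1.8",
  lodash: ">=4.17.23",
  "express-rate-limit": ">=8.2.2",
};
const templateOverrides: Overrides = {
  minimatch: ">=5.1.8",
  lodash: ">=4.17.21",
  dompurify: ">=3.3.2",
};

console.log(compareOverrides(rootOverrides, templateOverrides));
```

A CI step could fail when any of the three arrays is non-empty, which would keep the two files from drifting apart silently.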
inkeep-oss-sync Bot pushed a commit that referenced this pull request Apr 22, 2026
…ded (#194)

* fix(ci): always reset copybara/sync on every mirror run

Closes #188

Drop the "leave branch in place if open PR is < STALE_PR_HOURS" branch in
the mirror sync reconcile step. Letting Copybara "append" to an existing
copybara/sync was never safe: the Copybara config uses fetch=main, so
every run baselines off inkeep/agents main's last GitOrigin-RevId. When
a new push lands on agents-private main while a prior sync PR is still
open, Copybara rebuilds the older origin change from main's HEAD (new
SHA due to timestamps) and the non-force push to copybara/sync is
rejected as non-fast-forward. This is the failure mode that blew up
the release cascade in #188 (Version Packages #185 merged while #3166
was still open 9 minutes after being created).

Every mirror run now closes any open sync PR and deletes copybara/sync
before Copybara runs, so each run pushes a fresh history. The concurrency
group already serializes runs and every new run includes all accumulated
changes since the last imported revision, so no information is lost.
PR churn (one inkeep/agents sync PR per agents-private main push) is the
cost, and it is much cheaper than a stuck release cascade.

CI_RUNBOOK gets a new entry for this specific failure string so future
red runs route to the fix without a re-investigation.

* fix(ci): harden release cascade against silent strandings

Bundled on top of the copybara/sync reset in this PR so the whole
release path (mirror sync -> npm publish -> GH Release -> Vercel
prod deploy -> scheduler restart) can run end-to-end with no human
intervention. Each fix closes a distinct silent-stranding mode.

1. public-mirror-sync.yml Create-PR guard
   - Reconcile now always deletes copybara/sync before Copybara runs,
     which introduced a regression: when Copybara exits 4 (no changes
     to sync, e.g. workflow_dispatch with an idle main), the branch is
     gone and the next `gh pr create --head copybara/sync` would fail.
     Add an explicit branch-existence check; short-circuit cleanly.
   - Add explicit --state open to the gh pr list call. Defaults to open
     but being explicit prevents a future refactor from reintroducing
     the PR #184 bug class.
   - Replace the PR number extraction `grep -o '[0-9]*$'` on the PR URL
     with gh pr view --json number. gh's stdout format is not a contract.

2. private-agents-ui-version-packages.yml publish detection
   - Was parsing `Publishing "X" at "Y"` via grep/sed on the changesets
     log, which is the exact fragility PR #174 removed from public
     release.yml. If changesets v2 changes format, published=false is
     written despite a successful publish, the widget-release dispatch
     is skipped, and agents-docs changelog silently desyncs.
   - Use the stable "packages published successfully" presence marker
     and read the version from package.json (authoritative for a fixed
     release group).

3. public/agents/.github/workflows/release.yml catch-all + dispatch retry
   - `Notify agents-private (failure)` was gated on
     `steps.detect.outputs.has_changesets == 'false'`. If the workflow
     failed before the detect step ran (install, build, token gen),
     has_changesets is unset and the condition evaluated false -> no
     dispatch, no tracking issue on agents-private, red run sitting
     invisibly in the Actions tab. Drop the has_changesets gate.
   - Replace peter-evans/repository-dispatch with a bash retry loop
     (3 attempts, 5/10s backoff). The action has no built-in retry, so
     a transient 5xx or rate-limit during the post-publish dispatch
     loses the signal permanently: npm publishes, but no GH Release is
     created and no Vercel prod deploy fires. Retry + explicit error
     on exhausted attempts so the stranding is loud, not silent.

4. public-agents-vercel-production.yml concurrency + failure tracker
   - Add concurrency: vercel-production-deploy. DB migrations are not
     idempotent; two parallel deploys (e.g. a release published while a
     manual re-dispatch is in flight) would race on migrate-databases
     and leave schema in a half-applied state.
   - Add notify-on-failure job (mirrors the tracking-issue pattern
     from public-mirror-sync.yml). At this point npm has published,
     the GH Release exists, but prod runtime is stale. Needs to be
     loud: auto-open a "Vercel production deploy failing" issue so
     the half-shipped state is visible instead of buried in the
     Actions tab.

CI_RUNBOOK.md: reword the release/publish failure entries to match
the new retry/tracking behavior, and add a new entry covering the
post-publish deploy failure case.

Intentionally out of scope: the auto-format.yml + Dependabot
`pnpm install --frozen-lockfile` race. Not a release-cascade issue,
will go in a separate PR.

* docs(runbook): bold Historical marker for consistency

GitOrigin-RevId: 04ff8b544833e109b57f75ded3236730d7fb10eb
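
The dispatch retry described in item 3 of the commit above (3 attempts, 5/10s backoff, explicit error once attempts are exhausted) is implemented there as a bash loop. The same control flow is rendered here in TypeScript as a sketch: `retryWithBackoff` and the injectable `sleep` parameter are hypothetical names, and the operation being retried is left to the caller.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `operation` up to delaysMs.length + 1 times, waiting delaysMs[i]
// after the (i + 1)-th failure. Throws loudly once attempts are exhausted.
async function retryWithBackoff<T>(
  operation: (attempt: number) => Promise<T>,
  delaysMs: readonly number[] = [5000, 10000],
  sleep: Sleep = realSleep,
): Promise<T> {
  const maxAttempts = delaysMs.length + 1;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await sleep(delaysMs[attempt - 1] ?? 0);
      }
    }
  }
  throw new Error(`operation failed after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

Injecting `sleep` keeps the function testable without real delays; production callers would rely on the default.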
github-merge-queue Bot pushed a commit that referenced this pull request Apr 22, 2026
* Version Packages (agents) (#185)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
GitOrigin-RevId: 7263142a67ac9ce9c9873a68a5673bfb436dbc1c

* chore(copilot-app): remove redundant lockfile, install from monorepo root (#186)

* chore(copilot-app): remove redundant lockfile, install from monorepo root

copilot-app is a workspace member (pnpm-workspace.yaml line 18), so the root
lockfile already resolves its dependencies. The second lockfile only existed
because vercel.json used pnpm install --ignore-workspace --frozen-lockfile,
which severs workspace context and therefore needed a local lockfile.

Two install boundaries for the same app meant root pnpm.overrides did not
apply to the Vercel install, so CI and Vercel could silently resolve to
different dependency trees. PR #167's description originally said "Vercel
to install + build from the monorepo root via pnpm --filter copilot-app...",
but the committed vercel.json drifted to --ignore-workspace. This aligns the
implementation with the stated plan.

- Delete private/copilot-app/pnpm-lock.yaml
- Change private/copilot-app/vercel.json installCommand to install from the
  monorepo root with a workspace filter
- Drop the copilot-app entry from scripts/check-monorepo-traps.mjs and simplify
  the DUAL_LOCKFILE_ROOTS comment (every remaining entry is a true workspace
  boundary, so the ignoreWorkspace workaround is no longer needed for any of
  them)

* docs(private): update lockfile section after copilot-app cleanup

* chore: add install:all convenience script for dual-lockfile installs

* chore: include create-agents-template in install:all

* fix(copilot-app): drop redundant cd ../.. from vercel installCommand

* docs: point dual-lockfile guidance at pnpm install:all

This PR introduces the install:all script; update every doc that teaches
the old cd-and-install-twice pattern to reference the shorthand instead.

- AGENTS.md (root) Dual lockfiles section: replaces the two-step pnpm
  install invocation with a single install:all, and lists all three
  lockfile scopes (root, public/agents, public/agents/create-agents-template)
  so readers understand what the shorthand covers.
- CI_RUNBOOK ERR_PNPM_OUTDATED_LOCKFILE: same substitution plus the third
  lockfile in the git add line.
- public/agents/AGENTS.md pnpm-lock.yaml Resolution Strategy: adds a
  When changing dependencies callout pointing at install:all, so readers
  inside the public/agents subtree know they have a root shortcut for
  the whole-monorepo regeneration.

* chore(check-monorepo-traps): drop dead ignoreWorkspace flag

Every DUAL_LOCKFILE_ROOTS entry is now a true workspace boundary that
installs without --ignore-workspace. The flag had exactly one live
consumer (private/copilot-app) which this PR removes. Simplify the data
structure to an array of path strings and drop the now-unused flag
branches in the install command and regen hint.

Also: the regen hint gains a pointer at the install:all shorthand, since
that's the recommended path for a whole-monorepo resync.

* docs: comprehensive command cheatsheet + check:structural aggregate

The problem: every time a new shorthand is added (install:all, check:*)
it lands in code but stays invisible in docs. People default to the raw
cd-and-install form, which is how we drift. The cheatsheet is the fix
for the drift-by-ignorance path.

Changes:
- Adds check:structural to root package.json - one command for the full
  structural guard set (boundaries + monorepo-traps + release-groups
  validate). Complements the existing pre-push hook which only runs
  check:monorepo-traps.
- Rewrites AGENTS.md 'Command routing' section as 'Command cheatsheet'
  with a scenario-driven quick-lookup table at top, then grouped by
  intent: install/lockfiles, build+dev+lint+typecheck+test, structural
  guards, changesets+releases, mirror/Copybara, parity, database.
- Documents the suffix convention (:agents, :agents-ui, :chat-to-edit,
  :inkeep-cloud-mcp, :copilot, :ext; no suffix = fan-out) so people
  can guess commands instead of memorizing.
- Every command gets a one-line description of what it does and when
  to reach for it.

* fix(check-monorepo-traps): guard the create-agents-template lockfile too

Docs introduced in this PR call out three lockfiles (root, public/agents,
public/agents/create-agents-template) and point at install:all as the
shorthand that regenerates them. The check only validated two — the
starter-kit lockfile could drift silently and slip past the pre-push
hook, surfacing for end users later when they cloned the starter.

Add public/agents/create-agents-template to DUAL_LOCKFILE_ROOTS and
update the comment to reflect the actual install-boundary taxonomy
(monorepo / Copybara+Vercel / standalone starter). install:all and the
check now cover the same set.

* ci: gate publish on check:structural (defense-in-depth)

Required checks on the source PR already run check:structural, and both
version-packages workflows check out origin/main before doing anything.
In practice, publish always runs against a validated main state.

But 'in practice' isn't the same as 'structurally'. A workflow_dispatch
run against main, an admin bypass of branch protection, or a future
change that loosens merge requirements could let a misconfigured main
reach the publish step without re-validation. Today's agents-ui release
already surfaced one post-publish pipefail bug that shouldn't have been
possible if we trusted the pipeline - this gate is the same intuition
applied upstream.

Adds 'Validate structural invariants' step between Install and the
release machinery in both private-agents-ui-version-packages.yml and
public-agents-version-packages.yml. Runs pnpm check:structural, which
aggregates check:boundaries + check:monorepo-traps + release-groups:validate
(including the workspace-isolation guard introduced in #191). Fails
hard on any structural misconfig, refusing to publish.

Cost: ~30-60s per publish run. Cheaper than a bad release.
GitOrigin-RevId: 684d52e5ab7734f592479b61e972cdfe5fc3ae23

* fix(ci): harden release cascade so copybara + npm publish run unattended (#194)

* fix(ci): always reset copybara/sync on every mirror run

Closes #188

Drop the "leave branch in place if open PR is < STALE_PR_HOURS" branch in
the mirror sync reconcile step. Letting Copybara "append" to an existing
copybara/sync was never safe: the Copybara config uses fetch=main, so
every run baselines off inkeep/agents main's last GitOrigin-RevId. When
a new push lands on agents-private main while a prior sync PR is still
open, Copybara rebuilds the older origin change from main's HEAD (new
SHA due to timestamps) and the non-force push to copybara/sync is
rejected as non-fast-forward. This is the failure mode that blew up
the release cascade in #188 (Version Packages #185 merged while #3166
was still open 9 minutes after being created).

Every mirror run now closes any open sync PR and deletes copybara/sync
before Copybara runs, so each run pushes a fresh history. The concurrency
group already serializes runs and every new run includes all accumulated
changes since the last imported revision, so no information is lost.
PR churn (one inkeep/agents sync PR per agents-private main push) is the
cost, and it is much cheaper than a stuck release cascade.

CI_RUNBOOK gets a new entry for this specific failure string so future
red runs route to the fix without a re-investigation.

* fix(ci): harden release cascade against silent strandings

Bundled on top of the copybara/sync reset in this PR so the whole
release path (mirror sync -> npm publish -> GH Release -> Vercel
prod deploy -> scheduler restart) can run end-to-end with no human
intervention. Each fix closes a distinct silent-stranding mode.

1. public-mirror-sync.yml Create-PR guard
   - Reconcile now always deletes copybara/sync before Copybara runs,
     which introduced a regression: when Copybara exits 4 (no changes
     to sync, e.g. workflow_dispatch with an idle main), the branch is
     gone and the next `gh pr create --head copybara/sync` would fail.
     Add an explicit branch-existence check; short-circuit cleanly.
   - Add explicit --state open to the gh pr list call. Defaults to open
     but being explicit prevents a future refactor from reintroducing
     the PR #184 bug class.
   - Replace the PR number extraction `grep -o '[0-9]*$'` on the PR URL
     with gh pr view --json number. gh's stdout format is not a contract.

2. private-agents-ui-version-packages.yml publish detection
   - Was parsing `Publishing "X" at "Y"` via grep/sed on the changesets
     log, which is the exact fragility PR #174 removed from public
     release.yml. If changesets v2 changes format, published=false is
     written despite a successful publish, the widget-release dispatch
     is skipped, and agents-docs changelog silently desyncs.
   - Use the stable "packages published successfully" presence marker
     and read the version from package.json (authoritative for a fixed
     release group).

3. public/agents/.github/workflows/release.yml catch-all + dispatch retry
   - `Notify agents-private (failure)` was gated on
     `steps.detect.outputs.has_changesets == 'false'`. If the workflow
     failed before the detect step ran (install, build, token gen),
     has_changesets is unset and the condition evaluated false -> no
     dispatch, no tracking issue on agents-private, red run sitting
     invisibly in the Actions tab. Drop the has_changesets gate.
   - Replace peter-evans/repository-dispatch with a bash retry loop
     (3 attempts, 5/10s backoff). The action has no built-in retry, so
     a transient 5xx or rate-limit during the post-publish dispatch
     loses the signal permanently: npm publishes, but no GH Release is
     created and no Vercel prod deploy fires. Retry + explicit error
     on exhausted attempts so the stranding is loud, not silent.

4. public-agents-vercel-production.yml concurrency + failure tracker
   - Add concurrency: vercel-production-deploy. DB migrations are not
     idempotent; two parallel deploys (e.g. a release published while a
     manual re-dispatch is in flight) would race on migrate-databases
     and leave schema in a half-applied state.
   - Add notify-on-failure job (mirrors the tracking-issue pattern
     from public-mirror-sync.yml). At this point npm has published,
     the GH Release exists, but prod runtime is stale. Needs to be
     loud: auto-open a "Vercel production deploy failing" issue so
     the half-shipped state is visible instead of buried in the
     Actions tab.

CI_RUNBOOK.md: reword the release/publish failure entries to match
the new retry/tracking behavior, and add a new entry covering the
post-publish deploy failure case.

Intentionally out of scope: the auto-format.yml + Dependabot
`pnpm install --frozen-lockfile` race. Not a release-cascade issue,
will go in a separate PR.

* docs(runbook): bold Historical marker for consistency

GitOrigin-RevId: 04ff8b544833e109b57f75ded3236730d7fb10eb

---------

Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
inkeep-oss-sync Bot pushed a commit that referenced this pull request Apr 22, 2026
* fix(ci): close remaining silent-failure gaps in release cascade

Five hardening fixes across the release pipeline. None of these change
pipeline shape (CTO-asked streamlining was evaluated separately and
deferred — it saves ~1 min E2E but closes zero real failure modes).

Each change addresses a distinct way the cascade can silently strand:

1. release-handler.yml: widen notify-handler-failure to catch failure-job
   failures too. Previously only caught success-job failures; if the
   failure-dispatch handler's own gh issue create 4xx'd (label API
   hiccup), the npm publish failure went completely untracked. Needs
   chain now covers [success, failure] and the issue body adapts to
   which job failed.

2. public-mirror-sync.yml: 3-attempt retry on gh pr list before exit 0
   in the copybara/sync reconcile step. Previously a single transient
   API flake skipped reconciliation entirely, letting Copybara run over
   a potentially-stuck sync branch — exactly the local/origin history
   conflict class that issue #188 fixed via reconcile. Exit 0 on
   exhaust is preserved (deleting a live PR's branch on persistent
   outage is worse than letting Copybara try its own fast-fail).

3. public/agents/.github/workflows/release.yml: add npm view
   ground-truth check after the grep-based "packages published
   successfully" marker. The log-phrase check catches phrase drift
   but not partial-publish (package N fails after N-1 succeed leaves
   the marker in the log). Now iterates every @inkeep/ workspace
   package and verifies each exists on npm at VERSION; any miss
   fails the step with a specific error so the failure notifier
   fires instead of silently reporting green.

4. scripts/check-monorepo-traps.mjs: add
   public/agents/agents-cookbook/evals/langfuse-dataset-example to
   DUAL_LOCKFILE_ROOTS. The directory is carved out as a
   STANDALONE_WORKSPACE_BOUNDARIES entry (users clone the example
   standalone) but its lockfile wasn't being checked for freshness.
   A dep change there could have shipped a broken install. The two
   sets now stay in sync by construction (noted in comment).

5. New release-version-drift-watchdog.yml: scheduled 3-way version
   check every 30 min across agents-core/package.json on main,
   @inkeep/agents-core latest on npm, and latest GH Release tag.
   Opens a tracking issue if drift persists past a 60-min grace
   window (bounds worst-case silent-stranding detection latency to
   grace + one cron interval, about 90 min, regardless of which
   workflow failed silently). Auto-closes the issue when drift resolves.
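
A minimal sketch of the ground-truth idea in item 3, not the workflow's actual shell step: given the workspace package names and the target version, report every package that is not visible on the registry at that version. The `published` set is a hypothetical stand-in for the results of per-package `npm view` calls, and `@inkeep/example-pkg` and the version number are illustrative only.

```typescript
// Hypothetical helper: `published` holds "name@version" strings that the
// registry reported as existing (a stand-in for `npm view` results).
function findMissingPackages(
  workspacePackages: readonly string[],
  version: string,
  published: ReadonlySet<string>,
): string[] {
  // A partial publish leaves some packages absent even though the
  // "published successfully" phrase may already be in the log.
  return workspacePackages.filter((name) => !published.has(`${name}@${version}`));
}

const missing = findMissingPackages(
  ["@inkeep/agents-core", "@inkeep/example-pkg"],
  "1.2.3",
  new Set(["@inkeep/agents-core@1.2.3"]),
);
if (missing.length > 0) {
  // In CI this is where the step would fail with a specific error.
  console.log(`Not on npm at 1.2.3: ${missing.join(", ")}`);
}
```

Any non-empty result would fail the step, so the failure notifier fires instead of the run reporting green.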

Audit finding #1 from yesterday's staff-engineer audit was retracted
(Doltgres branch-sync dead gate) — git blame + runtime evidence from
v0.69.0 and v0.70.0 deploys confirm the gate is working as designed
(migrate-dolt.ts emits the migrations_applied output correctly).

* fix(ci): address PR #212 review + bump watchdog cadence

Response to pullfrog + claude review findings on #212.

Watchdog timing bumps (per ask):
- Cron: every 30 min -> every hour on the top of the hour
- Grace window: 60 min -> 90 min
Normal release cascade is 20-30 min; the worst legitimate tail (npm
propagation lag + Vercel queue) is ~60-90 min. A 90-min grace absorbs
that without meaningfully raising detection latency (worst case is
still grace + cron = ~2.5 hours vs. the unbounded default).
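
The timing arithmetic above can be sketched as follows. This is an illustration under assumed names (`WatchdogTiming`, `worstCaseDetectionMin`, `isWithinGrace`), not code from the workflow: worst-case detection latency is the grace window plus one cron interval, and drift is only tolerated while the last Version PR merge is younger than the grace window.

```typescript
// Hypothetical names; the values mirror the commit message (hourly cron, 90-min grace).
interface WatchdogTiming {
  cronIntervalMin: number;
  graceMin: number;
}

// Worst case: drift starts right after a tick, then must outlive the grace window.
function worstCaseDetectionMin(t: WatchdogTiming): number {
  return t.graceMin + t.cronIntervalMin;
}

// True while a release that merged at `mergedAt` is still allowed to be in flight.
function isWithinGrace(mergedAt: Date, now: Date, graceMin: number): boolean {
  const ageMin = (now.getTime() - mergedAt.getTime()) / 60_000;
  return ageMin <= graceMin;
}

const timing: WatchdogTiming = { cronIntervalMin: 60, graceMin: 90 };
console.log(worstCaseDetectionMin(timing)); // 150 minutes, i.e. 2.5 hours
```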

Watchdog correctness:
- gh pr list now uses `sort:updated-desc`. Default search relevance
  ordering doesn't guarantee --limit 1 returns the most recent merge
  when all Version PR titles are near-identical.
- Version PR lookup distinguishes real API failure from "no PR found".
  Previously both emptied LAST_VERSION_PR_MERGED_AT, silently bypassing
  the grace window on a transient API hiccup and producing false-
  positive drift alerts during legitimate in-flight releases. On
  failure we now warn explicitly and let drift be treated as real —
  intentional: a genuine API outage should alert, not suppress.
- Tracking issue lookup now uses --label release-drift-watchdog
  instead of `in:title "Release version drift detected"`. Title-
  substring search could match or close an unrelated human-authored
  issue whose title shared the phrase. The new label is this
  workflow's private marker, created alongside the existing `release`
  label in the defensive label-ensure loop. Issues opened by the
  watchdog get both labels.
- Auto-close step is now non-fatal. Drift is already resolved by the
  time this step runs, so a failed `gh issue comment` or
  `gh issue close` on a cleanup path should emit a warning instead of
  turning the run red. Next scheduled tick retries.
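
One way to keep "API failure" and "no PR found" from collapsing into the same empty value is a tagged result type. The sketch below is an assumption about shape, not the workflow's shell logic; `VersionPrLookup`, `WatchdogDecision`, and `decide` are hypothetical names, and treating "no PR found" as an immediate alert is inferred from the description that an empty merge time bypasses the grace window.

```typescript
// Three distinct outcomes of the Version PR lookup (hypothetical type).
type VersionPrLookup =
  | { kind: "found"; mergedAt: Date }
  | { kind: "none" }
  | { kind: "api-error"; message: string };

interface WatchdogDecision {
  action: "ok" | "wait" | "alert";
  warning?: string;
}

function decide(
  driftDetected: boolean,
  lookup: VersionPrLookup,
  now: Date,
  graceMs: number,
): WatchdogDecision {
  if (!driftDetected) return { action: "ok" };
  if (lookup.kind === "found") {
    // A recent merge means a release may legitimately be in flight.
    const ageMs = now.getTime() - lookup.mergedAt.getTime();
    return { action: ageMs <= graceMs ? "wait" : "alert" };
  }
  if (lookup.kind === "api-error") {
    // Drift is treated as real, but the failure is surfaced explicitly.
    return { action: "alert", warning: `Version PR lookup failed: ${lookup.message}` };
  }
  // kind === "none": nothing in flight to wait for.
  return { action: "alert" };
}
```

The point of the separate `api-error` case is that the alert still fires, but with a warning attached rather than silently.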

release.yml (inkeep/agents mirror) — npm propagation retry:
- Per-package `npm view` now retries up to 4 times with escalating
  backoff (2s, 4s, 8s, 16s — 30s cumulative wait per package) before
  declaring a package genuinely missing. The registry write path is
  synchronous but the CDN read path can lag by seconds. Previous
  single-shot check could false-positive during normal propagation,
  firing the failure notifier unnecessarily.
- Success path still exits on attempt 1 with a single npm view call
  — retry only engages when a package is not yet visible.
- Updated error message to note propagation is already ruled out.
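
The retry behaviour described above, sketched with injected dependencies so it can run without a registry. `isVisible` stands in for a single `npm view` call and `sleep` for the wait between attempts; both names are hypothetical. The sketch assumes one initial check followed by one re-check after each delay, which is one reading of "retries up to 4 times".

```typescript
// Doubling delays: backoffDelaysMs(4, 2000) yields 2s, 4s, 8s, 16s (30s total).
function backoffDelaysMs(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// Resolves true as soon as the package is visible, false once all delays are used up.
async function waitUntilVisible(
  isVisible: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
  delaysMs: readonly number[],
): Promise<boolean> {
  // Success path: a single check, no waiting.
  if (await isVisible()) return true;
  for (const delay of delaysMs) {
    await sleep(delay);
    if (await isVisible()) return true;
  }
  // Propagation lag has been ruled out; the package is treated as missing.
  return false;
}
```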

Documentation catch-up:
- AGENTS.md: lockfile count 3 -> 4 with the langfuse-dataset-example
  entry that PR #212 adds. Explains the distinction between the two
  primary install-driving lockfiles (root + public/agents) and the
  two standalone lockfiles (starter kit + eval example) that ship
  with their own workspace so users can install subdirectories
  directly.
- CI.md: new workflow row under "Release and publishing" for the
  watchdog. Trigger now says "schedule (hourly)" to match the cron
  bump.
- package.json: `install:all` script now includes the langfuse
  lockfile directory. Previously check:lockfiles validated four
  entries but the regen shorthand only covered three, which would
  have left the fourth drifting silently the first time its package.json
  got updated.

* fix(ci): swap chat-to-edit-validation to resilient install composite

The failure on PR #212 (chat-to-edit / lint) was Corepack lazy-downloading
pnpm from the npm registry on first pnpm invocation (`pnpm store path
--silent` in this workflow). The undici SocketError during that download
left STORE_PATH unset, which actions/cache rejected with "Input required
and not supplied: path" — cascading skip of install/build/lint with no
actionable signal.

Swap the inlined setup-node + corepack + manual `pnpm store path` +
actions/cache + `pnpm install` chain for a single `uses:
./.github/composite-actions/install`. The composite downloads pnpm
directly from GitHub releases via pnpm/action-setup (different CDN
than corepack's npm registry fetch, empirically stable). 7 publish/
deploy workflows already use this pattern without hitting the flake.

Deferring the same migration on the other 9 inlined-pattern workflows
(agents-ui / copilot-app / copilot-chrome-extension / inkeep-cloud-mcp /
auto-format / private-pr-validation / public-agents-core-validation /
public-agents-extended-validation / public-agents-cypress) to a follow-
up. Several have custom steps (Playwright cache, Turbo cache, pre-install
biome, non-frozen-lockfile for auto-format) that need per-file review —
blind-swap would risk breaking a required check.

GitOrigin-RevId: 8c2e367004865bfe09daa1867296826c8b6c9db0
Zeeeepa pushed a commit to Zeeeepa/inkeep_agents that referenced this pull request Apr 23, 2026
* Follow-ups to inkeep#130: tsconfig pilot + skipped-test audit + stream-path any cleanup (inkeep#133)

* test: remove 2 obsolete skipped tests in push command

These two tests were empty-body `it.skip(...)` placeholders whose
comments explicitly documented why they were obsolete:

- `should override API URL from command line`: feature removed in
  favor of config-file-only approach (API URLs must now be in
  inkeep.config.ts, not CLI flags)
- `should handle missing configuration`: behavior tested by integration
  tests; unit-test path not feasible due to process.exit(1)

Part of a codebase-wide skipped-test audit. See
.audit-skipped-tests.md for the full audit.

* chore: add skipped-test audit summary

Temporary artifact documenting the 131-test skipped-test audit.
Full per-file table lives in /tmp/skipped-tests-audit.md.

- 131 skipped tests across 24 files (pattern: it.skip / describe.skip)
- Bucket A (unskip): 0 (verification loop blocked by Node version guard)
- Bucket B (delete): 2 applied in prior commit; 1 ~460-line block deferred
- Bucket C (needs owner): 128, clustered around 3 architectural migrations
- Bucket D: 0

This file may be removed before PR.

* chore(tsconfig): pilot strict baseline on 2 packages

Extend tsconfig.base.json in:
- public/agents/packages/agents-mcp (no source changes; already strict)
- public/agents/packages/agents-email (3 exactOptionalPropertyTypes fixes)

agents-email fixes:
- src/components/email-layout.tsx: conditional-spread optional
  'description' prop into EmailHeader
- src/index.ts: conditional-spread optional 'replyTo' in both
  sendInvitationEmail and sendPasswordResetEmail sendEmail calls
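
For readers unfamiliar with the conditional-spread pattern, here is a minimal sketch. The `EmailHeaderProps` shape and `buildHeaderProps` helper are hypothetical and only illustrate the technique; they are not copied from agents-email. Under `exactOptionalPropertyTypes`, an optional property may be absent but may not be set to `undefined` explicitly, so the key is only added when a value exists.

```typescript
// Hypothetical props type; `description` may be omitted but not set to undefined.
interface EmailHeaderProps {
  title: string;
  description?: string;
}

function buildHeaderProps(title: string, description: string | undefined): EmailHeaderProps {
  // Writing `{ title, description }` here would be rejected under
  // exactOptionalPropertyTypes, because `description` can be undefined.
  return {
    title,
    // The spread contributes the key only when a value is present.
    ...(description !== undefined ? { description } : {}),
  };
}

console.log(Object.keys(buildHeaderProps("Welcome", undefined))); // [ 'title' ]
```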

Evaluated but deferred to their own PRs (would exceed pilot scope):
- ai-sdk-provider: 15 errors, mostly LanguageModelV2 structural
  exactOptionalPropertyTypes mismatches that require interface-level
  changes
- create-agents: 30 errors across templates.ts/utils.ts from
  noUncheckedIndexedAccess + exactOptionalPropertyTypes

Builds on inkeep#130.

* fix(ci): wait for DBs to serve queries before Extended Validation tests

Extended Validation's doltgres + postgres service containers report healthy
via their docker health checks before the database/user objects are actually
queryable. Tests then start and intermittently fail with
'database not found: appuser' or DrizzleQueryError. See PR inkeep#200 and
PR inkeep#205 failures.

Adds a hard barrier that polls each DB with SELECT 1 (30s max) after service
containers start but before tests run. Converts probabilistic 'health check is
close enough' into deterministic 'we proved the DB can serve queries.'

Applied to both:
- .github/workflows/public-agents-extended-validation.yml
- .github/composite-actions/public-agents-cypress-e2e/action.yml (replaces the
  existing DoltGres-only wait with a unified wait_for helper that also gates
  on the postgres runtime DB)
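
The barrier amounts to polling until a trivial query succeeds or a deadline passes. The real helper is a shell `wait_for` in the workflow; the TypeScript below is only a sketch of the same idea, with `probe` standing in for running `SELECT 1` and with `now` and `sleep` injected (all hypothetical names) so the timing can be exercised without a database.

```typescript
interface WaitOptions {
  timeoutMs: number;              // e.g. 30_000 for the 30s cap
  intervalMs: number;             // pause between probes
  now: () => number;              // clock, injectable for tests
  sleep: (ms: number) => Promise<void>;
}

// Resolves with the number of probes used; rejects if the deadline would be exceeded.
async function waitFor(
  name: string,
  probe: () => Promise<void>,     // should reject while the DB cannot serve queries
  opts: WaitOptions,
): Promise<number> {
  const deadline = opts.now() + opts.timeoutMs;
  let attempts = 0;
  for (;;) {
    attempts += 1;
    try {
      await probe();
      return attempts;            // the DB answered a real query
    } catch {
      if (opts.now() + opts.intervalMs > deadline) {
        throw new Error(`${name} did not serve queries within ${opts.timeoutMs} ms`);
      }
      await opts.sleep(opts.intervalMs);
    }
  }
}
```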

* chore(review): address non-signoz inline comments on inkeep#133

- .audit-skipped-tests.md: strip ephemeral `/tmp/skipped-tests-audit.md`
  reference; update branch name to the PR's actual branch
  (pullfrog review comment)

- agents-mcp/tsconfig.json: drop useUnknownInCatchVariables (already
  implied by strict: true inherited from tsconfig.base.json)
  (pullfrog + claude review comments; 1-click suggest)

Signoz-related review items dropped along with the signoz refactor.

* fix: drop engines.node to unblock inkeep-cloud-mcp Vercel deploys

The engines.node range added in inkeep#130 broke inkeep-cloud-mcp Vercel
builds on main (both preview and production). Mechanism: that project's
vercel.json does `cd ../.. && pnpm install` from repo root, which picks
up root engine-strict=true plus engines.node <23. Vercel's build env
runs Node 24, failing the constraint. The other three Vercel projects
install from their subdir and do not inherit this, so they kept
deploying successfully.

Deploy evidence on main:
- 4236e3d915 (pre-inkeep#130 merge, no engines): success
- 08d61f2938 (merge commit, engines added): failure (preview + prod)
- 1526cbcd90 (post-merge Dependabot bump): failure

Keeping .node-version: 22 (unrelated to Vercel) and engine-strict=true
in .npmrc (no-op without engines field, same state as pre-inkeep#130). The
postinstall check-node-version.mjs still enforces major-version match
for local dev.
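
As an illustration of what a major-version match means here (the contents of check-node-version.mjs are not shown in this message, so this is an assumed shape with hypothetical function names): compare the major number pinned in `.node-version` with the major of the running Node version string.

```typescript
// Extracts the major number from strings like "22", "v22.11.0", or "24.1.0".
function majorOf(version: string): number {
  const match = /^v?(\d+)/.exec(version.trim());
  if (match === null) {
    throw new Error(`Unrecognized Node version: ${version}`);
  }
  return Number(match[1]);
}

// True when the running Node major equals the pinned major.
function nodeMajorMatches(pinned: string, running: string): boolean {
  return majorOf(pinned) === majorOf(running);
}

console.log(nodeMajorMatches("22", "v24.0.0")); // false: Node 24 vs. a pin of 22
```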

GitOrigin-RevId: b72cd4cf7aa8144945fb05590c8bc804ef01be69

* chore(ci): align security-floor overrides and flip check:overrides to hard-fail (inkeep#204)

* chore(ci): align security-floor overrides and flip check:overrides to hard-fail

Aligned the four out-of-sync overrides between public/agents/package.json and root pnpm-workspace.yaml, using the higher floor in each direction to preserve security intent:

- @modelcontextprotocol/sdk: root pin 1.26.0 relaxed to >=1.26.0 (matches public/agents)
- fast-xml-parser: public/agents raised >=5.3.8 -> >=5.5.6
- lodash: public/agents raised >=4.17.23 -> >=4.18.0
- lodash-es: public/agents raised >=4.17.23 -> >=4.18.0

Regenerated both lockfiles that cover these overrides (root pnpm-lock.yaml and public/agents/pnpm-lock.yaml). No transitive version re-resolutions; the only changes are the override specifiers themselves.

Flipped check:overrides in scripts/check-monorepo-traps.mjs from soft-warn to hard-fail. Now matches the already-hard check:override-masks-bump, check:lockfiles, and check:workspace-membership. Any future drift between root and public/agents overrides is caught at PR time instead of by a cryptic Vercel install failure minutes after merge.
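
A rough sketch of the kind of drift this check guards against. The real logic lives in scripts/check-monorepo-traps.mjs and is not reproduced here; `findOverrideDrift` is a hypothetical helper that assumes a package overridden in both places must carry the identical specifier string.

```typescript
type Overrides = Readonly<Record<string, string>>;

// Reports packages overridden in both maps whose specifiers differ.
function findOverrideDrift(root: Overrides, publicAgents: Overrides): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(publicAgents)) {
    const rootSpec = root[name];
    const nestedSpec = publicAgents[name];
    if (rootSpec !== undefined && rootSpec !== nestedSpec) {
      problems.push(`${name}: root "${rootSpec}" vs public/agents "${nestedSpec}"`);
    }
  }
  return problems;
}

// Pre-alignment state of two of the overrides listed above.
const drift = findOverrideDrift(
  { "@modelcontextprotocol/sdk": "1.26.0", lodash: ">=4.18.0" },
  { "@modelcontextprotocol/sdk": ">=1.26.0", lodash: ">=4.17.23" },
);
console.log(drift.length); // 2
```

With hard-fail semantics, a non-empty result would fail the PR check rather than print a warning.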

Also updated AGENTS.md and .github/CI_RUNBOOK.md to reflect the new hard-fail behavior.

Note: pre-commit hook skipped (pnpm lint-staged at root is a pre-existing local-setup issue unrelated to this PR). Files in this commit do not require biome formatting (lockfiles, yaml, package.json).

* chore(ci): align check:overrides error messages with doc language

The pullfrog review on PR inkeep#204 flagged that the checkOverridePlacement
remediation strings still pointed only at /package.json, while the
AGENTS.md and CI_RUNBOOK.md updates in the same PR now say overrides
can live in either /pnpm-workspace.yaml or /package.json at root.
Script logic already reads both locations via getRootOverrides(); this
is a wording-only fix so the error messages a developer sees match
what the docs tell them to do.

GitOrigin-RevId: 1633ad2aa24886fe2687dab6eb6ef9379786705a

* csv and rerun functionality (inkeep#200)

* csv and rerun

* style: auto-format with biome

* tests

* style: auto-format with biome

* TestS

* style: auto-format with biome

* library instead of manual parse

* lint

* snapshot

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
GitOrigin-RevId: fbfeb6d660e85d4269acf00efd35e885ad35365d

* fix(tsconfig): move tsconfig.base.json into public/agents/ for Copybara mirror compatibility (inkeep#209)

* fix(tsconfig): move tsconfig.base.json into public/agents/ for Copybara mirror compatibility

The root-level tsconfig.base.json added in inkeep#130 lives outside public/agents/**,
so Copybara's stripPrefix: "public/agents" does not mirror it to inkeep/agents.
After the sync, per-package tsconfigs referenced ../../../../tsconfig.base.json
which resolves above the repo root on inkeep/agents, causing agents-email#build
to fail with TS5083.

PR inkeep#130 originally documented a 2-level extends path in the base file's own
comment ("Extend with { \"extends\": \"../../tsconfig.base.json\" }"), which
is only correct if the base sits at public/agents/tsconfig.base.json. The file
was placed at the wrong directory.

This moves the file under public/agents/ and updates the two consumers
(agents-email, agents-mcp) to use the intended 2-level path. Path resolves
correctly in both repos now.
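
The path arithmetic can be checked with a small resolver. `resolveInRepo` is a hypothetical helper written for this illustration, not part of the repo: it resolves a relative `extends` path against a repo-relative directory and returns `null` when the path climbs above the repository root, which is the situation on the mirror with the 4-level path.

```typescript
// Resolves `relative` against `fromDir` (POSIX-style, relative to the repo root).
// Returns null when the result would sit above the repository root.
function resolveInRepo(fromDir: string, relative: string): string | null {
  const parts = fromDir.split("/").filter((p) => p.length > 0);
  for (const segment of relative.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      if (parts.length === 0) return null; // escaped the repo root
      parts.pop();
    } else {
      parts.push(segment);
    }
  }
  return parts.join("/");
}

// Monorepo layout vs. mirror layout (after stripPrefix "public/agents").
const monorepoDir = "public/agents/packages/agents-email";
const mirrorDir = "packages/agents-email";

console.log(resolveInRepo(mirrorDir, "../../../../tsconfig.base.json")); // null
console.log(resolveInRepo(monorepoDir, "../../tsconfig.base.json")); // public/agents/tsconfig.base.json
console.log(resolveInRepo(mirrorDir, "../../tsconfig.base.json")); // tsconfig.base.json
```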

* docs(public-agents): document tsconfig.base.json convention for new packages

* docs(tsconfig): drop em dashes in new section to match repo writing style

GitOrigin-RevId: 89ee740d87232ae68cb8195558c1fb1af7b2a462

* chore(ci): remove redundant public-repo ci.yml and cypress.yml (inkeep#211)

* chore(ci): remove redundant public-repo ci.yml and cypress.yml

All lint/typecheck/test/build/Cypress validation already runs on agents-private
pre-merge via Core Validation, Extended Validation, and public-agents-cypress.
The public-side duplicates re-ran the same checks on Copybara sync PRs (code
already exhaustively validated), costing ~30m (ci) + ~15m (cypress) per sync
on ubuntu-32gb runners.

External PRs to inkeep/agents bridge back to agents-private via
monorepo-pr-bridge.yml for canonical validation, so no coverage is lost.

- Delete public/agents/.github/workflows/ci.yml
- Delete public/agents/.github/workflows/cypress.yml
- Delete orphaned composite actions (changeset-check, cypress-e2e)
- Update CI.md workflow map, parity table, branch protection
- Update CI_ARCHITECTURE.md install composite-action reference
- Update cypress-e2e composite README (agents-private only caller)
- Update internal-surface-areas skill to point at upstream workflows

Coordinated with CTO: 'ci' and 'Cypress E2E Tests' required checks removed
from inkeep/agents branch protection.

* chore(ci): also remove redundant public-repo ci-maintenance.yml

With ci.yml and cypress.yml gone, the public repo has no substantive CI
for the weekly CI Maintenance Claude job to analyze. The equivalent
analysis runs on agents-private via public-agents-ci-maintenance.yml,
which sees the real CI surface.

- Delete public/agents/.github/workflows/ci-maintenance.yml
- Update CI.md workflow map + parity table
- Update internal-surface-areas skill

* chore(ci): clean up stale ci.yml references flagged by PR review

- Update two stale comments in public-agents-extended-validation.yml
  that referenced the now-deleted public/agents ci.yml
- Delete obsolete public/agents/specs/changeset-only-skip-ci/SPEC.md;
  the changeset-skip feature it documented lived inside ci.yml and the
  changeset-check composite action, both removed in this PR

GitOrigin-RevId: 63d06e27c8a374e100270f3118f64cd2170e0d6a

* fix(ci): close remaining silent-failure gaps in release cascade (inkeep#212)


GitOrigin-RevId: 8c2e367004865bfe09daa1867296826c8b6c9db0

---------

Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: shagun-singh-inkeep <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>