Update claude.yml #194
Claude finished @amikofalvy's task

PR Review: Update claude.yml

As requested, I'm conducting a comprehensive engineering review of this PR from both granular and system architecture perspectives.

Todo List:
@claude can you make a GET request to

Claude encountered an error

I'll analyze this and get back to you.
…ded (#194)

* fix(ci): always reset copybara/sync on every mirror run

Closes #188

Drop the "leave branch in place if open PR is < STALE_PR_HOURS" branch in the mirror sync reconcile step. Letting Copybara "append" to an existing copybara/sync was never safe: the Copybara config uses fetch=main, so every run baselines off inkeep/agents main's last GitOrigin-RevId. When a new push lands on agents-private main while a prior sync PR is still open, Copybara rebuilds the older origin change from main's HEAD (new SHA due to timestamps) and the non-force push to copybara/sync is rejected as non-fast-forward. This is the failure mode that blew up the release cascade in #188 (Version Packages #185 merged while #3166 was still open 9 minutes after being created).

Every mirror run now closes any open sync PR and deletes copybara/sync before Copybara runs, so each run pushes a fresh history. The concurrency group already serializes runs and every new run includes all accumulated changes since the last imported revision, so no information is lost. PR churn (one inkeep/agents sync PR per agents-private main push) is the cost, and it is much cheaper than a stuck release cascade.

CI_RUNBOOK gets a new entry for this specific failure string so future red runs route to the fix without a re-investigation.

* fix(ci): harden release cascade against silent strandings

Bundled on top of the copybara/sync reset in this PR so the whole release path (mirror sync -> npm publish -> GH Release -> Vercel prod deploy -> scheduler restart) can run end-to-end with no human intervention. Each fix closes a distinct silent-stranding mode.

1. public-mirror-sync.yml Create-PR guard

- Reconcile now always deletes copybara/sync before Copybara runs, which introduced a regression: when Copybara exits 4 (no changes to sync, eg. workflow_dispatch with an idle main), the branch is gone and the next `gh pr create --head copybara/sync` would fail. Add an explicit branch-existence check; short-circuit cleanly.
- Add explicit --state open to the gh pr list call. Defaults to open but being explicit prevents a future refactor from reintroducing the PR #184 bug class.
- Replace the PR number extraction `grep -o '[0-9]*$'` on the PR URL with gh pr view --json number. gh's stdout format is not a contract.

2. private-agents-ui-version-packages.yml publish detection

- Was parsing `Publishing "X" at "Y"` via grep/sed on the changesets log, which is the exact fragility PR #174 removed from public release.yml. If changesets v2 changes format, published=false is written despite a successful publish, the widget-release dispatch is skipped, and agents-docs changelog silently desyncs.
- Use the stable "packages published successfully" presence marker and read the version from package.json (authoritative for a fixed release group).

3. public/agents/.github/workflows/release.yml catch-all + dispatch retry

- `Notify agents-private (failure)` was gated on `steps.detect.outputs.has_changesets == 'false'`. If the workflow failed before the detect step ran (install, build, token gen), has_changesets is unset and the condition evaluated false -> no dispatch, no tracking issue on agents-private, red run sitting invisibly in the Actions tab. Drop the has_changesets gate.
- Replace peter-evans/repository-dispatch with a bash retry loop (3 attempts, 5/10s backoff). The action has no built-in retry, so a transient 5xx or rate-limit during the post-publish dispatch loses the signal permanently: npm publishes, but no GH Release is created and no Vercel prod deploy fires. Retry + explicit error on exhausted attempts so the stranding is loud, not silent.

4. public-agents-vercel-production.yml concurrency + failure tracker

- Add concurrency: vercel-production-deploy. DB migrations are not idempotent; two parallel deploys (eg. a release published while a manual re-dispatch is in flight) would race on migrate-databases and leave schema in a half-applied state.
- Add notify-on-failure job (mirrors the tracking-issue pattern from public-mirror-sync.yml). At this point npm has published, the GH Release exists, but prod runtime is stale. Needs to be loud: auto-open a "Vercel production deploy failing" issue so the half-shipped state is visible instead of buried in the Actions tab.

CI_RUNBOOK.md: reword the release/publish failure entries to match the new retry/tracking behavior, and add a new entry covering the post-publish deploy failure case.

Intentionally out of scope: the auto-format.yml + Dependabot `pnpm install --frozen-lockfile` race. Not a release-cascade issue, will go in a separate PR.

* docs(runbook): bold Historical marker for consistency

GitOrigin-RevId: 04ff8b544833e109b57f75ded3236730d7fb10eb
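As a rough illustration of the dispatch retry described in item 3 above (3 attempts, 5/10s backoff), a minimal shell sketch might look like the following. The function name `retry_with_backoff` and the `RETRY_SLEEP_CMD` override are hypothetical, introduced here only for illustration; this is not the loop from release.yml, and the commented usage line uses placeholder values.

```shell
#!/bin/sh
# Hypothetical helper (not taken from release.yml): run a command up to
# 3 times, waiting 5s after the first failure and 10s after the second.
# RETRY_SLEEP_CMD is an assumption added so the waits can be stubbed out.
retry_with_backoff() {
  max_attempts=3
  attempt=1
  while :; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Out of attempts: fail loudly instead of dropping the signal.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::command failed after ${max_attempts} attempts: $*" >&2
      return 1
    fi
    delay=$((attempt * 5))   # 5s, then 10s
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    "${RETRY_SLEEP_CMD:-sleep}" "$delay"
    attempt=$((attempt + 1))
  done
}

# Usage sketch (OWNER/REPO and the event name are placeholders):
#   retry_with_backoff gh api repos/OWNER/REPO/dispatches -f event_type=example-event
```

The explicit `::error::` line on exhaustion is what makes the stranding loud: the step exits non-zero and the failure is annotated on the run, rather than the workflow continuing without the dispatch.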
* Version Packages (agents) (#185)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
GitOrigin-RevId: 7263142a67ac9ce9c9873a68a5673bfb436dbc1c

* chore(copilot-app): remove redundant lockfile, install from monorepo root (#186)

* chore(copilot-app): remove redundant lockfile, install from monorepo root

copilot-app is a workspace member (pnpm-workspace.yaml line 18), so the root lockfile already resolves its dependencies. The second lockfile only existed because vercel.json used pnpm install --ignore-workspace --frozen-lockfile, which severs workspace context and therefore needed a local lockfile. Two install boundaries for the same app meant root pnpm.overrides did not apply to the Vercel install, so CI and Vercel could silently resolve to different dependency trees.

PR #167's description originally said "Vercel to install + build from the monorepo root via pnpm --filter copilot-app...", but the committed vercel.json drifted to --ignore-workspace. This aligns the implementation with the stated plan.

- Delete private/copilot-app/pnpm-lock.yaml
- Change private/copilot-app/vercel.json installCommand to install from the monorepo root with a workspace filter
- Drop the copilot-app entry from scripts/check-monorepo-traps.mjs and simplify the DUAL_LOCKFILE_ROOTS comment (every remaining entry is a true workspace boundary, so the ignoreWorkspace workaround is no longer needed for any of them)

* docs(private): update lockfile section after copilot-app cleanup

* chore: add install:all convenience script for dual-lockfile installs

* chore: include create-agents-template in install:all

* fix(copilot-app): drop redundant cd ../.. from vercel installCommand

* docs: point dual-lockfile guidance at pnpm install:all

This PR introduces the install:all script; update every doc that teaches the old cd-and-install-twice pattern to reference the shorthand instead.

- AGENTS.md (root) Dual lockfiles section: replaces the two-step pnpm install invocation with a single install:all, and lists all three lockfile scopes (root, public/agents, public/agents/create-agents-template) so readers understand what the shorthand covers.
- CI_RUNBOOK ERR_PNPM_OUTDATED_LOCKFILE: same substitution plus the third lockfile in the git add line.
- public/agents/AGENTS.md pnpm-lock.yaml Resolution Strategy: adds a When changing dependencies callout pointing at install:all, so readers inside the public/agents subtree know they have a root shortcut for the whole-monorepo regeneration.

* chore(check-monorepo-traps): drop dead ignoreWorkspace flag

Every DUAL_LOCKFILE_ROOTS entry is now a true workspace boundary that installs without --ignore-workspace. The flag had exactly one live consumer (private/copilot-app) which this PR removes. Simplify the data structure to an array of path strings and drop the now-unused flag branches in the install command and regen hint.

Also: the regen hint gains a pointer at the install:all shorthand, since that's the recommended path for a whole-monorepo resync.

* docs: comprehensive command cheatsheet + check:structural aggregate

The problem: every time a new shorthand is added (install:all, check:*) it lands in code but stays invisible in docs. People default to the raw cd-and-install form, which is how we drift. The cheatsheet is the fix for the drift-by-ignorance path.

Changes:

- Adds check:structural to root package.json - one command for the full structural guard set (boundaries + monorepo-traps + release-groups validate). Complements the existing pre-push hook which only runs check:monorepo-traps.
- Rewrites AGENTS.md 'Command routing' section as 'Command cheatsheet' with a scenario-driven quick-lookup table at top, then grouped by intent: install/lockfiles, build+dev+lint+typecheck+test, structural guards, changesets+releases, mirror/Copybara, parity, database.
- Documents the suffix convention (:agents, :agents-ui, :chat-to-edit, :inkeep-cloud-mcp, :copilot, :ext; no suffix = fan-out) so people can guess commands instead of memorizing.
- Every command gets a one-line description of what it does and when to reach for it.

* fix(check-monorepo-traps): guard the create-agents-template lockfile too

Docs introduced in this PR call out three lockfiles (root, public/agents, public/agents/create-agents-template) and point at install:all as the shorthand that regenerates them. The check only validated two; the starter-kit lockfile could drift silently and slip past the pre-push hook, surfacing for end users later when they cloned the starter.

Add public/agents/create-agents-template to DUAL_LOCKFILE_ROOTS and update the comment to reflect the actual install-boundary taxonomy (monorepo / Copybara+Vercel / standalone starter). install:all and the check now cover the same set.

* ci: gate publish on check:structural (defense-in-depth)

Required checks on the source PR already run check:structural, and both version-packages workflows check out origin/main before doing anything. In practice, publish always runs against a validated main state.

But 'in practice' isn't the same as 'structurally'. A workflow_dispatch run against main, an admin bypass of branch protection, or a future change that loosens merge requirements could let a misconfigured main reach the publish step without re-validation. Today's agents-ui release already surfaced one post-publish pipefail bug that shouldn't have been possible if we trusted the pipeline - this gate is the same intuition applied upstream.

Adds 'Validate structural invariants' step between Install and the release machinery in both private-agents-ui-version-packages.yml and public-agents-version-packages.yml. Runs pnpm check:structural, which aggregates check:boundaries + check:monorepo-traps + release-groups:validate (including the workspace-isolation guard introduced in #191).
Fails hard on any structural misconfig, refusing to publish. Cost: ~30-60s per publish run. Cheaper than a bad release.

GitOrigin-RevId: 684d52e5ab7734f592479b61e972cdfe5fc3ae23

* fix(ci): harden release cascade so copybara + npm publish run unattended (#194)

GitOrigin-RevId: 04ff8b544833e109b57f75ded3236730d7fb10eb

---------

Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…e + preview paths (inkeep#198) (inkeep#3172)

* fix(ci): close remaining silent strandings in release cascade + bridge + preview paths

Bundle of 10 code + 1 doc changes, each closing a distinct silent-failure mode where CI would pass green (or sit red invisibly) while production work stranded. Complement to inkeep#194's release-cascade hardening: each fix here addresses a gap that pass missed.

Release cascade (5)

- release-handler.yml: pin GitHub Release to client_payload.commit_sha, not origin/main. Auto-format landing between sync and handler used to create a SHA drift where Vercel deployed a newer commit than npm got.
- release-handler.yml: 3-attempt bash retry around gh release view/edit/create. Mirrors the dispatch-side retry from inkeep#174; one transient 5xx used to leave npm published with no Release (so no Vercel deploy) and no alert.
- release-handler.yml: notify-on-failure job. If the success job itself errors (app-token, checkout, or handler logic), the dispatch was already delivered and public-side retries don't re-fire. Without this notifier, npm publishes but the cascade stops silently.
- public-mirror-sync.yml: 3-attempt retry on sync PR approval. One dropped GH API call used to leave the sync PR open with no approval; now exhaustion exits non-zero so the workflow turns red and the janitor's sync-PR sweep (below) catches long-open PRs as a safety net.
- scripts/check-monorepo-traps.mjs: flip check:override-masks-bump from warn to hard-fail. Current tree is clean of masks, so flipping is safe and prevents future PR inkeep#170-class silent divergence.

Preview envs (2)

- deploy-vercel-preview.sh: validate captured deployment URL matches *.vercel.app pattern before alias-set. The log-grep fallback could silently capture a docs/inspection URL and alias api.preview.inkeep.com to the wrong place.
- public-agents-preview-environments.yml: add teardown-failure-notify job. Railway and Vercel teardowns run in parallel by design; this closes the observability gap when either silently fails: Railway continues billing, Vercel env vars pollute, and the next 6h janitor sweep re-attempts cleanup.

Dependency sync + bridge (2)

- dependabot-sync-root-lockfile.yml: on push failure, full-reset onto origin/$HEAD_REF and re-regenerate both lockfiles (root + public/agents) from the new baseline. Previous simple rebase would replay our stale lockfile on top of a different package.json after a Dependabot force-push, yielding inconsistent committed state that either broke --frozen-lockfile or silently shipped drift into main.
- bridge-public-pr-to-monorepo.mjs (both copies): truncate bridged PR body to fit GitHub's 65,536-char limit, with a link to the original PR. Was failing 422 "body is too long" on Dependabot mega-bumps, stranding those PRs outside the agents-private review surface.

Janitor (1)

- public-agents-preview-janitor.yml: new sweep-stale-sync-prs job. Every 6h, identify open copybara/sync PRs on inkeep/agents >4h old, open (idempotent) tracking issue on agents-private, re-dispatch public-mirror-sync.yml. Does NOT auto-close the PR; destructive actions stay human-driven.

Docs (1)

- CI_RUNBOOK.md: partial-npm-publish recovery path. Documents what to do when pnpm changeset publish fails on package N of 10 (some live, some not): confirm via npm view, re-run the failed job (publish is idempotent), verify all 10 present before unblocking cascade.

Each change verified in isolation with pnpm check:structural; YAML validated with safe_load; bridge truncation logic tested at boundary.

* fix(ci): add checkout step and remove heredoc indentation in release-handler notify

Addresses review feedback on PR inkeep#198:

- notify-success-failure now does a Checkout before gh calls, matching the failure job pattern (without it, gh CLI lacks git context and issue-create would silently fail with 'fatal: not a git repository').
- Body is now built with echo into a temp file instead of a heredoc. The original 10-space indent inside run: | would have been preserved literally by bash, making the tracking issue body unreadable.

* fix(ci): address review feedback on release-handler retry + docs drift

Review feedback from inkeep#198 (Claude + Pullfrog):

release-handler.yml:

- Capture stderr in gh_with_retry so on-call sees the actual API error (rate-limit, 5xx, auth) instead of a generic 'command failed' line. Was making incident triage 15-30m slower. (Claude Major)
- Stop wrapping 'gh release view' in gh_with_retry. It returns non-zero for the legitimate 'release doesn't exist' case (the normal new-release happy path); retrying burned 15s of backoff before falling through to create. Only mutating calls (edit/create) need retry. (Pullfrog + Claude)
- notify-success-failure now fails loudly (exit 1) if gh issue create fails, matching the failure job's pattern. Silent swallow would mean the success handler failed AND the notifier failed with no visible signal beyond a red check, exactly the class this PR closes elsewhere. (Claude Major)

Docs drift (Pullfrog):

- AGENTS.md: remove '(soft today)' / 'non-blocking today' annotations from override-masks-bump in both the cheatsheet and the trap-list.
- AGENTS.md trap inkeep#4: flip 'Currently a warning, not a hard fail' to 'Hard-fail (flipped...)' to match the code.
- CI_RUNBOOK.md: rename the override-masks-bump section header from 'warns (non-blocking today)' to 'fails — root override masks a workspace bump'.

Preview janitor (Claude Consider):

- Document that sweep-stale-sync-prs runs unconditionally regardless of PREVIEW_ENVIRONMENTS_ENABLED. Pr body said so; workflow didn't.

Bridge script (Claude Consider):

- console.log when PR body is truncated, with original + new sizes. Helps CI log debugging when a bridged PR looks shorter than expected. Applied to both copies (public/agents + public/agents-optional-local-dev).

Intentionally not addressed:

- dependabot-sync $MSG staleness after reset: Pullfrog flagged as intentional (message describes the operation, not the specific baseline), Claude flagged as cosmetic minor. Siding with Pullfrog.
- Preview teardown notify swallowed error: lower severity given the janitor re-attempts cleanup every 6h; accepting current pattern.

* fix(ci): pre-fetch PR base into agents-private before git apply --3way

Closes the bridge-failure class seen on inkeep#3171 (and eventually on any bridged PR with a conflicting hunk).

Root cause: `git apply --index --3way` resolves the patch's `index <old>..<new>` blob SHAs against the local repo's object store. The SHAs come from inkeep/agents' object graph; agents-private has never fetched from inkeep/agents, so those SHAs aren't resolvable. Clean patches work (no --3way fallback needed); any conflict triggers --3way, which fails with 'repository lacks the necessary blob to perform 3-way merge'; bridge stops with no useful diagnostic, PR strands outside the canonical review surface.

Fix: Before calling git apply, shallow-fetch the PR's base commit from the public repo into agents-private's object store. GitHub permits fetching by SHA when the SHA is reachable from a ref (PR base is on main, always reachable), and shallow=1 keeps it cheap. After fetch, --3way can find the `<old>` blob and resolve conflicts into merge markers instead of a hard error.

Why not drop --3way: Dropping --3way makes every conflicting PR fail hard with no recovery path. Pre-fetching preserves the --3way fallback so conflicting hunks produce visible conflict markers in the bridged PR, which is reviewable rather than opaque.

Failure handling: If the fetch itself fails (rate-limit, network flake), we log and proceed. The subsequent git apply will hit the same blob error as before the fix: no regression, just no improvement for that run.

Applied to both script copies: public/agents and public/agents-optional-local-dev. Bundled into PR inkeep#198 since it's the same release/sync hardening scope.

* Revert "fix(ci): pre-fetch PR base into agents-private before git apply --3way"

This reverts commit e7e0640e2345bbd7cdea1185eb3abc666f2c60fa.

* fix(ci): switch bridge patch fetch from .patch to .diff media type

Root cause of bridge failures on multi-commit PRs (eg inkeep#3171, 9 commits): `gh api .../pulls/N Accept: application/vnd.github.patch` returns a mailbox-format patch series: one patch per PR commit, each with `index <old>..<new>` lines referencing INTERMEDIATE blob SHAs created during the PR's history on inkeep/agents. Those intermediates exist only in inkeep/agents' object store; agents-private has never seen them.

The first patch applies against agents-private's file (blob content matches the PR base even if SHAs differ), git's index advances to the intermediate state (only referenced by SHA, never fetched as an object). Any subsequent conflicting hunk triggers `git apply --3way` fallback, which needs the intermediate blob → 'repository lacks the necessary blob' → stranded PR.

Single-commit PRs don't hit this (no intermediates). That's why the bridge used to work for most Dependabot bumps but started failing on longer branches like inkeep#3171.

Fix: switch `Accept: application/vnd.github.patch` → `Accept: application/vnd.github.diff` in both bridge-public-pr-to-monorepo.mjs copies. .diff returns a single consolidated base-vs-head unified diff with only the PR's base blob SHAs referenced. Those are content-identical to agents-private's copies (Copybara mirrors blobs 1:1 without content rewrites for .ts files), so --3way can resolve them locally from agents-private's own object store.

Also updated githubRequest's text-response check to include .diff alongside .patch (both are text, not JSON).

Previous pre-fetch commit (`e7e0640e2`) was reverted in a prior commit; it was addressing the wrong layer (only PR-base blob, not intermediates). This commit supersedes that approach entirely.

* chore(ci): biome format fix on bridge script console.log

GitOrigin-RevId: a26d5dac74ed9938fc512703af21a25d82616504

Co-authored-by: Varun Varahabhotla <[email protected]>
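For the preview-env guard described above (validate that the captured deployment URL matches *.vercel.app before the alias step), one possible shape of the check is sketched below. The function name `is_vercel_deployment_url` is hypothetical, and the exact rules (https only, bare hostname, no path or query) are assumptions for illustration rather than the logic in deploy-vercel-preview.sh.

```shell
#!/bin/sh
# Hypothetical validator (not taken from deploy-vercel-preview.sh):
# accept only https://<something>.vercel.app with no path, query, or
# whitespace, so a docs/inspection URL scraped from logs is rejected.
is_vercel_deployment_url() {
  url="$1"
  host="${url#https://}"
  # If nothing was stripped, the https:// prefix was missing.
  [ "$host" != "$url" ] || return 1
  case "$host" in
    *[!A-Za-z0-9.-]*) return 1 ;;  # '/', '?', spaces, newlines, etc.
    .*|-*) return 1 ;;             # must start with a label character
    *?.vercel.app) return 0 ;;     # at least one label before .vercel.app
    *) return 1 ;;
  esac
}

# Usage sketch: bail out before aliasing if the captured value looks wrong.
#   is_vercel_deployment_url "$DEPLOYMENT_URL" || { echo "unexpected URL" >&2; exit 1; }
```

Rejecting anything with a path is the part that matters for the failure mode in the commit message: a URL such as a docs or inspection link would otherwise pass a loose substring match and be aliased to the preview domain.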
No description provided.