
ci: add preview janitor and recreate control#2930

Merged
vnv-varun merged 7 commits into main from varun/preview-state-hardening
Apr 1, 2026

Conversation

@vnv-varun
Contributor

@vnv-varun vnv-varun commented Mar 31, 2026

Summary

Harden preview environment lifecycle management so stale preview state is repaired automatically, individual PR previews can be force-recreated when they drift, and preview auth bootstrap fails faster and more transparently when SpiceDB is not actually ready yet.

Changes

  • add a scheduled/manual Preview Janitor workflow that cleans stale Railway pr-* environments and can also run stale Vercel preview env-var cleanup
  • add a Railway janitor script that enumerates pr-* environments, maps them back to GitHub PR state, deletes merged/closed PR envs, and verifies deletion
  • add a manual recreate_preview workflow input so a PR can force teardown and reprovision of its Railway preview environment when reused preview state has drifted
  • harden recreate_preview against Railway's delete/create name-reuse race by retrying create when Railway still reports the just-deleted env name as existing
  • harden teardown and cleanup helpers for missing/disappearing environments, repo-scoped GitHub queries, deletion failure accounting, and clearer exit behavior
  • add structured preview phase logging and wait-loop logging across provisioning, bootstrap, and teardown so future timeouts are diagnosable from CI logs
  • gate preview auth bootstrap on the Railway spicedb service deployment being ready, not just env existence / proxy creation
  • keep the existing SpiceDB schema probe as the final functional readiness check, but bound the retry budget so failures surface in minutes rather than dragging toward the full job timeout

Why

We were still seeing three classes of preview-state problems:

  1. Reused pr-* environments could drift and require manual cleanup or full reprovision.
  2. Close-event teardown was still event-driven, so if it failed once, stale preview environments remained indefinitely.
  3. Bootstrap Preview Auth could still race SpiceDB startup even after Railway provisioning had succeeded, leading to UNAVAILABLE, ECONNRESET, and Failed to connect failures during the schema probe.

This PR addresses all three.

  • recreate_preview gives a first-class recovery path for one PR.
  • the janitor gives a system-wide repair path for missed teardown events.
  • the new SpiceDB deployment gate fixes the remaining readiness race at the right layer instead of only piling on more client retries.

It also closes one Railway-specific edge case in the recreate flow itself: after deleting a preview environment, Railway can briefly reject a fresh create with "An environment with that name already exists" even while the environment lookup returns absent. The recreate path now treats that as a bounded retry case instead of failing the run.
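
The bounded-retry shape described above can be sketched as a generic helper. This is an illustrative stand-in, not the actual recreate-path code: retry_with_backoff and its simple linear backoff are hypothetical, and the real logic in provision-railway.sh also re-checks environment absence between attempts.

```shell
# Illustrative sketch only; the real recreate path lives in
# provision-railway.sh and differs in detail.
retry_with_backoff() {
  local max_attempts="$1"; shift   # remaining args: command to retry
  local attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if "$@"; then
      return 0                     # command succeeded
    fi
    echo "attempt ${attempt}/${max_attempts} failed; backing off" >&2
    sleep "${attempt}"             # grows per attempt; real code also adds jitter
    attempt=$((attempt + 1))
  done
  return 1                         # retry budget exhausted
}

# Usage (create_env is a hypothetical command that can transiently fail):
# retry_with_backoff 6 create_env "pr-1234"
```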

Confirmed Stale Railway Envs

Before cleanup, the Railway project still had these stale preview environments for merged PRs:

  • pr-2901
  • pr-2902
  • pr-2906
  • pr-2907
  • pr-2908
  • pr-2909

I confirmed each of those PRs is closed/merged in GitHub.

After the live cleanup, a follow-up janitor dry-run reported 19 Railway preview environments, all tied to open PRs, with 0 stale environments left to delete.

What Caused Those Misses

Those stale environments all came from a transition window before the earlier preview hardening had landed on main:

  • #2900 merged at 2026-03-30T21:31:52Z
  • #2921 merged at 2026-03-31T02:14:47Z

All six stale PRs above were merged before those fixes were on main. Their close-event Preview Environments workflow did fire, but the run failed in the old Teardown Tier 1 (Railway) path immediately after merge, which left the Railway env behind.

This PR accounts for that class of miss in two ways:

  • a hardier teardown path with better logs
  • a scheduled/manual janitor that repairs missed teardown events after the fact

SpiceDB Readiness Fix

The remaining preview flake we were still seeing was not Railway provisioning itself. Recent failures on other PRs showed this pattern:

  • Railway env creation passed
  • DB migrations passed
  • then wait-for-spicedb.ts timed out with repeated UNAVAILABLE, ECONNRESET, or Failed to connect

The problem was that env existence and TCP proxy readiness were not strong enough readiness signals. SpiceDB could still be deploying even though the proxy existed.

This PR now:

  • waits for the Railway spicedb service instance to report a ready deployment state first
  • then runs the existing schema-readiness probe with a bounded preview-only retry budget
  • reduces the chance of long silent waits by flattening the retry model and reducing the bootstrap timeout window
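
The gate-then-probe sequencing above can be sketched as a polling loop. This is a hedged approximation: the real helper, railway_wait_for_service_deployment_ready in common.sh, queries Railway's GraphQL API, whereas fetch_status here is a hypothetical stand-in.

```shell
# Sketch of the readiness-gate shape; fetch_status is a stand-in for the
# actual Railway deployment-status query.
wait_for_deployment_ready() {
  local max_polls="$1" interval="$2" poll=1 status
  while [ "${poll}" -le "${max_polls}" ]; do
    status="$(fetch_status)"
    case "${status}" in
      SUCCESS|SLEEPING)                 # ready: proceed to the schema probe
        return 0 ;;
      FAILED|CRASHED|REMOVED|SKIPPED)   # terminal: fail fast instead of polling on
        echo "deployment ended in terminal state ${status}" >&2
        return 1 ;;
      *)                                # in-progress (BUILDING, DEPLOYING, ...)
        ;;
    esac
    sleep "${interval}"
    poll=$((poll + 1))
  done
  echo "timed out waiting for deployment readiness" >&2
  return 1
}
```

Treating FAILED/CRASHED-style states as immediate failures is what bounds the wait: a broken deployment surfaces in one poll rather than consuming the whole retry budget.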

What We Verified

  • bash -n passes for the changed preview shell scripts
  • workflow YAML parses successfully for both preview workflows
  • git diff --check passes
  • the new Railway janitor script ran successfully against the live agents Railway project in dry-run mode before cleanup
  • the same Railway janitor script was then run live and deleted stale merged-PR environments that had been left behind
  • the new Railway SpiceDB deployment gate was exercised directly against a live preview env and returned SUCCESS
  • the Railway recreate name-reuse race was reproduced live against a disposable pr-* env and the patched recreate logic recovered successfully
  • the latest full PR run passed end to end, including:
    • Provision Tier 1 (Railway)
    • Bootstrap Preview Auth
    • Inject Vercel Preview Env Vars
    • Smoke Test Preview URLs
  • the latest manual workflow_dispatch run with recreate_preview=true also passed end to end on the updated head

Test Plan

  • bash -n .github/scripts/preview/*.sh for the changed preview scripts
  • parse .github/workflows/preview-environments.yml
  • parse .github/workflows/preview-janitor.yml
  • git diff --check
  • live dry-run of cleanup-stale-railway-envs.sh against the Railway project
  • live cleanup of confirmed stale merged-PR Railway preview environments
  • live verification of the Railway SpiceDB deployment gate against a healthy preview env
  • live reproduction of the Railway recreate name-reuse race plus recovery validation on a disposable pr-* env
  • end-to-end passing Preview Environments run on this PR after the final SpiceDB readiness changes
  • confirm the latest workflow_dispatch run with recreate_preview=true completes on the updated head
  • run the new Preview Janitor workflow in GitHub after merge

Notes

  • pnpm check is still failing in this workspace because @inkeep/agents-api test boot requires INKEEP_AGENTS_MANAGE_DATABASE_URL and INKEEP_AGENTS_RUN_DATABASE_URL to be set. That failure is repo/test-environment baseline and unrelated to the preview-shell/workflow changes in this PR.
  • Scheduled janitor runs intentionally perform real cleanup. Manual dispatch defaults to dry-run.

@vercel

vercel Bot commented Mar 31, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Ready | Preview, Comment | Apr 1, 2026 7:12pm |
| agents-docs | Ready | Preview, Comment | Apr 1, 2026 7:12pm |
| agents-manage-ui | Ready | Preview, Comment | Apr 1, 2026 7:12pm |


@changeset-bot

changeset-bot Bot commented Mar 31, 2026

⚠️ No Changeset found

Latest commit: 3654e2d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pullfrog
Contributor

pullfrog Bot commented Mar 31, 2026

TL;DR — Adds a scheduled/manual "Preview Janitor" workflow that deletes stale Railway and Vercel preview environments left behind by missed close events, adds a recreate_preview dispatch input so a single PR can force a full teardown-and-reprovision, and hardens the entire preview lifecycle with SpiceDB deployment gating, transient-error retries, GraphQL error checking, environment-create retry logic, and structured logging.

Key changes

  • Add cleanup-stale-railway-envs.sh janitor script — enumerates all pr-* Railway environments, maps each to GitHub PR state, and deletes environments for closed/merged PRs with post-delete verification and per-deletion failure tracking.
  • Add preview-janitor.yml workflow — runs every 6 hours on a cron schedule (and manually) to invoke the Railway janitor and the Vercel stale-env cleanup, with dry-run default for manual dispatch.
  • Add recreate_preview workflow dispatch input — allows force-deleting and reprovisioning a PR's Railway environment when reused state has drifted.
  • Gate preview auth on SpiceDB deployment readiness — before probing gRPC, bootstrap-preview-auth.sh now polls Railway's deployment API until the SpiceDB service reaches SUCCESS or SLEEPING, preventing probes against a service that hasn't finished deploying.
  • Retry transient SpiceDB transport failures during bootstrap — wraps db:auth:init with run_with_transient_spicedb_retry, matching a curated pattern of transport error strings (UNAVAILABLE, ECONNRESET, ETIMEDOUT, etc.) with exponential backoff and jitter.
  • Add create_preview_environment retry wrapper — retries Railway environment creation up to 6 attempts with backoff, handling the race where Railway reports "already exists" after a delete by waiting for absence before retrying.
  • Add railway_require_graphql_success to mutation wrappers — railway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert now validate GraphQL responses and propagate failures instead of silently succeeding.
  • Expand common.sh shared helper library — adds ~15 new functions including railway_service_instance_json, railway_wait_for_service_deployment_ready, railway_ensure_tcp_proxy, vercel_list_preview_only_env_vars, redact_preview_logs, sleep_with_backoff_and_jitter, runtime_var_is_unresolved, and preview_should_log_wait_attempt.
  • Add preview_log structured logging — introduces a shared preview_log helper with UTC timestamps and adds phase-level log lines across provisioning, bootstrap, teardown, and wait loops.
  • Harden teardown for idempotency — teardown-railway.sh now checks environment existence before attempting teardown, handles mid-teardown disappearance gracefully, and verifies deletion with railway_wait_for_environment_absent.
  • Clarify workflow job names for skipped/conditional jobs — renames Preview Flow Disabled, Teardown Tier 1, and Teardown Vercel Preview Env Vars job names to indicate when each job actually runs (e.g. "PR close only", "runs only when previews are off").

Summary | 8 files | 7 commits | base: main ← varun/preview-state-hardening


Scheduled janitor for stale preview environments

Before: If a PR close event's teardown failed, the Railway pr-* environment was left behind indefinitely — cleanup was purely event-driven with no fallback.
After: A new cleanup-stale-railway-envs.sh script runs on a 6-hour cron (and manually) via preview-janitor.yml, cross-referencing every pr-* Railway environment against GitHub PR state and deleting environments for closed/merged PRs.

The janitor supports dry-run mode (default for manual dispatch), an opt-in flag to delete orphaned environments whose PR returns 404 (DELETE_UNKNOWN_PREVIEW_ENVS), and writes a summary table to GITHUB_STEP_SUMMARY with counts of kept, deleted, unknown, and errored environments. Individual deletion failures are tracked separately so a single failed delete does not abort the entire sweep. A companion Vercel job in the same workflow cleans branch-scoped preview env vars for stale branches.

How does the janitor decide what to delete?

It lists all Railway environments matching ^pr-[0-9]+$, extracts the PR number, queries the GitHub API for PR state, and categorizes each as open (keep), closed (delete), or missing (skip unless DELETE_UNKNOWN_PREVIEW_ENVS=true). Deletions are verified with railway_wait_for_environment_absent.
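
That decision logic can be condensed into a small categorization function. This sketch is hypothetical: the function and variable names are illustrative, and the real script additionally queries the GitHub API and verifies deletions, which are omitted here.

```shell
# Hypothetical condensation of the janitor's keep/delete/skip decision.
categorize_preview_env() {
  local env_name="$1" pr_state="$2"   # pr_state: open | closed | missing
  if ! [[ "${env_name}" =~ ^pr-[0-9]+$ ]]; then
    echo skip; return                 # not a preview environment
  fi
  case "${pr_state}" in
    open)   echo keep ;;              # PR still active
    closed) echo delete ;;            # merged/closed PR: stale environment
    *)                                # PR lookup returned 404
      if [ "${DELETE_UNKNOWN_PREVIEW_ENVS:-false}" = "true" ]; then
        echo delete
      else
        echo skip
      fi ;;
  esac
}
```

Keeping "missing" as skip-by-default means a transient GitHub API failure cannot cause the janitor to delete an environment for a PR that is actually open.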

cleanup-stale-railway-envs.sh · cleanup-vercel-stale-env.sh · preview-janitor.yml


Manual recreate_preview for drifted PR environments

Before: A PR with a drifted Railway environment required manual Railway console intervention to reprovision.
After: A recreate_preview boolean input on the workflow_dispatch trigger deletes the existing Railway environment and waits for it to be absent before creating a fresh one from the template.

The flag flows from the workflow input through compute-context outputs to provision-railway.sh via RECREATE_PREVIEW_ENV.

provision-railway.sh · preview-environments.yml


SpiceDB deployment readiness gate

Before: bootstrap-preview-auth.sh immediately probed SpiceDB gRPC readiness after Railway provisioning, sometimes hitting a service that hadn't finished deploying yet.
After: The script first polls Railway's deployment API via railway_wait_for_service_deployment_ready until the SpiceDB service reaches SUCCESS or SLEEPING, then proceeds to the gRPC readiness probe.

The gate is conditional — it only runs when RAILWAY_API_TOKEN, RAILWAY_PROJECT_ID, RAILWAY_SPICEDB_SERVICE, and PR_NUMBER are all set. Terminal deployment states (FAILED, CRASHED, REMOVED, SKIPPED) return failure immediately instead of polling until timeout.

bootstrap-preview-auth.sh · common.sh


Retry-resilient environment creation and transient error handling

Before: A single SpiceDB transport failure during db:auth:init failed the preview run; Railway environment creation was a single-shot call that could fail if a prior delete hadn't fully propagated.
After: create_preview_environment retries up to 6 attempts with exponential backoff, detecting "already exists" race conditions after delete and waiting for absence before retrying. run_with_transient_spicedb_retry retries db:auth:init for transient transport errors. railway_require_graphql_success validates every mutation response.

| Function | Purpose |
| --- | --- |
| create_preview_environment | Retries Railway environment creation with "already exists" race handling |
| run_with_transient_spicedb_retry | Retries commands that fail with transient SpiceDB/gRPC errors |
| railway_require_graphql_success | Asserts GraphQL response has no errors, returns non-zero on failure |
| sleep_with_backoff_and_jitter | Exponential backoff with random jitter via Python3 |
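
A backoff-with-jitter delay of the kind sleep_with_backoff_and_jitter provides could look like the following. This is a hedged sketch: the function name here mirrors the description above, but the actual formula and parameters in common.sh may differ.

```shell
# Sketch of capped exponential backoff with jitter, computed via python3.
backoff_with_jitter_seconds() {
  local attempt="$1" base="${2:-2}" cap="${3:-60}"
  python3 - "${attempt}" "${base}" "${cap}" <<'PY'
import random, sys
attempt, base, cap = int(sys.argv[1]), float(sys.argv[2]), float(sys.argv[3])
delay = min(cap, base * (2 ** (attempt - 1)))  # exponential growth, capped
delay *= 0.5 + random.random() / 2             # keep 50-100% of delay as jitter
print(f"{delay:.2f}")
PY
}

# Usage: sleep "$(backoff_with_jitter_seconds 3)"
```

The jitter matters in CI: it spreads out retries from concurrent preview jobs so they do not hammer the Railway API in lockstep.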

provision-railway.sh · bootstrap-preview-auth.sh · common.sh


Structured logging and teardown hardening

Before: Preview scripts used bare echo with no timestamps; teardown assumed the environment existed and failed on race conditions.
After: A shared preview_log function emits [preview][UTC timestamp] lines to stderr, and teardown-railway.sh checks existence before teardown, handles mid-teardown disappearance, and verifies deletion.

Wait loops now log on attempts 1, every 5th attempt, and the final attempt via preview_should_log_wait_attempt. The resolve_runtime_vars polling loop reports which variables remain unresolved via runtime_var_is_unresolved. A new redact_preview_logs filter strips PostgreSQL connection strings, secrets, auth cookies, and bearer tokens from log output.
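
The logging helpers described above can be sketched minimally. The output format and thresholds here approximate the description rather than copy common.sh exactly.

```shell
# [preview][UTC timestamp] message, sent to stderr so stdout stays parseable.
preview_log() {
  printf '[preview][%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >&2
}

# Log the first attempt, every 5th attempt, and the final attempt, so long
# wait loops stay diagnosable without flooding the CI log.
preview_should_log_wait_attempt() {
  local attempt="$1" max_attempts="$2"
  [ "${attempt}" -eq 1 ] ||
    [ "$((attempt % 5))" -eq 0 ] ||
    [ "${attempt}" -eq "${max_attempts}" ]
}
```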

common.sh · teardown-railway.sh · provision-railway.sh


Clearer workflow job names for conditional steps

Before: Workflow job names like Teardown Tier 1 (Railway) didn't indicate their run conditions, making it unclear why they were skipped in logs.
After: Job names now include their trigger conditions — e.g. Teardown Tier 1 (Railway, PR close only) and Preview Disabled Notice (runs only when previews are off).

preview-environments.yml

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus

Contributor

@pullfrog pullfrog Bot left a comment


Low urgency. Solid operational hardening — the janitor, recreate path, SpiceDB retry, and structured logging all address real gaps. Two minor items below; neither is blocking.

Pullfrog | View workflow run | Using Claude Opus

Comment on lines +238 to +242
runtime_var_is_unresolved() {
  local value="${1:-}"

  [ -z "${value}" ] || printf '%s' "${value}" | grep -q '\$[{][{]'
}
Contributor


Nit (readability): runtime_var_is_unresolved is defined here on line 238 but first referenced inside the resolve_runtime_vars function body starting at line 187. This works because bash evaluates function bodies lazily — resolve_runtime_vars isn't called until line 245 — but the ordering is surprising and fragile. Moving the definition above resolve_runtime_vars() (or into common.sh alongside the other helpers) would make the dependency explicit and prevent a future refactor from breaking call order.

Not blocking — the current code is correct at runtime.

RAILWAY_PROJECT_ID: ${{ vars.RAILWAY_PROJECT_ID }}
GH_TOKEN: ${{ github.token }}
GITHUB_REPOSITORY: ${{ github.repository }}
DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run && 'true' || 'false' }}
Contributor


Confirm intent: On schedule triggers, inputs.dry_run is undefined so this expression evaluates to 'false', meaning scheduled runs execute real deletes. The PR description indicates this is intentional (the janitor is meant to repair missed teardowns automatically). Just flagging for visibility — if you ever want scheduled runs to default to dry-run, this expression would need adjustment.


Copilot AI left a comment


Pull request overview

This PR hardens the CI automation around ephemeral preview environments by adding a scheduled/manual “janitor” cleanup path and a manual “recreate preview” path, plus improved logging and retries to make long-running preview steps more diagnosable and resilient.

Changes:

  • Add a new scheduled/manual Preview Janitor workflow to delete stale Railway pr-* environments and stale Vercel branch-scoped env vars.
  • Add a manual recreate_preview input to force teardown + reprovision of a PR’s Railway preview environment.
  • Improve preview script resilience and diagnosability via structured phase logging, safer teardown, and transient SpiceDB transport retries.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| .github/workflows/preview-janitor.yml | New scheduled/manual janitor workflow for Railway + Vercel stale preview cleanup. |
| .github/workflows/preview-environments.yml | Adds recreate_preview input plumbed into Railway provisioning. |
| .github/scripts/preview/common.sh | Adds preview_log, wait-loop logging helpers, safer JSON edge handling, and env-deletion wait helper. |
| .github/scripts/preview/provision-railway.sh | Adds "recreate existing env" path and more structured provisioning logs. |
| .github/scripts/preview/teardown-railway.sh | Hardens teardown for missing/disappearing envs and uses shared deletion-wait helper. |
| .github/scripts/preview/bootstrap-preview-auth.sh | Retries transient SpiceDB transport failures and adds more phase logging. |
| .github/scripts/preview/cleanup-vercel-stale-env.sh | Ensures PR lookup is repo-scoped when GITHUB_REPOSITORY is set. |
| .github/scripts/preview/cleanup-stale-railway-envs.sh | New Railway janitor script to enumerate pr-* envs, map to PR state, and delete stale envs with verification. |


Comment on lines +23 to +32
status="$(
curl -sS \
--connect-timeout 10 \
--max-time 30 \
-o "${response_file}" \
-w '%{http_code}' \
-H "Authorization: Bearer ${GH_TOKEN}" \
-H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/${GITHUB_REPOSITORY}/pulls/${pr_number}"
)"

Copilot AI Mar 31, 2026


The GitHub PR lookup is hard-coded to https://api.github.com/.... GitHub Actions exposes GITHUB_API_URL (and this also matters for GHES); consider building the endpoint from ${GITHUB_API_URL} (fallbacking to api.github.com) to avoid breaking the janitor in non-public-GitHub contexts.

Contributor

@claude claude Bot left a comment


PR Review Summary

(3) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

Inline Comments:

  • 🟠 Major: cleanup-stale-railway-envs.sh:59 Deletion errors silently ignored — return status not checked
  • 🟠 Major: cleanup-stale-railway-envs.sh:100-105 Deletion failures not tracked in exit code — janitor may exit 0 despite failed cleanups

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: preview-environments.yml:389-390 Unused env vars passed to teardown job

💭 Consider (2) 💭

💭 1) provision-railway.sh:238 Function definition ordering

Issue: runtime_var_is_unresolved() is defined after resolve_runtime_vars() which references it.
Why: While this works correctly in bash (functions are resolved at call time, not definition time), the code structure is confusing. If someone reorders the script, it could break.
Fix: Move runtime_var_is_unresolved() definition to before line 167 for clearer declaration-before-use ordering.

💭 2) preview-janitor.yml:56 Scheduled runs delete without dry-run

Issue: The DRY_RUN expression evaluates to false for scheduled runs, meaning the cron job will perform actual deletions.
Why: This differs from the manual dispatch default (dry_run: true). While this appears intentional for automated cleanup, it could surprise operators.
Fix: Add a comment documenting this intentional behavior, or change to default dry-run for scheduled runs if that's preferred.


💡 APPROVE WITH SUGGESTIONS

Summary: This is a solid infrastructure hardening PR that closes real gaps in preview environment lifecycle management. The janitor workflow and recreate path are well-designed. The two Major findings around error tracking in the janitor script should be addressed to ensure operators can trust the exit code — without this, failed cleanups could silently accumulate. The other items are minor polish.

Discarded (3)

| Location | Issue | Reason Discarded |
| --- | --- | --- |
| provision-railway.sh:238 | Function runtime_var_is_unresolved defined after use causes runtime error | False positive — bash resolves function references at call time, not definition time. The function IS defined before resolve_runtime_vars is called. Reclassified as Consider (code style). |
| common.sh:27 | String equality instead of numeric comparison | Works correctly for numeric strings; purely stylistic preference |
| bootstrap-preview-auth.sh:42 | Retry pattern may miss some transient errors | Low risk — current pattern covers common failure modes; adding more patterns is speculative |
Reviewers (3)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-errors | 5 | 0 | 0 | 0 | 2 | 0 | 3 |
| pr-review-devops | 8 | 0 | 2 | 0 | 1 | 0 | 5 |
| pr-review-standards | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| Total | 15 | 0 | 2 | 0 | 3 | 0 | 10 |

Note: Standards reviewer findings were duplicates of devops findings (function order, dry-run behavior).

preview_log "Deleting stale Railway preview environment ${env_name}."
railway_environment_delete_by_id "${env_id}" >/dev/null
railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}
Contributor


🟠 MAJOR: Deletion errors are silently ignored

Issue: railway_environment_delete_by_id return status is not checked. If deletion fails, the function proceeds to railway_wait_for_environment_absent which will eventually fail with a generic timeout message.

Why: When debugging deletion failures, operators see "Railway environment still exists after waiting for deletion" but won't know why (permission denied, API error, rate limit). The actual Railway API error is discarded.

Fix: Check the return status before proceeding:

delete_env_and_verify() {
  local env_id="$1"
  local env_name="$2"

  preview_log "Deleting stale Railway preview environment ${env_name}."
  if ! railway_environment_delete_by_id "${env_id}"; then
    echo "Failed to delete Railway environment ${env_name}." >&2
    return 1
  fi
  railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}

Comment on lines +100 to +105
preview_log "[dry-run] Would delete stale Railway preview environment ${env_name}."
else
delete_env_and_verify "${env_id}" "${env_name}"
deleted=$((deleted + 1))
deleted_names+=("${env_name}")
fi
Contributor


🟠 MAJOR: Deletion failures not tracked in exit code

Issue: When delete_env_and_verify fails during the janitor loop, the error is not counted. The final exit code (line 153) only reflects GitHub API lookup errors (errors variable), not Railway deletion failures.

Why: The janitor could fail to delete multiple Railway environments due to API issues, but exit with code 0 if all GitHub PR lookups succeeded. Operators checking CI may incorrectly believe all targeted environments were cleaned up.

Fix: Add a deletion_failures counter:

# Add at line 77 with other counters:
deletion_failures=0

# In the delete blocks (lines 100-105 and 113-118):
if ! delete_env_and_verify "${env_id}" "${env_name}"; then
  deletion_failures=$((deletion_failures + 1))
else
  deleted=$((deleted + 1))
  deleted_names+=("${env_name}")
fi

# Update exit check at line 153:
if [ "${errors}" -gt 0 ] || [ "${deletion_failures}" -gt 0 ]; then
  echo "Encountered ${errors} GitHub lookup error(s) and ${deletion_failures} Railway deletion failure(s)." >&2
  exit 1
fi

@github-actions github-actions Bot deleted a comment from claude Bot Mar 31, 2026
Contributor

@claude claude Bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

✅ Prior Feedback Addressed

All issues from the previous review have been resolved in commit 8973aad63:

| Prior Issue | Resolution |
| --- | --- |
| 🟠 Major: Deletion errors silently ignored | ✅ Fixed — delete_env_and_verify now checks return status and returns 1 on failure (L58-61) |
| 🟠 Major: Deletion failures not tracked in exit code | ✅ Fixed — Added deletion_failures counter, tracked in loop, checked at exit (L82, L107-112, L163-165) |
| 🟡 Minor: Unused env vars passed to teardown | ✅ Fixed — RAILWAY_TEMPLATE_ENVIRONMENT and RAILWAY_OUTPUT_SERVICE removed from teardown job |
| 💭 Consider: Function definition ordering | ✅ Fixed — runtime_var_is_unresolved() moved to L121-125, before resolve_runtime_vars() which uses it |
| 💭 Consider: Scheduled runs delete without dry-run (undocumented) | ✅ Fixed — Comment added at L57 documenting intentional behavior |
| 🔵 Copilot: Hard-coded GitHub API URL | ✅ Fixed — GITHUB_API_URL now configurable with fallback (L16, L32) and passed from workflow (L55) |

Additional Improvements in Delta

The delta also includes GraphQL error handling improvements in common.sh:

  • railway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert now properly check for GraphQL errors and return failure status
  • This hardens the entire Railway mutation surface against silent GraphQL-level failures

✅ APPROVE

Summary: All prior review feedback has been addressed comprehensively. The janitor now properly tracks and reports deletion failures, function ordering is cleaner, and the behavior of scheduled vs manual runs is documented. The additional GraphQL error handling improvements further strengthen the reliability of the preview infrastructure. Ship it! 🚀

Reviewers (0)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — no new reviewer dispatches needed. All prior issues resolved.

Contributor

@claude claude Bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

This is a delta review scoped to the 2 commits since the last automated review (3b0164a67, 64b34fd64).

✅ Delta Changes Look Good

The delta adds a Railway SpiceDB deployment readiness gate before probing SpiceDB gRPC readiness. Key observations:

| Change | Assessment |
| --- | --- |
| railway_wait_for_service_deployment_ready() | Well-implemented with proper terminal state handling (FAILED, CRASHED, REMOVED, SKIPPED) |
| Configurable curl timeouts in railway_graphql() | Good improvement for polling queries |
| Timeout reduction (20→10 min) | Tight but adequate (~6.5 min worst case calculated) |
| Removal of retry wrapper from SpiceDB readiness probe | Correct — now redundant with deployment gate |

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: bootstrap-preview-auth.sh:112 Add logging when Railway deployment gate is bypassed

🕐 Pending Recommendations (0)

All issues from the previous review were addressed. No pending items.


✅ APPROVE

Summary: The delta changes are solid infrastructure hardening. The new deployment readiness gate properly sequences the SpiceDB gRPC probe after Railway reports the deployment is ready, which addresses the root cause of transient bootstrap failures. Terminal deployment states are handled correctly with descriptive error messages. Ship it! 🚀

Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-devops | 3 | 0 | 0 | 0 | 0 | 0 | 3 |
| pr-review-errors | 3 | 0 | 1 | 0 | 1 | 0 | 1 |
| Total | 6 | 0 | 1 | 0 | 1 | 0 | 4 |

Note: Delta review — reviewer output was lighter given limited scope. Devops findings were informational (timeout math, cosmetic logging). Error reviewer's single-attempt query concern was duplicate of devops finding. Unhandled deployment status finding was overly defensive (current behavior is reasonable).

Discarded (4)

| Location | Issue | Reason Discarded |
| --- | --- | --- |
| preview-environments.yml:230 | Timeout reduced from 20 to 10 minutes | Valid observation but math works out (~6.5 min worst case fits in 10 min budget). Intentional budget tightening. |
| common.sh:417 | Single-attempt GraphQL query in polling loop | Outer polling loop provides resilience; single-attempt per poll iteration is intentional for efficiency |
| common.sh:478 | Unhandled deployment statuses continue polling | This is correct behavior for in-progress states (BUILDING, DEPLOYING, etc.). Current fallthrough is intentional. |
| bootstrap-preview-auth.sh:129 | Step summary logs gate even when skipped | Cosmetic; shows intended service name regardless of gate execution |

15 \
4
echo "::endgroup::"
fi
Contributor


💭 Consider: Add logging when Railway deployment gate is bypassed

Issue: When Railway env vars (RAILWAY_API_TOKEN, RAILWAY_PROJECT_ID, RAILWAY_SPICEDB_SERVICE, PR_NUMBER) are not all set, the deployment gate block (lines 99-112) is silently skipped with no log output.

Why: If the SpiceDB readiness probe fails downstream, operators reviewing CI logs won't see why the deployment gate was bypassed, making it harder to diagnose whether the issue is missing config vs. actual deployment problems.

Fix: Add an else branch to clarify when the gate is intentionally skipped:

else
  preview_log "Skipping Railway deployment gate (Railway env vars not fully configured)."
fi

This is optional since the gate is designed to be conditional, but explicit logging aids future debugging.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 1, 2026
@claude claude Bot left a comment

PR Review Summary

(0) Total Issues | Risk: Low

This is a delta review scoped to the 1 commit since the last automated review (99802da8f).

✅ Delta Changes Look Good

The delta adds a create_preview_environment() retry wrapper that hardens Railway environment creation against API race conditions after delete. Key observations:

| Change | Assessment |
| --- | --- |
| create_preview_environment() retry wrapper | Well implemented, with up to 6 attempts and exponential backoff |
| "Already exists" race condition handling | Correctly checks whether the env actually exists after a create failure; handled gracefully |
| Error capture via temp file | Proper cleanup in all code paths; clear error messages to stderr |
| railway_wait_for_environment_absent with \|\| true | Intentional — the wait is best-effort before retry, not a hard requirement |
| Variable rename status → http_status | Cosmetic clarity improvement; no behavior change |

Implementation Quality

The new function demonstrates solid defensive programming:

  • Idempotency-aware: Treats "create returned error but env now exists" as success
  • Clear logging: Each attempt logged with preview_log, race condition retries explained
  • Proper backoff: Uses sleep_with_backoff_and_jitter 2 "${attempt}" 10 (2s base, 10s cap)
  • Fail-fast on real errors: Non-transient errors exit immediately with descriptive message
  • Resource cleanup: Temp files removed in all exit paths
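The bullets above can be condensed into a minimal shape of the wrapper. This is a sketch under stated assumptions: `railway_create_environment` and `railway_environment_exists` stand in for the real GraphQL helpers, the "already exists" error-string match is illustrative, and the backoff approximates `sleep_with_backoff_and_jitter 2 "${attempt}" 10` inline.

```shell
# Sketch only — helper names and error matching are assumptions, not the PR's code.
create_preview_environment() {
  local env_name="$1" max_attempts=6 attempt err_file delay
  err_file="$(mktemp)"
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if railway_create_environment "${env_name}" 2>"${err_file}"; then
      rm -f "${err_file}"; return 0
    fi
    # Idempotency-aware: "create errored but the env now exists" counts as success
    if railway_environment_exists "${env_name}"; then
      rm -f "${err_file}"; return 0
    fi
    if grep -qi 'already exists' "${err_file}"; then
      # Name-reuse race after delete: back off (2s base, 10s cap, small jitter) and retry
      delay=$(( 2 ** attempt )); (( delay > 10 )) && delay=10
      sleep $(( delay + RANDOM % 3 ))
    else
      # Non-transient failure: surface the captured error and fail fast
      cat "${err_file}" >&2
      rm -f "${err_file}"; return 1
    fi
  done
  rm -f "${err_file}"
  echo "create_preview_environment: gave up after ${max_attempts} attempts" >&2
  return 1
}
```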

🕐 Pending Recommendations (1)

  • 💭 bootstrap-preview-auth.sh:112 Add logging when Railway deployment gate is bypassed (from the inline comment above)

✅ APPROVE

Summary: The delta is a well-crafted reliability improvement for the recreate_preview path. The new retry wrapper correctly handles Railway API eventual consistency after environment deletion. The implementation follows established patterns in this codebase (exponential backoff, structured logging, proper error propagation). Ship it! 🚢

Reviewers (2)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-devops | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-errors | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — both reviewers found the implementation to be solid with no issues. The retry pattern, error handling, and logging are all appropriate for the use case.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 1, 2026
@claude claude Bot left a comment

PR Review Summary

(0) Total Issues | Risk: Low

This is a delta review scoped to the 1 commit since the last automated review (99802da8f).

✅ Delta Changes Look Good

The delta contains only cosmetic job name clarifications with no behavior changes:

| Job | Before | After |
| --- | --- | --- |
| preview-disabled | Preview Flow Disabled | Preview Disabled Notice (runs only when previews are off) |
| teardown-tier1 | Teardown Tier 1 (Railway) | Teardown Tier 1 (Railway, PR close only) |
| teardown-vercel | Teardown Vercel Preview Env Vars | Teardown Vercel Preview Env Vars (PR close only) |

These changes improve documentation/readability by clarifying:

  1. When the preview-disabled job runs
  2. That teardown jobs only run on PR close events

No functional changes. No issues identified.

🕐 Pending Recommendations (1)

  • 💭 bootstrap-preview-auth.sh:112 Add logging when Railway deployment gate is bypassed (from prior review — still unaddressed, optional polish)

✅ APPROVE

Summary: The delta since the last review is trivial — just three job name clarifications that improve CI log readability. All substantive issues from prior reviews have been addressed. The one remaining item (logging when the deployment gate is bypassed) is optional polish and not blocking. Ship it! 🚀

Reviewers (0)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — no reviewer dispatches needed. Changes are purely cosmetic (job name clarifications).

@vnv-varun vnv-varun added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 0016fc4 Apr 1, 2026
28 checks passed
@vnv-varun vnv-varun deleted the varun/preview-state-hardening branch April 1, 2026 21:02
dimaMachina pushed a commit that referenced this pull request Apr 2, 2026
* ci: add preview state janitor and recreate path

* ci: simplify preview janitor and var resolution

* ci: address preview janitor review feedback

* ci: gate preview auth on SpiceDB deployment readiness

* ci: tighten preview bootstrap retry budget

* ci: retry preview recreate after Railway delete

* ci: clarify skipped preview workflow jobs
github-merge-queue Bot pushed a commit that referenced this pull request Apr 2, 2026
* chore(dashboard): dockerize visual regression tests for cross-OS consistency

Run Playwright browser inside a Docker container so visual screenshot
tests produce identical results on macOS (local dev) and Linux (CI).

- Add docker-compose.visual.yml with Playwright server container
- Update vitest config to connect to Docker browser via websocket
  when PW_TEST_CONNECT_WS_ENDPOINT env var is set
- Add test:visual and test:visual:update npm scripts
- Update CI workflow to use Docker Playwright server instead of
  bare Playwright install
- Regenerate screenshot baselines from Linux container

Closes PRD-6191

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address PR review feedback

- Add explicit failure handling if Playwright server doesn't start
- Bind Docker port to 127.0.0.1 only (don't expose to network)
- Align npx playwright version with Docker image (both 1.58.0)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: restore Playwright install step in CI

The @vitest/browser-playwright package requires a local Playwright
install to initialize, even when the actual browser runs in Docker
via connectOptions. Keep the install step alongside the Docker server.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(dashboard): pass PW_TEST_CONNECT_WS_ENDPOINT through turbo strict mode

Turbo v2 strict mode filters env vars not listed in turbo.json from
child processes. The Playwright WebSocket endpoint was being silently
dropped, causing vitest to fall back to local Chromium instead of the
Docker server — producing mismatched screenshots in CI.

Also pins docker-compose.visual.yml to linux/amd64 so local baselines
match CI regardless of host architecture (see
microsoft/playwright#13873), and fixes the
Playwright cache restore-keys prefix mismatch.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: add visual regression test workflow to AGENTS.md

* fix(dashboard): resolve Monaco strict locator violation in nested error state visual test

* fix(dashboard): fix Monaco strict locator violation with data-testid and stable render wait

* Refactor vitest.config.ts by removing unused code

* Update vitest.config.ts

* fix(dashboard): restore onUnhandledError handler for Monaco browser tests

The previous refactor removed the onUnhandledError handler, causing CI
to fail with exit code 1 due to 3 known, unfixable Monaco Editor errors
in Vitest browser mode:

1. "Cannot use import statement outside a module" - Monaco web workers
   cannot load ESM in the Vitest browser sandbox
2. "InvalidCharacterError" / "is not a valid name" - Monaco attempts
   createElement with an SVG data URI as the tag name
3. "Closing rpc while" - Vitest worker RPC shutdown race condition

These errors were originally suppressed by Nick in #2046 and #2078 after
investigation confirmed they are unfixable Monaco/Vitest internals that
do not affect test correctness.

Refs: #2046, #2078

* fix(dashboard): remove unused pixelmatch devDependency

The pixelmatch package is no longer imported after the vitest.config.ts
refactor removed the custom tolerantPixelmatch comparator. Knip
correctly flags it as unused.

* fix

* upd

* upd

* format

* lock

* rm

* fix

* Create fluffy-gorillas-joke.md

* Apply suggestion from @claude[bot]

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* fix(@inkeep/agents-work-apps): mark `@slack/socket-mode` as `dependency` (#2951)

* upd

* upd

* Apply suggestion from @dimaMachina

* Apply suggestion from @dimaMachina

* Create breezy-lemons-dream.md

* Document MCP header forwarding in Visual Builder docs (#2956)

* docs: add MCP header forwarding and fix header key casing examples

* updated warnings in headers docs

* updated warnings in mcp servers docs

* In product copilot tutorial (#2957)

* docs build updated api reference

* tutorial done

* implements pnpm minimumReleaseAge and upgrades pnpm to 10.33.0 (#2958)

* implements pnpm minimumReleaseAge

* upgrades pnpm to 10.16.0

* upgrades pnpm to 10.33.0

* ci: surface stable preview URLs in PRs (#2799)

* ci: surface stable preview urls in PRs

* fix: add temp file cleanup trap and paginate comment search

- Add EXIT trap to clean up mktemp file
- Paginate through all PR comments when searching for the existing
  marker comment, fixing duplicate-comment risk on PRs with 100+ comments

Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* ci: fix preview URL comment updates

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
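The pagination fix described in the #2799 commits above — scanning all PR comments for the existing marker before deciding to create or update — could look roughly like this. `find_marker_comment_id` and the marker string are illustrative, not the PR's actual script.

```shell
# Sketch only: gh api --paginate walks every page of issue comments,
# so the marker comment is found even on PRs with 100+ comments.
find_marker_comment_id() {
  local repo="$1" pr="$2" marker="$3"
  gh api --paginate "repos/${repo}/issues/${pr}/comments" \
    --jq ".[] | select(.body | contains(\"${marker}\")) | .id" | head -n 1
}
```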

* feat: S3 presigned URLs for private media delivery (#2887)

* feat: add S3 presigned URL support for private media delivery

- Add optional getPresignedUrl() to BlobStorageProvider interface
- Implement in S3BlobStorageProvider using @aws-sdk/s3-request-presigner
- Make resolveMessageBlobUris() async with presigned URL first, manage
  proxy fallback for non-S3 backends (Option D hybrid)
- Update both call sites (run + manage conversation routes) with await
- Add presigned URL tests to s3-provider and resolve-blob-uris test suites
- Include full spec with evidence files

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add S3 blob storage deployment guide

Covers S3 setup, IAM permissions, env vars, S3-compatible services,
storage backend priority, and presigned URL delivery flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: add changeset for S3 presigned URL support

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: add error handling for presigned URL failures with proxy fallback

- Wrap getPresignedUrl() in try-catch so failures fall through to manage
  proxy URL instead of crashing the entire conversation response
- Add test for presigned URL failure → proxy fallback path
- Add mixed-content test with presigned URLs active
- Fix doc icon quoting convention

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: correct callout type and remove inaccurate configurable claim in docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: make presigned URL expiry configurable via `BLOB_STORAGE_PRESIGNED_URL_EXPIRY_SECONDS`

- Add `BLOB_STORAGE_PRESIGNED_URL_EXPIRY_SECONDS` to env.ts Zod schema
  (default 7200s / 2 hours, range 60–604800)
- Replace hardcoded `DEFAULT_PRESIGNED_EXPIRY_SECONDS` constant in
  s3-provider.ts with env var lookup
- Update tests to use env var in mocks and verify new default
- Add env var to .env.example files and deployment docs

* fix: address PR review comments

- Clarify Vercel Blob is also a valid production backend (serves via proxy)
- Add per-service S3-compatible path-style guidance (R2 vs B2 vs Spaces)
- Make first resolve-blob-uris test explicitly set its mock (test isolation)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: reset lockfile from main to minimize resolution drift

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: pin s3-request-presigner to match client-s3 to prevent lockfile drift

Pin @aws-sdk/s3-request-presigner to 3.995.0 (same as resolved client-s3)
to minimize pnpm-lock.yaml changes and prevent react version mismatch
that caused agents-email test failures in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Resolve blob storage provider once per message list instead of per message

Addresses PR feedback: call getBlobStorageProvider() once in
resolveMessagesListBlobUris and pass the provider through to
resolveMessageBlobUris, avoiding N singleton lookups per conversation
retrieval.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: reset lockfile from main and reinstall to fix CI

Reset pnpm-lock.yaml from main per repo guidelines to prevent
resolution drift that was causing monaco-editor ESM import failures
in agents-manage-ui tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address remaining review suggestions: list-level presigned URL test and AWS CLI prerequisite

- Add test for resolveMessagesListBlobUris with presigned URLs active
  across multiple messages, covering Promise.all handling
- Add AWS CLI prerequisite note to S3 setup docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Remove AWS CLI prerequisite note from S3 docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
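The presigned-first, proxy-fallback delivery described in the #2887 commits above can be sketched in shell using the AWS CLI's `s3 presign`. `resolve_blob_url` and `MANAGE_PROXY_BASE` are illustrative names; the actual implementation is TypeScript using @aws-sdk/s3-request-presigner, and it range-validates the expiry rather than clamping it.

```shell
# Sketch only: presigned URL first, manage-proxy fallback on failure,
# with the documented expiry default (7200s) kept inside 60–604800.
resolve_blob_url() {
  local key="$1"
  local expiry="${BLOB_STORAGE_PRESIGNED_URL_EXPIRY_SECONDS:-7200}"
  (( expiry < 60 )) && expiry=60
  (( expiry > 604800 )) && expiry=604800

  local url
  if url="$(aws s3 presign "s3://${key}" --expires-in "${expiry}" 2>/dev/null)"; then
    printf '%s\n' "${url}"
  else
    # Fall through to the manage proxy instead of failing the whole response
    printf '%s/blobs/%s\n' "${MANAGE_PROXY_BASE:-https://manage.example.com}" "${key}"
  fi
}
```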

* ci: add preview janitor and recreate control (#2930)

* ci: add preview state janitor and recreate path

* ci: simplify preview janitor and var resolution

* ci: address preview janitor review feedback

* ci: gate preview auth on SpiceDB deployment readiness

* ci: tighten preview bootstrap retry budget

* ci: retry preview recreate after Railway delete

* ci: clarify skipped preview workflow jobs

* Rename headers in schema and usage to hyphen format (#2962)

* Version Packages (#2952)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>

* Update actions/setup-node and pnpm/action-setup to Node.js 24 versions (#2963)

Upgrade actions/setup-node from v4 to v6.3.0 and pnpm/action-setup from
v4 to v5.0.0 across all workflows to resolve the Node.js 20 deprecation
warning. Node.js 20 actions will be forced to Node.js 24 starting June 2, 2026.

https://claude.ai/code/session_01D5Ah1eAYvZCS2SfZ5Lopi3

Co-authored-by: Claude <[email protected]>

* credential id reference added (#2967)

* Add minimumReleaseAgeExclude for @inkeep/* packages (#2968)

Excludes @inkeep scoped packages from the 1440-minute minimum release age
gate so our own published packages can be installed immediately after release.

https://claude.ai/code/session_01LBEnpfsjj6r4cdwXs2VeHx

Co-authored-by: Claude <[email protected]>

* add TooltipProvider

* add back timeout

* should fix tests

* polish

* fix

* pnpm i

* upd

---------

Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: bryan-inkeep <[email protected]>
Co-authored-by: Gaurav Varma <[email protected]>
Co-authored-by: robert-inkeep <[email protected]>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
3 participants