
ci: add preview janitor and recreate control#2930

Merged
vnv-varun merged 7 commits into main from varun/preview-state-hardening
Apr 1, 2026

Conversation

@vnv-varun
Contributor

@vnv-varun vnv-varun commented Mar 31, 2026

Summary

Harden preview environment lifecycle management so stale preview state is repaired automatically, individual PR previews can be force-recreated when they drift, and preview auth bootstrap fails faster and more transparently when SpiceDB is not actually ready yet.

Changes

  • add a scheduled/manual Preview Janitor workflow that cleans stale Railway pr-* environments and can also run stale Vercel preview env-var cleanup
  • add a Railway janitor script that enumerates pr-* environments, maps them back to GitHub PR state, deletes merged/closed PR envs, and verifies deletion
  • add a manual recreate_preview workflow input so a PR can force teardown and reprovision of its Railway preview environment when reused preview state has drifted
  • harden recreate_preview against Railway's delete/create name-reuse race by retrying create when Railway still reports the just-deleted env name as existing
  • harden teardown and cleanup helpers for missing/disappearing environments, repo-scoped GitHub queries, deletion failure accounting, and clearer exit behavior
  • add structured preview phase logging and wait-loop logging across provisioning, bootstrap, and teardown so future timeouts are diagnosable from CI logs
  • gate preview auth bootstrap on the Railway spicedb service deployment being ready, not just env existence / proxy creation
  • keep the existing SpiceDB schema probe as the final functional readiness check, but bound the retry budget so failures surface in minutes rather than dragging toward the full job timeout

Why

We were still seeing three classes of preview-state problems:

  1. Reused pr-* environments could drift and require manual cleanup or full reprovision.
  2. Close-event teardown was still event-driven, so if it failed once, stale preview environments remained indefinitely.
  3. Bootstrap Preview Auth could still race SpiceDB startup even after Railway provisioning had succeeded, leading to UNAVAILABLE, ECONNRESET, and Failed to connect failures during the schema probe.

This PR addresses all three.

  • recreate_preview gives a first-class recovery path for one PR.
  • the janitor gives a system-wide repair path for missed teardown events.
  • the new SpiceDB deployment gate fixes the remaining readiness race at the right layer instead of only piling on more client retries.

It also closes one Railway-specific edge case in the recreate flow itself: after deleting a preview environment, Railway can briefly reject a fresh create with "An environment with that name already exists" even while the environment lookup returns absent. The recreate path now treats that as a bounded retry case instead of failing the run.
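
The bounded-retry shape described above can be sketched as a generic helper. This is an illustrative stand-in, not the actual recreate-path code: retry_with_backoff and its simple linear backoff are hypothetical, and the real logic in provision-railway.sh also re-checks environment absence between attempts.

```shell
# Illustrative sketch only; the real recreate path lives in
# provision-railway.sh and differs in detail.
retry_with_backoff() {
  local max_attempts="$1"; shift   # remaining args: command to retry
  local attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if "$@"; then
      return 0                     # command succeeded
    fi
    echo "attempt ${attempt}/${max_attempts} failed; backing off" >&2
    sleep "${attempt}"             # grows per attempt; real code also adds jitter
    attempt=$((attempt + 1))
  done
  return 1                         # retry budget exhausted
}

# Usage (create_env is a hypothetical command that can transiently fail):
# retry_with_backoff 6 create_env "pr-1234"
```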

Confirmed Stale Railway Envs

Before cleanup, the Railway project still had these stale preview environments for merged PRs:

  • pr-2901
  • pr-2902
  • pr-2906
  • pr-2907
  • pr-2908
  • pr-2909

I confirmed each of those PRs is closed/merged in GitHub.

After the live cleanup, a follow-up janitor dry-run reported 19 Railway preview environments, all tied to open PRs, with 0 stale environments left to delete.

What Caused Those Misses

Those stale environments all came from a transition window before the earlier preview hardening had landed on main:

  • #2900 merged at 2026-03-30T21:31:52Z
  • #2921 merged at 2026-03-31T02:14:47Z

All six stale PRs above were merged before those fixes were on main. Their close-event Preview Environments workflow did fire, but the run failed in the old Teardown Tier 1 (Railway) path immediately after merge, which left the Railway env behind.

This PR accounts for that class of miss in two ways:

  • a hardier teardown path with better logs
  • a scheduled/manual janitor that repairs missed teardown events after the fact

SpiceDB Readiness Fix

The remaining preview flake we were still seeing was not Railway provisioning itself. Recent failures on other PRs showed this pattern:

  • Railway env creation passed
  • DB migrations passed
  • then wait-for-spicedb.ts timed out with repeated UNAVAILABLE, ECONNRESET, or Failed to connect

The problem was that env existence and TCP proxy readiness were not strong enough readiness signals. SpiceDB could still be deploying even though the proxy existed.

This PR now:

  • waits for the Railway spicedb service instance to report a ready deployment state first
  • then runs the existing schema-readiness probe with a bounded preview-only retry budget
  • reduces the chance of long silent waits by flattening the retry model and reducing the bootstrap timeout window
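
The gate-then-probe sequencing above can be sketched as a polling loop. This is a hedged approximation: the real helper, railway_wait_for_service_deployment_ready in common.sh, queries Railway's GraphQL API, whereas fetch_status here is a hypothetical stand-in.

```shell
# Sketch of the readiness-gate shape; fetch_status is a stand-in for the
# actual Railway deployment-status query.
wait_for_deployment_ready() {
  local max_polls="$1" interval="$2" poll=1 status
  while [ "${poll}" -le "${max_polls}" ]; do
    status="$(fetch_status)"
    case "${status}" in
      SUCCESS|SLEEPING)                 # ready: proceed to the schema probe
        return 0 ;;
      FAILED|CRASHED|REMOVED|SKIPPED)   # terminal: fail fast instead of polling on
        echo "deployment ended in terminal state ${status}" >&2
        return 1 ;;
      *)                                # in-progress (BUILDING, DEPLOYING, ...)
        ;;
    esac
    sleep "${interval}"
    poll=$((poll + 1))
  done
  echo "timed out waiting for deployment readiness" >&2
  return 1
}
```

Treating FAILED/CRASHED-style states as immediate failures is what bounds the wait: a broken deployment surfaces in one poll rather than consuming the whole retry budget.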

What We Verified

  • bash -n passes for the changed preview shell scripts
  • workflow YAML parses successfully for both preview workflows
  • git diff --check passes
  • the new Railway janitor script ran successfully against the live agents Railway project in dry-run mode before cleanup
  • the same Railway janitor script was then run live and deleted stale merged-PR environments that had been left behind
  • the new Railway SpiceDB deployment gate was exercised directly against a live preview env and returned SUCCESS
  • the Railway recreate name-reuse race was reproduced live against a disposable pr-* env and the patched recreate logic recovered successfully
  • the latest full PR run passed end to end, including:
    • Provision Tier 1 (Railway)
    • Bootstrap Preview Auth
    • Inject Vercel Preview Env Vars
    • Smoke Test Preview URLs
  • the latest manual workflow_dispatch run with recreate_preview=true also passed end to end on the updated head

Test Plan

  • bash -n .github/scripts/preview/*.sh for the changed preview scripts
  • parse .github/workflows/preview-environments.yml
  • parse .github/workflows/preview-janitor.yml
  • git diff --check
  • live dry-run of cleanup-stale-railway-envs.sh against the Railway project
  • live cleanup of confirmed stale merged-PR Railway preview environments
  • live verification of the Railway SpiceDB deployment gate against a healthy preview env
  • live reproduction of the Railway recreate name-reuse race plus recovery validation on a disposable pr-* env
  • end-to-end passing Preview Environments run on this PR after the final SpiceDB readiness changes
  • confirm the latest workflow_dispatch run with recreate_preview=true completes on the updated head
  • run the new Preview Janitor workflow in GitHub after merge

Notes

  • pnpm check is still failing in this workspace because @inkeep/agents-api test boot requires INKEEP_AGENTS_MANAGE_DATABASE_URL and INKEEP_AGENTS_RUN_DATABASE_URL to be set. That failure is repo/test-environment baseline and unrelated to the preview-shell/workflow changes in this PR.
  • Scheduled janitor runs intentionally perform real cleanup. Manual dispatch defaults to dry-run.

@vercel

vercel Bot commented Mar 31, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Ready | Preview, Comment | Apr 1, 2026 7:12pm |
| agents-docs | Ready | Preview, Comment | Apr 1, 2026 7:12pm |
| agents-manage-ui | Ready | Preview, Comment | Apr 1, 2026 7:12pm |


@changeset-bot

changeset-bot Bot commented Mar 31, 2026

⚠️ No Changeset found

Latest commit: 3654e2d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pullfrog
Contributor

pullfrog Bot commented Mar 31, 2026

TL;DR — Adds a scheduled/manual "Preview Janitor" workflow that deletes stale Railway and Vercel preview environments left behind by missed close events, adds a recreate_preview dispatch input so a single PR can force a full teardown-and-reprovision, and hardens the entire preview lifecycle with SpiceDB deployment gating, transient-error retries, GraphQL error checking, environment-create retry logic, and structured logging.

Key changes

  • Add cleanup-stale-railway-envs.sh janitor script — enumerates all pr-* Railway environments, maps each to GitHub PR state, and deletes environments for closed/merged PRs with post-delete verification and per-deletion failure tracking.
  • Add preview-janitor.yml workflow — runs every 6 hours on a cron schedule (and manually) to invoke the Railway janitor and the Vercel stale-env cleanup, with dry-run default for manual dispatch.
  • Add recreate_preview workflow dispatch input — allows force-deleting and reprovisioning a PR's Railway environment when reused state has drifted.
  • Gate preview auth on SpiceDB deployment readiness — before probing gRPC, bootstrap-preview-auth.sh now polls Railway's deployment API until the SpiceDB service reaches SUCCESS or SLEEPING, preventing probes against a service that hasn't finished deploying.
  • Retry transient SpiceDB transport failures during bootstrap — wraps db:auth:init with run_with_transient_spicedb_retry, matching a curated pattern of transport error strings (UNAVAILABLE, ECONNRESET, ETIMEDOUT, etc.) with exponential backoff and jitter.
  • Add create_preview_environment retry wrapper — retries Railway environment creation up to 6 attempts with backoff, handling the race where Railway reports "already exists" after a delete by waiting for absence before retrying.
  • Add railway_require_graphql_success to mutation wrappers — railway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert now validate GraphQL responses and propagate failures instead of silently succeeding.
  • Expand common.sh shared helper library — adds ~15 new functions including railway_service_instance_json, railway_wait_for_service_deployment_ready, railway_ensure_tcp_proxy, vercel_list_preview_only_env_vars, redact_preview_logs, sleep_with_backoff_and_jitter, runtime_var_is_unresolved, and preview_should_log_wait_attempt.
  • Add preview_log structured logging — introduces a shared preview_log helper with UTC timestamps and adds phase-level log lines across provisioning, bootstrap, teardown, and wait loops.
  • Harden teardown for idempotency — teardown-railway.sh now checks environment existence before attempting teardown, handles mid-teardown disappearance gracefully, and verifies deletion with railway_wait_for_environment_absent.
  • Clarify workflow job names for skipped/conditional jobs — renames Preview Flow Disabled, Teardown Tier 1, and Teardown Vercel Preview Env Vars job names to indicate when each job actually runs (e.g. "PR close only", "runs only when previews are off").

Summary | 8 files | 7 commits | base: main ← varun/preview-state-hardening


Scheduled janitor for stale preview environments

Before: If a PR close event's teardown failed, the Railway pr-* environment was left behind indefinitely — cleanup was purely event-driven with no fallback.
After: A new cleanup-stale-railway-envs.sh script runs on a 6-hour cron (and manually) via preview-janitor.yml, cross-referencing every pr-* Railway environment against GitHub PR state and deleting environments for closed/merged PRs.

The janitor supports dry-run mode (default for manual dispatch), an opt-in flag to delete orphaned environments whose PR returns 404 (DELETE_UNKNOWN_PREVIEW_ENVS), and writes a summary table to GITHUB_STEP_SUMMARY with counts of kept, deleted, unknown, and errored environments. Individual deletion failures are tracked separately so a single failed delete does not abort the entire sweep. A companion Vercel job in the same workflow cleans branch-scoped preview env vars for stale branches.

How does the janitor decide what to delete?

It lists all Railway environments matching ^pr-[0-9]+$, extracts the PR number, queries the GitHub API for PR state, and categorizes each as open (keep), closed (delete), or missing (skip unless DELETE_UNKNOWN_PREVIEW_ENVS=true). Deletions are verified with railway_wait_for_environment_absent.
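
That decision logic can be condensed into a small categorization function. This sketch is hypothetical: the function and variable names are illustrative, and the real script additionally queries the GitHub API and verifies deletions, which are omitted here.

```shell
# Hypothetical condensation of the janitor's keep/delete/skip decision.
categorize_preview_env() {
  local env_name="$1" pr_state="$2"   # pr_state: open | closed | missing
  if ! [[ "${env_name}" =~ ^pr-[0-9]+$ ]]; then
    echo skip; return                 # not a preview environment
  fi
  case "${pr_state}" in
    open)   echo keep ;;              # PR still active
    closed) echo delete ;;            # merged/closed PR: stale environment
    *)                                # PR lookup returned 404
      if [ "${DELETE_UNKNOWN_PREVIEW_ENVS:-false}" = "true" ]; then
        echo delete
      else
        echo skip
      fi ;;
  esac
}
```

Keeping "missing" as skip-by-default means a transient GitHub API failure cannot cause the janitor to delete an environment for a PR that is actually open.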

cleanup-stale-railway-envs.sh · cleanup-vercel-stale-env.sh · preview-janitor.yml


Manual recreate_preview for drifted PR environments

Before: A PR with a drifted Railway environment required manual Railway console intervention to reprovision.
After: A recreate_preview boolean input on the workflow_dispatch trigger deletes the existing Railway environment and waits for it to be absent before creating a fresh one from the template.

The flag flows from the workflow input through compute-context outputs to provision-railway.sh via RECREATE_PREVIEW_ENV.

provision-railway.sh · preview-environments.yml


SpiceDB deployment readiness gate

Before: bootstrap-preview-auth.sh immediately probed SpiceDB gRPC readiness after Railway provisioning, sometimes hitting a service that hadn't finished deploying yet.
After: The script first polls Railway's deployment API via railway_wait_for_service_deployment_ready until the SpiceDB service reaches SUCCESS or SLEEPING, then proceeds to the gRPC readiness probe.

The gate is conditional — it only runs when RAILWAY_API_TOKEN, RAILWAY_PROJECT_ID, RAILWAY_SPICEDB_SERVICE, and PR_NUMBER are all set. Terminal deployment states (FAILED, CRASHED, REMOVED, SKIPPED) return failure immediately instead of polling until timeout.

bootstrap-preview-auth.sh · common.sh


Retry-resilient environment creation and transient error handling

Before: A single SpiceDB transport failure during db:auth:init failed the preview run; Railway environment creation was a single-shot call that could fail if a prior delete hadn't fully propagated.
After: create_preview_environment retries up to 6 attempts with exponential backoff, detecting "already exists" race conditions after delete and waiting for absence before retrying. run_with_transient_spicedb_retry retries db:auth:init for transient transport errors. railway_require_graphql_success validates every mutation response.

| Function | Purpose |
| --- | --- |
| create_preview_environment | Retries Railway environment creation with "already exists" race handling |
| run_with_transient_spicedb_retry | Retries commands that fail with transient SpiceDB/gRPC errors |
| railway_require_graphql_success | Asserts GraphQL response has no errors, returns non-zero on failure |
| sleep_with_backoff_and_jitter | Exponential backoff with random jitter via Python3 |
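
A backoff-with-jitter delay of the kind sleep_with_backoff_and_jitter provides could look like the following. This is a hedged sketch: the function name here mirrors the description above, but the actual formula and parameters in common.sh may differ.

```shell
# Sketch of capped exponential backoff with jitter, computed via python3.
backoff_with_jitter_seconds() {
  local attempt="$1" base="${2:-2}" cap="${3:-60}"
  python3 - "${attempt}" "${base}" "${cap}" <<'PY'
import random, sys
attempt, base, cap = int(sys.argv[1]), float(sys.argv[2]), float(sys.argv[3])
delay = min(cap, base * (2 ** (attempt - 1)))  # exponential growth, capped
delay *= 0.5 + random.random() / 2             # keep 50-100% of delay as jitter
print(f"{delay:.2f}")
PY
}

# Usage: sleep "$(backoff_with_jitter_seconds 3)"
```

The jitter matters in CI: it spreads out retries from concurrent preview jobs so they do not hammer the Railway API in lockstep.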

provision-railway.sh · bootstrap-preview-auth.sh · common.sh


Structured logging and teardown hardening

Before: Preview scripts used bare echo with no timestamps; teardown assumed the environment existed and failed on race conditions.
After: A shared preview_log function emits [preview][UTC timestamp] lines to stderr, and teardown-railway.sh checks existence before teardown, handles mid-teardown disappearance, and verifies deletion.

Wait loops now log on attempts 1, every 5th attempt, and the final attempt via preview_should_log_wait_attempt. The resolve_runtime_vars polling loop reports which variables remain unresolved via runtime_var_is_unresolved. A new redact_preview_logs filter strips PostgreSQL connection strings, secrets, auth cookies, and bearer tokens from log output.
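
The logging helpers described above can be sketched minimally. The output format and thresholds here approximate the description rather than copy common.sh exactly.

```shell
# [preview][UTC timestamp] message, sent to stderr so stdout stays parseable.
preview_log() {
  printf '[preview][%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >&2
}

# Log the first attempt, every 5th attempt, and the final attempt, so long
# wait loops stay diagnosable without flooding the CI log.
preview_should_log_wait_attempt() {
  local attempt="$1" max_attempts="$2"
  [ "${attempt}" -eq 1 ] ||
    [ "$((attempt % 5))" -eq 0 ] ||
    [ "${attempt}" -eq "${max_attempts}" ]
}
```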

common.sh · teardown-railway.sh · provision-railway.sh


Clearer workflow job names for conditional steps

Before: Workflow job names like Teardown Tier 1 (Railway) didn't indicate their run conditions, making it unclear why they were skipped in logs.
After: Job names now include their trigger conditions — e.g. Teardown Tier 1 (Railway, PR close only) and Preview Disabled Notice (runs only when previews are off).

preview-environments.yml

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus

Contributor

@pullfrog pullfrog Bot left a comment


Low urgency. Solid operational hardening — the janitor, recreate path, SpiceDB retry, and structured logging all address real gaps. Two minor items below; neither is blocking.

Pullfrog | View workflow run | Using Claude Opus

Comment on lines +238 to +242
runtime_var_is_unresolved() {
  local value="${1:-}"

  [ -z "${value}" ] || printf '%s' "${value}" | grep -q '\$[{][{]'
}
Contributor


Nit (readability): runtime_var_is_unresolved is defined here on line 238 but first referenced inside the resolve_runtime_vars function body starting at line 187. This works because bash evaluates function bodies lazily — resolve_runtime_vars isn't called until line 245 — but the ordering is surprising and fragile. Moving the definition above resolve_runtime_vars() (or into common.sh alongside the other helpers) would make the dependency explicit and prevent a future refactor from breaking call order.

Not blocking — the current code is correct at runtime.

RAILWAY_PROJECT_ID: ${{ vars.RAILWAY_PROJECT_ID }}
GH_TOKEN: ${{ github.token }}
GITHUB_REPOSITORY: ${{ github.repository }}
DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run && 'true' || 'false' }}
Contributor


Confirm intent: On schedule triggers, inputs.dry_run is undefined so this expression evaluates to 'false', meaning scheduled runs execute real deletes. The PR description indicates this is intentional (the janitor is meant to repair missed teardowns automatically). Just flagging for visibility — if you ever want scheduled runs to default to dry-run, this expression would need adjustment.


Copilot AI left a comment


Pull request overview

This PR hardens the CI automation around ephemeral preview environments by adding a scheduled/manual “janitor” cleanup path and a manual “recreate preview” path, plus improved logging and retries to make long-running preview steps more diagnosable and resilient.

Changes:

  • Add a new scheduled/manual Preview Janitor workflow to delete stale Railway pr-* environments and stale Vercel branch-scoped env vars.
  • Add a manual recreate_preview input to force teardown + reprovision of a PR’s Railway preview environment.
  • Improve preview script resilience and diagnosability via structured phase logging, safer teardown, and transient SpiceDB transport retries.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| .github/workflows/preview-janitor.yml | New scheduled/manual janitor workflow for Railway + Vercel stale preview cleanup. |
| .github/workflows/preview-environments.yml | Adds recreate_preview input plumbed into Railway provisioning. |
| .github/scripts/preview/common.sh | Adds preview_log, wait-loop logging helpers, safer JSON edge handling, and env-deletion wait helper. |
| .github/scripts/preview/provision-railway.sh | Adds "recreate existing env" path and more structured provisioning logs. |
| .github/scripts/preview/teardown-railway.sh | Hardens teardown for missing/disappearing envs and uses shared deletion-wait helper. |
| .github/scripts/preview/bootstrap-preview-auth.sh | Retries transient SpiceDB transport failures and adds more phase logging. |
| .github/scripts/preview/cleanup-vercel-stale-env.sh | Ensures PR lookup is repo-scoped when GITHUB_REPOSITORY is set. |
| .github/scripts/preview/cleanup-stale-railway-envs.sh | New Railway janitor script to enumerate pr-* envs, map to PR state, and delete stale envs with verification. |


Comment on lines +23 to +32
status="$(
curl -sS \
--connect-timeout 10 \
--max-time 30 \
-o "${response_file}" \
-w '%{http_code}' \
-H "Authorization: Bearer ${GH_TOKEN}" \
-H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/${GITHUB_REPOSITORY}/pulls/${pr_number}"
)"

Copilot AI Mar 31, 2026


The GitHub PR lookup is hard-coded to https://api.github.com/.... GitHub Actions exposes GITHUB_API_URL (and this also matters for GHES); consider building the endpoint from ${GITHUB_API_URL} (fallbacking to api.github.com) to avoid breaking the janitor in non-public-GitHub contexts.

Contributor

@claude claude Bot left a comment


PR Review Summary

(3) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

Inline Comments:

  • 🟠 Major: cleanup-stale-railway-envs.sh:59 Deletion errors silently ignored — return status not checked
  • 🟠 Major: cleanup-stale-railway-envs.sh:100-105 Deletion failures not tracked in exit code — janitor may exit 0 despite failed cleanups

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: preview-environments.yml:389-390 Unused env vars passed to teardown job

💭 Consider (2) 💭

💭 1) provision-railway.sh:238 Function definition ordering

Issue: runtime_var_is_unresolved() is defined after resolve_runtime_vars() which references it.
Why: While this works correctly in bash (functions are resolved at call time, not definition time), the code structure is confusing. If someone reorders the script, it could break.
Fix: Move runtime_var_is_unresolved() definition to before line 167 for clearer declaration-before-use ordering.

💭 2) preview-janitor.yml:56 Scheduled runs delete without dry-run

Issue: The DRY_RUN expression evaluates to false for scheduled runs, meaning the cron job will perform actual deletions.
Why: This differs from the manual dispatch default (dry_run: true). While this appears intentional for automated cleanup, it could surprise operators.
Fix: Add a comment documenting this intentional behavior, or change to default dry-run for scheduled runs if that's preferred.


💡 APPROVE WITH SUGGESTIONS

Summary: This is a solid infrastructure hardening PR that closes real gaps in preview environment lifecycle management. The janitor workflow and recreate path are well-designed. The two Major findings around error tracking in the janitor script should be addressed to ensure operators can trust the exit code — without this, failed cleanups could silently accumulate. The other items are minor polish.

Discarded (3)

| Location | Issue | Reason Discarded |
| --- | --- | --- |
| provision-railway.sh:238 | Function runtime_var_is_unresolved defined after use causes runtime error | False positive — bash resolves function references at call time, not definition time. The function IS defined before resolve_runtime_vars is called. Reclassified as Consider (code style). |
| common.sh:27 | String equality instead of numeric comparison | Works correctly for numeric strings; purely stylistic preference |
| bootstrap-preview-auth.sh:42 | Retry pattern may miss some transient errors | Low risk — current pattern covers common failure modes; adding more patterns is speculative |
Reviewers (3)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-errors | 5 | 0 | 0 | 0 | 2 | 0 | 3 |
| pr-review-devops | 8 | 0 | 2 | 0 | 1 | 0 | 5 |
| pr-review-standards | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| Total | 15 | 0 | 2 | 0 | 3 | 0 | 10 |

Note: Standards reviewer findings were duplicates of devops findings (function order, dry-run behavior).

preview_log "Deleting stale Railway preview environment ${env_name}."
railway_environment_delete_by_id "${env_id}" >/dev/null
railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}
Contributor


🟠 MAJOR: Deletion errors are silently ignored

Issue: railway_environment_delete_by_id return status is not checked. If deletion fails, the function proceeds to railway_wait_for_environment_absent which will eventually fail with a generic timeout message.

Why: When debugging deletion failures, operators see "Railway environment still exists after waiting for deletion" but won't know why (permission denied, API error, rate limit). The actual Railway API error is discarded.

Fix: Check the return status before proceeding:

delete_env_and_verify() {
  local env_id="$1"
  local env_name="$2"

  preview_log "Deleting stale Railway preview environment ${env_name}."
  if ! railway_environment_delete_by_id "${env_id}"; then
    echo "Failed to delete Railway environment ${env_name}." >&2
    return 1
  fi
  railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}

Comment on lines +100 to +105
preview_log "[dry-run] Would delete stale Railway preview environment ${env_name}."
else
delete_env_and_verify "${env_id}" "${env_name}"
deleted=$((deleted + 1))
deleted_names+=("${env_name}")
fi
Contributor


🟠 MAJOR: Deletion failures not tracked in exit code

Issue: When delete_env_and_verify fails during the janitor loop, the error is not counted. The final exit code (line 153) only reflects GitHub API lookup errors (errors variable), not Railway deletion failures.

Why: The janitor could fail to delete multiple Railway environments due to API issues, but exit with code 0 if all GitHub PR lookups succeeded. Operators checking CI may incorrectly believe all targeted environments were cleaned up.

Fix: Add a deletion_failures counter:

# Add at line 77 with other counters:
deletion_failures=0

# In the delete blocks (lines 100-105 and 113-118):
if ! delete_env_and_verify "${env_id}" "${env_name}"; then
  deletion_failures=$((deletion_failures + 1))
else
  deleted=$((deleted + 1))
  deleted_names+=("${env_name}")
fi

# Update exit check at line 153:
if [ "${errors}" -gt 0 ] || [ "${deletion_failures}" -gt 0 ]; then
  echo "Encountered ${errors} GitHub lookup error(s) and ${deletion_failures} Railway deletion failure(s)." >&2
  exit 1
fi

@github-actions github-actions Bot deleted a comment from claude Bot Mar 31, 2026
Contributor

@claude claude Bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

✅ Prior Feedback Addressed

All issues from the previous review have been resolved in commit 8973aad63:

| Prior Issue | Resolution |
| --- | --- |
| 🟠 Major: Deletion errors silently ignored | ✅ Fixed — delete_env_and_verify now checks return status and returns 1 on failure (L58-61) |
| 🟠 Major: Deletion failures not tracked in exit code | ✅ Fixed — Added deletion_failures counter, tracked in loop, checked at exit (L82, L107-112, L163-165) |
| 🟡 Minor: Unused env vars passed to teardown | ✅ Fixed — RAILWAY_TEMPLATE_ENVIRONMENT and RAILWAY_OUTPUT_SERVICE removed from teardown job |
| 💭 Consider: Function definition ordering | ✅ Fixed — runtime_var_is_unresolved() moved to L121-125, before resolve_runtime_vars() which uses it |
| 💭 Consider: Scheduled runs delete without dry-run (undocumented) | ✅ Fixed — Comment added at L57 documenting intentional behavior |
| 🔵 Copilot: Hard-coded GitHub API URL | ✅ Fixed — GITHUB_API_URL now configurable with fallback (L16, L32) and passed from workflow (L55) |

Additional Improvements in Delta

The delta also includes GraphQL error handling improvements in common.sh:

  • railway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert now properly check for GraphQL errors and return failure status
  • This hardens the entire Railway mutation surface against silent GraphQL-level failures

✅ APPROVE

Summary: All prior review feedback has been addressed comprehensively. The janitor now properly tracks and reports deletion failures, function ordering is cleaner, and the behavior of scheduled vs manual runs is documented. The additional GraphQL error handling improvements further strengthen the reliability of the preview infrastructure. Ship it! 🚀

Reviewers (0)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — no new reviewer dispatches needed. All prior issues resolved.

Contributor

@claude claude Bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

This is a delta review scoped to the 2 commits since the last automated review (3b0164a67, 64b34fd64).

✅ Delta Changes Look Good

The delta adds a Railway SpiceDB deployment readiness gate before probing SpiceDB gRPC readiness. Key observations:

| Change | Assessment |
| --- | --- |
| railway_wait_for_service_deployment_ready() | Well-implemented with proper terminal state handling (FAILED, CRASHED, REMOVED, SKIPPED) |
| Configurable curl timeouts in railway_graphql() | Good improvement for polling queries |
| Timeout reduction (20→10 min) | Tight but adequate (~6.5 min worst case calculated) |
| Removal of retry wrapper from SpiceDB readiness probe | Correct — now redundant with deployment gate |

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: bootstrap-preview-auth.sh:112 Add logging when Railway deployment gate is bypassed

🕐 Pending Recommendations (0)

All issues from the previous review were addressed. No pending items.


✅ APPROVE

Summary: The delta changes are solid infrastructure hardening. The new deployment readiness gate properly sequences the SpiceDB gRPC probe after Railway reports the deployment is ready, which addresses the root cause of transient bootstrap failures. Terminal deployment states are handled correctly with descriptive error messages. Ship it! 🚀

Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-devops | 3 | 0 | 0 | 0 | 0 | 0 | 3 |
| pr-review-errors | 3 | 0 | 1 | 0 | 1 | 0 | 1 |
| Total | 6 | 0 | 1 | 0 | 1 | 0 | 4 |

Note: Delta review — reviewer output was lighter given limited scope. Devops findings were informational (timeout math, cosmetic logging). Error reviewer's single-attempt query concern was duplicate of devops finding. Unhandled deployment status finding was overly defensive (current behavior is reasonable).

Discarded (4)

| Location | Issue | Reason Discarded |
| --- | --- | --- |
| preview-environments.yml:230 | Timeout reduced from 20 to 10 minutes | Valid observation but math works out (~6.5 min worst case fits in 10 min budget). Intentional budget tightening. |
| common.sh:417 | Single-attempt GraphQL query in polling loop | Outer polling loop provides resilience; single-attempt per poll iteration is intentional for efficiency |
| common.sh:478 | Unhandled deployment statuses continue polling | This is correct behavior for in-progress states (BUILDING, DEPLOYING, etc.). Current fallthrough is intentional. |
| bootstrap-preview-auth.sh:129 | Step summary logs gate even when skipped | Cosmetic; shows intended service name regardless of gate execution |

15 \
4
echo "::endgroup::"
fi
Contributor


💭 Consider: Add logging when Railway deployment gate is bypassed

Issue: When Railway env vars (RAILWAY_API_TOKEN, RAILWAY_PROJECT_ID, RAILWAY_SPICEDB_SERVICE, PR_NUMBER) are not all set, the deployment gate block (lines 99-112) is silently skipped with no log output.

Why: If the SpiceDB readiness probe fails downstream, operators reviewing CI logs won't see why the deployment gate was bypassed, making it harder to diagnose whether the issue is missing config vs. actual deployment problems.

Fix: Add an else branch to clarify when the gate is intentionally skipped:

else
  preview_log "Skipping Railway deployment gate (Railway env vars not fully configured)."
fi

This is optional since the gate is designed to be conditional, but explicit logging aids future debugging.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 1, 2026
@claude claude Bot left a comment

PR Review Summary

(0) Total Issues | Risk: Low

This is a delta review scoped to the 1 commit since the last automated review (99802da8f).

✅ Delta Changes Look Good

The delta adds a create_preview_environment() retry wrapper that hardens Railway environment creation against API race conditions after delete. Key observations:

| Change | Assessment |
| --- | --- |
| create_preview_environment() retry wrapper | Well implemented, with up to 6 attempts and exponential backoff |
| "Already exists" race condition handling | Correctly checks whether the env actually exists after a create failure; handled gracefully |
| Error capture via temp file | Proper cleanup in all code paths; clear error messages to stderr |
| railway_wait_for_environment_absent with \|\| true | Intentional — the wait is best-effort before retry, not a hard requirement |
| Variable rename status → http_status | Cosmetic clarity improvement; no behavior change |

Implementation Quality

The new function demonstrates solid defensive programming:

  • Idempotency-aware: Treats "create returned error but env now exists" as success
  • Clear logging: Each attempt logged with preview_log, race condition retries explained
  • Proper backoff: Uses sleep_with_backoff_and_jitter 2 "${attempt}" 10 (2s base, 10s cap)
  • Fail-fast on real errors: Non-transient errors exit immediately with descriptive message
  • Resource cleanup: Temp files removed in all exit paths
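The bullets above can be condensed into a minimal shape of the wrapper. This is a sketch under stated assumptions: `railway_create_environment` and `railway_environment_exists` stand in for the real GraphQL helpers, the "already exists" error-string match is illustrative, and the backoff approximates `sleep_with_backoff_and_jitter 2 "${attempt}" 10` inline.

```shell
# Sketch only — helper names and error matching are assumptions, not the PR's code.
create_preview_environment() {
  local env_name="$1" max_attempts=6 attempt err_file delay
  err_file="$(mktemp)"
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if railway_create_environment "${env_name}" 2>"${err_file}"; then
      rm -f "${err_file}"; return 0
    fi
    # Idempotency-aware: "create errored but the env now exists" counts as success
    if railway_environment_exists "${env_name}"; then
      rm -f "${err_file}"; return 0
    fi
    if grep -qi 'already exists' "${err_file}"; then
      # Name-reuse race after delete: back off (2s base, 10s cap, small jitter) and retry
      delay=$(( 2 ** attempt )); (( delay > 10 )) && delay=10
      sleep $(( delay + RANDOM % 3 ))
    else
      # Non-transient failure: surface the captured error and fail fast
      cat "${err_file}" >&2
      rm -f "${err_file}"; return 1
    fi
  done
  rm -f "${err_file}"
  echo "create_preview_environment: gave up after ${max_attempts} attempts" >&2
  return 1
}
```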

🕐 Pending Recommendations (1)

  • 💭 bootstrap-preview-auth.sh:112 Add logging when Railway deployment gate is bypassed (from the inline comment above)

✅ APPROVE

Summary: The delta is a well-crafted reliability improvement for the recreate_preview path. The new retry wrapper correctly handles Railway API eventual consistency after environment deletion. The implementation follows established patterns in this codebase (exponential backoff, structured logging, proper error propagation). Ship it! 🚢

Reviewers (2)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-devops | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-errors | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — both reviewers found the implementation to be solid with no issues. The retry pattern, error handling, and logging are all appropriate for the use case.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 1, 2026
@claude claude Bot left a comment

PR Review Summary

(0) Total Issues | Risk: Low

This is a delta review scoped to the 1 commit since the last automated review (99802da8f).

✅ Delta Changes Look Good

The delta contains only cosmetic job name clarifications with no behavior changes:

| Job | Before | After |
| --- | --- | --- |
| preview-disabled | Preview Flow Disabled | Preview Disabled Notice (runs only when previews are off) |
| teardown-tier1 | Teardown Tier 1 (Railway) | Teardown Tier 1 (Railway, PR close only) |
| teardown-vercel | Teardown Vercel Preview Env Vars | Teardown Vercel Preview Env Vars (PR close only) |

These changes improve documentation/readability by clarifying:

  1. When the preview-disabled job runs
  2. That teardown jobs only run on PR close events

No functional changes. No issues identified.

🕐 Pending Recommendations (1)

  • 💭 bootstrap-preview-auth.sh:112 Add logging when Railway deployment gate is bypassed (from prior review — still unaddressed, optional polish)

✅ APPROVE

Summary: The delta since the last review is trivial — just three job name clarifications that improve CI log readability. All substantive issues from prior reviews have been addressed. The one remaining item (logging when the deployment gate is bypassed) is optional polish and not blocking. Ship it! 🚀

Reviewers (0)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — no reviewer dispatches needed. Changes are purely cosmetic (job name clarifications).

@vnv-varun vnv-varun added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 0016fc4 Apr 1, 2026
28 checks passed
@vnv-varun vnv-varun deleted the varun/preview-state-hardening branch April 1, 2026 21:02
dimaMachina pushed a commit that referenced this pull request Apr 2, 2026
* ci: add preview state janitor and recreate path

* ci: simplify preview janitor and var resolution

* ci: address preview janitor review feedback

* ci: gate preview auth on SpiceDB deployment readiness

* ci: tighten preview bootstrap retry budget

* ci: retry preview recreate after Railway delete

* ci: clarify skipped preview workflow jobs
github-merge-queue Bot pushed a commit that referenced this pull request Apr 2, 2026
* chore(dashboard): dockerize visual regression tests for cross-OS consistency

Run Playwright browser inside a Docker container so visual screenshot
tests produce identical results on macOS (local dev) and Linux (CI).

- Add docker-compose.visual.yml with Playwright server container
- Update vitest config to connect to Docker browser via websocket
  when PW_TEST_CONNECT_WS_ENDPOINT env var is set
- Add test:visual and test:visual:update npm scripts
- Update CI workflow to use Docker Playwright server instead of
  bare Playwright install
- Regenerate screenshot baselines from Linux container

Closes PRD-6191

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address PR review feedback

- Add explicit failure handling if Playwright server doesn't start
- Bind Docker port to 127.0.0.1 only (don't expose to network)
- Align npx playwright version with Docker image (both 1.58.0)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: restore Playwright install step in CI

The @vitest/browser-playwright package requires a local Playwright
install to initialize, even when the actual browser runs in Docker
via connectOptions. Keep the install step alongside the Docker server.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(dashboard): pass PW_TEST_CONNECT_WS_ENDPOINT through turbo strict mode

Turbo v2 strict mode filters env vars not listed in turbo.json from
child processes. The Playwright WebSocket endpoint was being silently
dropped, causing vitest to fall back to local Chromium instead of the
Docker server — producing mismatched screenshots in CI.

Also pins docker-compose.visual.yml to linux/amd64 so local baselines
match CI regardless of host architecture (see
microsoft/playwright#13873), and fixes the
Playwright cache restore-keys prefix mismatch.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: add visual regression test workflow to AGENTS.md

* fix(dashboard): resolve Monaco strict locator violation in nested error state visual test

* fix(dashboard): fix Monaco strict locator violation with data-testid and stable render wait

* Refactor vitest.config.ts by removing unused code

* Update vitest.config.ts

* fix(dashboard): restore onUnhandledError handler for Monaco browser tests

The previous refactor removed the onUnhandledError handler, causing CI
to fail with exit code 1 due to 3 known, unfixable Monaco Editor errors
in Vitest browser mode:

1. "Cannot use import statement outside a module" - Monaco web workers
   cannot load ESM in the Vitest browser sandbox
2. "InvalidCharacterError" / "is not a valid name" - Monaco attempts
   createElement with an SVG data URI as the tag name
3. "Closing rpc while" - Vitest worker RPC shutdown race condition

These errors were originally suppressed by Nick in #2046 and #2078 after
investigation confirmed they are unfixable Monaco/Vitest internals that
do not affect test correctness.

Refs: #2046, #2078

* fix(dashboard): remove unused pixelmatch devDependency

The pixelmatch package is no longer imported after the vitest.config.ts
refactor removed the custom tolerantPixelmatch comparator. Knip
correctly flags it as unused.

* fix

* upd

* upd

* format

* lock

* rm

* fix

* Create fluffy-gorillas-joke.md

* Apply suggestion from @claude[bot]

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* fix(@inkeep/agents-work-apps): mark `@slack/socket-mode` as `dependency` (#2951)

* upd

* upd

* Apply suggestion from @dimaMachina

* Apply suggestion from @dimaMachina

* Create breezy-lemons-dream.md

* Document MCP header forwarding in Visual Builder docs (#2956)

* docs: add MCP header forwarding and fix header key casing examples

* updated warnings in headers docs

* updated warnings in mcp servers docs

* In product copilot tutorial (#2957)

* docs build updated api reference

* tutorial done

* implements pnpm minimumReleaseAge and upgrades pnpm to 10.33.0 (#2958)

* implements pnpm minimumReleaseAge

* upgrades pnpm to 10.16.0

* upgrades pnpm to 10.33.0

* ci: surface stable preview URLs in PRs (#2799)

* ci: surface stable preview urls in PRs

* fix: add temp file cleanup trap and paginate comment search

- Add EXIT trap to clean up mktemp file
- Paginate through all PR comments when searching for the existing
  marker comment, fixing duplicate-comment risk on PRs with 100+ comments

Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* ci: fix preview URL comment updates

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
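The pagination fix described in the #2799 commits above — scanning all PR comments for the existing marker before deciding to create or update — could look roughly like this. `find_marker_comment_id` and the marker string are illustrative, not the PR's actual script.

```shell
# Sketch only: gh api --paginate walks every page of issue comments,
# so the marker comment is found even on PRs with 100+ comments.
find_marker_comment_id() {
  local repo="$1" pr="$2" marker="$3"
  gh api --paginate "repos/${repo}/issues/${pr}/comments" \
    --jq ".[] | select(.body | contains(\"${marker}\")) | .id" | head -n 1
}
```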

* feat: S3 presigned URLs for private media delivery (#2887)

* feat: add S3 presigned URL support for private media delivery

- Add optional getPresignedUrl() to BlobStorageProvider interface
- Implement in S3BlobStorageProvider using @aws-sdk/s3-request-presigner
- Make resolveMessageBlobUris() async with presigned URL first, manage
  proxy fallback for non-S3 backends (Option D hybrid)
- Update both call sites (run + manage conversation routes) with await
- Add presigned URL tests to s3-provider and resolve-blob-uris test suites
- Include full spec with evidence files

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add S3 blob storage deployment guide

Covers S3 setup, IAM permissions, env vars, S3-compatible services,
storage backend priority, and presigned URL delivery flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: add changeset for S3 presigned URL support

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: add error handling for presigned URL failures with proxy fallback

- Wrap getPresignedUrl() in try-catch so failures fall through to manage
  proxy URL instead of crashing the entire conversation response
- Add test for presigned URL failure → proxy fallback path
- Add mixed-content test with presigned URLs active
- Fix doc icon quoting convention

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: correct callout type and remove inaccurate configurable claim in docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: make presigned URL expiry configurable via `BLOB_STORAGE_PRESIGNED_URL_EXPIRY_SECONDS`

- Add `BLOB_STORAGE_PRESIGNED_URL_EXPIRY_SECONDS` to env.ts Zod schema
  (default 7200s / 2 hours, range 60–604800)
- Replace hardcoded `DEFAULT_PRESIGNED_EXPIRY_SECONDS` constant in
  s3-provider.ts with env var lookup
- Update tests to use env var in mocks and verify new default
- Add env var to .env.example files and deployment docs

* fix: address PR review comments

- Clarify Vercel Blob is also a valid production backend (serves via proxy)
- Add per-service S3-compatible path-style guidance (R2 vs B2 vs Spaces)
- Make first resolve-blob-uris test explicitly set its mock (test isolation)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: reset lockfile from main to minimize resolution drift

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: pin s3-request-presigner to match client-s3 to prevent lockfile drift

Pin @aws-sdk/s3-request-presigner to 3.995.0 (same as resolved client-s3)
to minimize pnpm-lock.yaml changes and prevent react version mismatch
that caused agents-email test failures in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Resolve blob storage provider once per message list instead of per message

Addresses PR feedback: call getBlobStorageProvider() once in
resolveMessagesListBlobUris and pass the provider through to
resolveMessageBlobUris, avoiding N singleton lookups per conversation
retrieval.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore: reset lockfile from main and reinstall to fix CI

Reset pnpm-lock.yaml from main per repo guidelines to prevent
resolution drift that was causing monaco-editor ESM import failures
in agents-manage-ui tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address remaining review suggestions: list-level presigned URL test and AWS CLI prerequisite

- Add test for resolveMessagesListBlobUris with presigned URLs active
  across multiple messages, covering Promise.all handling
- Add AWS CLI prerequisite note to S3 setup docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Remove AWS CLI prerequisite note from S3 docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
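The presigned-first, proxy-fallback delivery described in the #2887 commits above can be sketched in shell using the AWS CLI's `s3 presign`. `resolve_blob_url` and `MANAGE_PROXY_BASE` are illustrative names; the actual implementation is TypeScript using @aws-sdk/s3-request-presigner, and it range-validates the expiry rather than clamping it.

```shell
# Sketch only: presigned URL first, manage-proxy fallback on failure,
# with the documented expiry default (7200s) kept inside 60–604800.
resolve_blob_url() {
  local key="$1"
  local expiry="${BLOB_STORAGE_PRESIGNED_URL_EXPIRY_SECONDS:-7200}"
  (( expiry < 60 )) && expiry=60
  (( expiry > 604800 )) && expiry=604800

  local url
  if url="$(aws s3 presign "s3://${key}" --expires-in "${expiry}" 2>/dev/null)"; then
    printf '%s\n' "${url}"
  else
    # Fall through to the manage proxy instead of failing the whole response
    printf '%s/blobs/%s\n' "${MANAGE_PROXY_BASE:-https://manage.example.com}" "${key}"
  fi
}
```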

* ci: add preview janitor and recreate control (#2930)

* ci: add preview state janitor and recreate path

* ci: simplify preview janitor and var resolution

* ci: address preview janitor review feedback

* ci: gate preview auth on SpiceDB deployment readiness

* ci: tighten preview bootstrap retry budget

* ci: retry preview recreate after Railway delete

* ci: clarify skipped preview workflow jobs

* Rename headers in schema and usage to hyphen format (#2962)

* Version Packages (#2952)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>

* Update actions/setup-node and pnpm/action-setup to Node.js 24 versions (#2963)

Upgrade actions/setup-node from v4 to v6.3.0 and pnpm/action-setup from
v4 to v5.0.0 across all workflows to resolve the Node.js 20 deprecation
warning. Node.js 20 actions will be forced to Node.js 24 starting June 2, 2026.

https://claude.ai/code/session_01D5Ah1eAYvZCS2SfZ5Lopi3

Co-authored-by: Claude <[email protected]>

* credential id reference added (#2967)

* Add minimumReleaseAgeExclude for @inkeep/* packages (#2968)

Excludes @inkeep scoped packages from the 1440-minute minimum release age
gate so our own published packages can be installed immediately after release.

https://claude.ai/code/session_01LBEnpfsjj6r4cdwXs2VeHx

Co-authored-by: Claude <[email protected]>

* add TooltipProvider

* add back timeout

* should fix tests

* polish

* fix

* pnpm i

* upd

---------

Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: bryan-inkeep <[email protected]>
Co-authored-by: Gaurav Varma <[email protected]>
Co-authored-by: robert-inkeep <[email protected]>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
3 participants