ci: add preview janitor and recreate control #2930
Conversation
TL;DR — Adds a scheduled/manual "Preview Janitor" workflow that deletes stale Railway and Vercel preview environments left behind by missed close events, and adds a manual `recreate_preview` control to force teardown and reprovision of a PR's preview environment.

Summary | 8 files | 7 commits

Key changes

Scheduled janitor for stale preview environments

The janitor supports dry-run mode (default for manual dispatch) and an opt-in flag to delete orphaned environments whose PR lookup returns 404.
| Function | Purpose |
|---|---|
| `create_preview_environment` | Retries Railway environment creation with "already exists" race handling |
| `run_with_transient_spicedb_retry` | Retries commands that fail with transient SpiceDB/gRPC errors |
| `railway_require_graphql_success` | Asserts a GraphQL response has no errors; returns non-zero on failure |
| `sleep_with_backoff_and_jitter` | Exponential backoff with random jitter via python3 |
`provision-railway.sh` · `bootstrap-preview-auth.sh` · `common.sh`
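As a rough illustration of the backoff helper's contract (the real implementation lives in `common.sh`; this sketch prints the computed delay instead of sleeping, and the 25% jitter factor is an assumption):

```shell
# Hypothetical sketch of sleep_with_backoff_and_jitter: base seconds,
# attempt number, and a cap, with the delay computed via python3.
sleep_with_backoff_and_jitter_sketch() {
  local base="$1" attempt="$2" cap="$3"
  python3 - "$base" "$attempt" "$cap" <<'PY'
import random, sys

base, attempt, cap = float(sys.argv[1]), int(sys.argv[2]), float(sys.argv[3])
# Exponential growth capped at `cap`, plus up to 25% random jitter.
delay = min(cap, base * (2 ** (attempt - 1)))
delay += random.uniform(0, delay * 0.25)
print(f"{delay:.2f}")
PY
}
```

In the real helper the printed value would feed a `sleep`, so concurrent retries desynchronize instead of hammering the API in lockstep.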
Structured logging and teardown hardening
Before: Preview scripts used bare `echo` with no timestamps; teardown assumed the environment existed and failed on race conditions.

After: A shared `preview_log` function emits `[preview][UTC timestamp]` lines to stderr, and `teardown-railway.sh` checks existence before teardown, handles mid-teardown disappearance, and verifies deletion.
Wait loops now log on attempt 1, every 5th attempt, and the final attempt via `preview_should_log_wait_attempt`. The `resolve_runtime_vars` polling loop reports which variables remain unresolved via `runtime_var_is_unresolved`. A new `redact_preview_logs` filter strips PostgreSQL connection strings, secrets, auth cookies, and bearer tokens from log output.
`common.sh` · `teardown-railway.sh` · `provision-railway.sh`
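A minimal sketch of what the logging and redaction helpers described above might look like; the function names match the PR, but the bodies and redaction patterns here are illustrative, not the actual `common.sh` code:

```shell
# Emit a timestamped log line to stderr, in the [preview][UTC] format
# described in the PR.
preview_log() {
  printf '[preview][%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
}

# Strip obvious secrets from a log stream. The real filter also covers
# auth cookies and other secret shapes; two patterns shown here.
redact_preview_logs() {
  sed -E \
    -e 's#postgres(ql)?://[^[:space:]]+#postgres://[REDACTED]#g' \
    -e 's#[Bb]earer [A-Za-z0-9._-]+#Bearer [REDACTED]#g'
}
```

Logging to stderr keeps stdout clean for command substitution, which matters in scripts like these where function output is often captured.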
Clearer workflow job names for conditional steps
Before: Workflow job names like `Teardown Tier 1 (Railway)` didn't indicate their run conditions, making it unclear why they were skipped in logs.

After: Job names now include their trigger conditions, e.g. `Teardown Tier 1 (Railway, PR close only)` and `Preview Disabled Notice (runs only when previews are off)`.
Claude Opus
Low urgency. Solid operational hardening — the janitor, recreate path, SpiceDB retry, and structured logging all address real gaps. Two minor items below; neither is blocking.
Claude Opus
```shell
runtime_var_is_unresolved() {
  local value="${1:-}"

  [ -z "${value}" ] || printf '%s' "${value}" | grep -q '\$[{][{]'
}
```
Nit (readability): runtime_var_is_unresolved is defined here on line 238 but first referenced inside the resolve_runtime_vars function body starting at line 187. This works because bash evaluates function bodies lazily — resolve_runtime_vars isn't called until line 245 — but the ordering is surprising and fragile. Moving the definition above resolve_runtime_vars() (or into common.sh alongside the other helpers) would make the dependency explicit and prevent a future refactor from breaking call order.
Not blocking — the current code is correct at runtime.
```yaml
RAILWAY_PROJECT_ID: ${{ vars.RAILWAY_PROJECT_ID }}
GH_TOKEN: ${{ github.token }}
GITHUB_REPOSITORY: ${{ github.repository }}
DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run && 'true' || 'false' }}
```
Confirm intent: On schedule triggers, inputs.dry_run is undefined so this expression evaluates to 'false', meaning scheduled runs execute real deletes. The PR description indicates this is intentional (the janitor is meant to repair missed teardowns automatically). Just flagging for visibility — if you ever want scheduled runs to default to dry-run, this expression would need adjustment.
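For reference, a janitor-style script typically consumes this variable with a guard like the following sketch (`maybe_delete` is a hypothetical name; the real script's logic may differ):

```shell
# Illustrative consumer of the DRY_RUN env var set by the workflow
# expression above. Defaults to a real delete when unset, matching the
# schedule-trigger behavior flagged in this comment.
maybe_delete() {
  local env_name="$1"
  if [ "${DRY_RUN:-false}" = "true" ]; then
    echo "[dry-run] Would delete ${env_name}"
  else
    echo "Deleting ${env_name}"
  fi
}
```

Flipping the default to `"${DRY_RUN:-true}"` would make unset (i.e. scheduled) runs safe-by-default, at the cost of the janitor no longer repairing anything automatically.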
Pull request overview
This PR hardens the CI automation around ephemeral preview environments by adding a scheduled/manual “janitor” cleanup path and a manual “recreate preview” path, plus improved logging and retries to make long-running preview steps more diagnosable and resilient.
Changes:
- Add a new scheduled/manual Preview Janitor workflow to delete stale Railway `pr-*` environments and stale Vercel branch-scoped env vars.
- Add a manual `recreate_preview` input to force teardown + reprovision of a PR's Railway preview environment.
- Improve preview script resilience and diagnosability via structured phase logging, safer teardown, and transient SpiceDB transport retries.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `.github/workflows/preview-janitor.yml` | New scheduled/manual janitor workflow for Railway + Vercel stale preview cleanup. |
| `.github/workflows/preview-environments.yml` | Adds `recreate_preview` input plumbed into Railway provisioning. |
| `.github/scripts/preview/common.sh` | Adds `preview_log`, wait-loop logging helpers, safer JSON edge handling, and an env-deletion wait helper. |
| `.github/scripts/preview/provision-railway.sh` | Adds "recreate existing env" path and more structured provisioning logs. |
| `.github/scripts/preview/teardown-railway.sh` | Hardens teardown for missing/disappearing envs and uses the shared deletion-wait helper. |
| `.github/scripts/preview/bootstrap-preview-auth.sh` | Retries transient SpiceDB transport failures and adds more phase logging. |
| `.github/scripts/preview/cleanup-vercel-stale-env.sh` | Ensures PR lookup is repo-scoped when `GITHUB_REPOSITORY` is set. |
| `.github/scripts/preview/cleanup-stale-railway-envs.sh` | New Railway janitor script to enumerate `pr-*` envs, map them to PR state, and delete stale envs with verification. |
```shell
status="$(
  curl -sS \
    --connect-timeout 10 \
    --max-time 30 \
    -o "${response_file}" \
    -w '%{http_code}' \
    -H "Authorization: Bearer ${GH_TOKEN}" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/${GITHUB_REPOSITORY}/pulls/${pr_number}"
)"
```
The GitHub PR lookup is hard-coded to `https://api.github.com/...`. GitHub Actions exposes `GITHUB_API_URL` (and this also matters for GHES); consider building the endpoint from `${GITHUB_API_URL}` (falling back to `api.github.com`) to avoid breaking the janitor in non-public-GitHub contexts.
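A sketch of the suggested change, assuming only that `GITHUB_API_URL` is a default environment variable on Actions runners (the helper name is illustrative):

```shell
# Resolve the API base from the Actions-provided GITHUB_API_URL,
# falling back to the public endpoint when it is unset (e.g. locally).
github_api_base() {
  printf '%s' "${GITHUB_API_URL:-https://api.github.com}"
}

# Usage in the PR lookup (illustrative):
#   curl ... "$(github_api_base)/repos/${GITHUB_REPOSITORY}/pulls/${pr_number}"
```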
PR Review Summary
(3) Total Issues | Risk: Medium
🟠⚠️ Major (2) 🟠⚠️
Inline Comments:
- 🟠 Major: `cleanup-stale-railway-envs.sh:59` Deletion errors silently ignored — return status not checked
- 🟠 Major: `cleanup-stale-railway-envs.sh:100-105` Deletion failures not tracked in exit code — janitor may exit 0 despite failed cleanups
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor: `preview-environments.yml:389-390` Unused env vars passed to teardown job
💭 Consider (2) 💭
💭 1) `provision-railway.sh:238` Function definition ordering

Issue: `runtime_var_is_unresolved()` is defined after `resolve_runtime_vars()`, which references it.
Why: While this works correctly in bash (functions are resolved at call time, not definition time), the code structure is confusing. If someone reorders the script, it could break.
Fix: Move runtime_var_is_unresolved() definition to before line 167 for clearer declaration-before-use ordering.
💭 2) `preview-janitor.yml:56` Scheduled runs delete without dry-run
Issue: The DRY_RUN expression evaluates to false for scheduled runs, meaning the cron job will perform actual deletions.
Why: This differs from the manual dispatch default (dry_run: true). While this appears intentional for automated cleanup, it could surprise operators.
Fix: Add a comment documenting this intentional behavior, or change to default dry-run for scheduled runs if that's preferred.
💡 APPROVE WITH SUGGESTIONS
Summary: This is a solid infrastructure hardening PR that closes real gaps in preview environment lifecycle management. The janitor workflow and recreate path are well-designed. The two Major findings around error tracking in the janitor script should be addressed to ensure operators can trust the exit code — without this, failed cleanups could silently accumulate. The other items are minor polish.
Discarded (3)
| Location | Issue | Reason Discarded |
|---|---|---|
| `provision-railway.sh:238` | Function `runtime_var_is_unresolved` defined after use causes runtime error | False positive — bash resolves function references at call time, not definition time. The function IS defined before `resolve_runtime_vars` is called. Reclassified as Consider (code style). |
| `common.sh:27` | String equality instead of numeric comparison | Works correctly for numeric strings; purely stylistic preference |
| `bootstrap-preview-auth.sh:42` | Retry pattern may miss some transient errors | Low risk — current pattern covers common failure modes; adding more patterns is speculative |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-errors | 5 | 0 | 0 | 0 | 2 | 0 | 3 |
| pr-review-devops | 8 | 0 | 2 | 0 | 1 | 0 | 5 |
| pr-review-standards | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| Total | 15 | 0 | 2 | 0 | 3 | 0 | 10 |
Note: Standards reviewer findings were duplicates of devops findings (function order, dry-run behavior).
```shell
  preview_log "Deleting stale Railway preview environment ${env_name}."
  railway_environment_delete_by_id "${env_id}" >/dev/null
  railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}
```
🟠 MAJOR: Deletion errors are silently ignored
Issue: railway_environment_delete_by_id return status is not checked. If deletion fails, the function proceeds to railway_wait_for_environment_absent which will eventually fail with a generic timeout message.
Why: When debugging deletion failures, operators see "Railway environment still exists after waiting for deletion" but won't know why (permission denied, API error, rate limit). The actual Railway API error is discarded.
Fix: Check the return status before proceeding:
```shell
delete_env_and_verify() {
  local env_id="$1"
  local env_name="$2"
  preview_log "Deleting stale Railway preview environment ${env_name}."
  if ! railway_environment_delete_by_id "${env_id}"; then
    echo "Failed to delete Railway environment ${env_name}." >&2
    return 1
  fi
  railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}
```

Refs:
- `teardown-railway.sh:27` — similar pattern with explicit error check
```shell
  preview_log "[dry-run] Would delete stale Railway preview environment ${env_name}."
else
  delete_env_and_verify "${env_id}" "${env_name}"
  deleted=$((deleted + 1))
  deleted_names+=("${env_name}")
fi
```
🟠 MAJOR: Deletion failures not tracked in exit code
Issue: When delete_env_and_verify fails during the janitor loop, the error is not counted. The final exit code (line 153) only reflects GitHub API lookup errors (errors variable), not Railway deletion failures.
Why: The janitor could fail to delete multiple Railway environments due to API issues, but exit with code 0 if all GitHub PR lookups succeeded. Operators checking CI may incorrectly believe all targeted environments were cleaned up.
Fix: Add a deletion_failures counter:
```shell
# Add at line 77 with other counters:
deletion_failures=0

# In the delete blocks (lines 100-105 and 113-118):
if ! delete_env_and_verify "${env_id}" "${env_name}"; then
  deletion_failures=$((deletion_failures + 1))
else
  deleted=$((deleted + 1))
  deleted_names+=("${env_name}")
fi

# Update exit check at line 153:
if [ "${errors}" -gt 0 ] || [ "${deletion_failures}" -gt 0 ]; then
  echo "Encountered ${errors} GitHub lookup error(s) and ${deletion_failures} Railway deletion failure(s)." >&2
  exit 1
fi
```
PR Review Summary
(0) Total Issues | Risk: Low
✅ Prior Feedback Addressed
All issues from the previous review have been resolved in commit 8973aad63:
| Prior Issue | Resolution |
|---|---|
| 🟠 Major: Deletion errors silently ignored | ✅ Fixed — delete_env_and_verify now checks return status and returns 1 on failure (L58-61) |
| 🟠 Major: Deletion failures not tracked in exit code | ✅ Fixed — Added deletion_failures counter, tracked in loop, checked at exit (L82, L107-112, L163-165) |
| 🟡 Minor: Unused env vars passed to teardown | ✅ Fixed — RAILWAY_TEMPLATE_ENVIRONMENT and RAILWAY_OUTPUT_SERVICE removed from teardown job |
| 💭 Consider: Function definition ordering | ✅ Fixed — runtime_var_is_unresolved() moved to L121-125, before resolve_runtime_vars() which uses it |
| 💭 Consider: Scheduled runs delete without dry-run (undocumented) | ✅ Fixed — Comment added at L57 documenting intentional behavior |
| 🔵 Copilot: Hard-coded GitHub API URL | ✅ Fixed — GITHUB_API_URL now configurable with fallback (L16, L32) and passed from workflow (L55) |
Additional Improvements in Delta
The delta also includes GraphQL error handling improvements in `common.sh`:

- `railway_environment_create_from_source`, `railway_environment_delete_by_id`, and `railway_variable_collection_upsert` now properly check for GraphQL errors and return failure status
- This hardens the entire Railway mutation surface against silent GraphQL-level failures
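A minimal sketch of such a GraphQL-level error gate (the function name is illustrative, and it assumes `python3` for JSON parsing, though the real helpers may parse differently):

```shell
# Return 0 (error present) when the GraphQL response body carries a
# non-empty "errors" array, 1 otherwise. GraphQL servers return HTTP 200
# even for failed mutations, so curl's status alone is not enough.
graphql_has_errors() {
  printf '%s' "$1" | python3 -c '
import json, sys

doc = json.load(sys.stdin)
sys.exit(0 if doc.get("errors") else 1)
'
}
```

In a mutation wrapper, a positive check would log the error payload and return non-zero so callers can retry or abort, rather than silently proceeding on a failed mutation.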
✅ APPROVE
Summary: All prior review feedback has been addressed comprehensively. The janitor now properly tracks and reports deletion failures, function ordering is cleaner, and the behavior of scheduled vs manual runs is documented. The additional GraphQL error handling improvements further strengthen the reliability of the preview infrastructure. Ship it! 🚀
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review — no new reviewer dispatches needed. All prior issues resolved.
PR Review Summary
(0) Total Issues | Risk: Low
This is a delta review scoped to the 2 commits since the last automated review (3b0164a67, 64b34fd64).
✅ Delta Changes Look Good
The delta adds a Railway SpiceDB deployment readiness gate before probing SpiceDB gRPC readiness. Key observations:
| Change | Assessment |
|---|---|
| `railway_wait_for_service_deployment_ready()` | Well-implemented with proper terminal state handling (FAILED, CRASHED, REMOVED, SKIPPED) |
| Configurable curl timeouts in `railway_graphql()` | Good improvement for polling queries |
| Timeout reduction (20→10 min) | Tight but adequate (~6.5 min worst case calculated) |
| Removal of retry wrapper from SpiceDB readiness probe | Correct — now redundant with deployment gate |
💭 Consider (1) 💭
Inline Comments:
- 💭 Consider: `bootstrap-preview-auth.sh:112` Add logging when Railway deployment gate is bypassed
🕐 Pending Recommendations (0)
All issues from the previous review were addressed. No pending items.
✅ APPROVE
Summary: The delta changes are solid infrastructure hardening. The new deployment readiness gate properly sequences the SpiceDB gRPC probe after Railway reports the deployment is ready, which addresses the root cause of transient bootstrap failures. Terminal deployment states are handled correctly with descriptive error messages. Ship it! 🚀
Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-devops | 3 | 0 | 0 | 0 | 0 | 0 | 3 |
| pr-review-errors | 3 | 0 | 1 | 0 | 1 | 0 | 1 |
| Total | 6 | 0 | 1 | 0 | 1 | 0 | 4 |
Note: Delta review — reviewer output was lighter given limited scope. Devops findings were informational (timeout math, cosmetic logging). Error reviewer's single-attempt query concern was duplicate of devops finding. Unhandled deployment status finding was overly defensive (current behavior is reasonable).
Discarded (4)
| Location | Issue | Reason Discarded |
|---|---|---|
| `preview-environments.yml:230` | Timeout reduced from 20 to 10 minutes | Valid observation, but the math works out (~6.5 min worst case fits in the 10 min budget). Intentional budget tightening. |
| `common.sh:417` | Single-attempt GraphQL query in polling loop | Outer polling loop provides resilience; single attempt per poll iteration is intentional for efficiency |
| `common.sh:478` | Unhandled deployment statuses continue polling | Correct behavior for in-progress states (BUILDING, DEPLOYING, etc.); the fallthrough is intentional |
| `bootstrap-preview-auth.sh:129` | Step summary logs gate even when skipped | Cosmetic; shows intended service name regardless of gate execution |
```shell
    15 \
    4
  echo "::endgroup::"
fi
```
💭 Consider: Add logging when Railway deployment gate is bypassed
Issue: When Railway env vars (RAILWAY_API_TOKEN, RAILWAY_PROJECT_ID, RAILWAY_SPICEDB_SERVICE, PR_NUMBER) are not all set, the deployment gate block (lines 99-112) is silently skipped with no log output.
Why: If the SpiceDB readiness probe fails downstream, operators reviewing CI logs won't see why the deployment gate was bypassed, making it harder to diagnose whether the issue is missing config vs. actual deployment problems.
Fix: Add an else branch to clarify when the gate is intentionally skipped:
```shell
else
  preview_log "Skipping Railway deployment gate (Railway env vars not fully configured)."
fi
```

This is optional since the gate is designed to be conditional, but explicit logging aids future debugging.
PR Review Summary
(0) Total Issues | Risk: Low
This is a delta review scoped to the 1 commit since the last automated review (99802da8f).
✅ Delta Changes Look Good
The delta adds a create_preview_environment() retry wrapper that hardens Railway environment creation against API race conditions after delete. Key observations:
| Change | Assessment |
|---|---|
| `create_preview_environment()` retry wrapper | Well-implemented with up to 6 attempts and exponential backoff |
| "Already exists" race condition handling | Correctly checks whether the env actually exists after a create failure, and handles it gracefully |
| Error capture via temp file | Proper cleanup in all code paths, clear error messages to stderr |
| `railway_wait_for_environment_absent` with `\|\| true` | Intentional — wait is best-effort before retry, not a hard requirement |
| Variable rename `status` → `http_status` | Cosmetic clarity improvement, no behavior change |
Implementation Quality
The new function demonstrates solid defensive programming:

- Idempotency-aware: Treats "create returned error but env now exists" as success
- Clear logging: Each attempt logged with `preview_log`; race condition retries explained
- Proper backoff: Uses `sleep_with_backoff_and_jitter 2 "${attempt}" 10` (2s base, 10s cap)
- Fail-fast on real errors: Non-transient errors exit immediately with a descriptive message
- Resource cleanup: Temp files removed in all exit paths
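The race-handling shape described above can be sketched as follows (`try_create_env` and `env_exists` are hypothetical stand-ins for the real Railway helpers; the backoff sleeps are omitted for brevity):

```shell
# Bounded retry with idempotency check: a failed create followed by a
# successful existence lookup is treated as success, because Railway can
# report "already exists" for a name it is still releasing post-delete.
create_with_race_handling() {
  local env_name="$1" max_attempts=6 attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if try_create_env "${env_name}"; then
      return 0
    fi
    # Create failed; if the env exists now, the create effectively won.
    if env_exists "${env_name}"; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

The key design choice is checking actual state after a failed mutation instead of trusting the error message, which makes the wrapper safe against Railway's eventual consistency.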
🕐 Pending Recommendations (1)
- 💭 `bootstrap-preview-auth.sh:112` Add logging when Railway deployment gate is bypassed (from prior review — still unaddressed)
✅ APPROVE
Summary: The delta is a well-crafted reliability improvement for the recreate_preview path. The new retry wrapper correctly handles Railway API eventual consistency after environment deletion. The implementation follows established patterns in this codebase (exponential backoff, structured logging, proper error propagation). Ship it! 🚢
Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-devops | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-errors | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review — both reviewers found the implementation to be solid with no issues. The retry pattern, error handling, and logging are all appropriate for the use case.
PR Review Summary
(0) Total Issues | Risk: Low
This is a delta review scoped to the 1 commit since the last automated review (99802da8f).
✅ Delta Changes Look Good
The delta contains only cosmetic job name clarifications with no behavior changes:
| Job | Before | After |
|---|---|---|
| `preview-disabled` | Preview Flow Disabled | Preview Disabled Notice (runs only when previews are off) |
| `teardown-tier1` | Teardown Tier 1 (Railway) | Teardown Tier 1 (Railway, PR close only) |
| `teardown-vercel` | Teardown Vercel Preview Env Vars | Teardown Vercel Preview Env Vars (PR close only) |
These changes improve documentation/readability by clarifying:
- When the preview-disabled job runs
- That teardown jobs only run on PR close events
No functional changes. No issues identified.
🕐 Pending Recommendations (1)
- 💭 `bootstrap-preview-auth.sh:112` Add logging when Railway deployment gate is bypassed (from prior review — still unaddressed, optional polish)
✅ APPROVE
Summary: The delta since the last review is trivial — just three job name clarifications that improve CI log readability. All substantive issues from prior reviews have been addressed. The one remaining item (logging when the deployment gate is bypassed) is optional polish and not blocking. Ship it! 🚀
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review — no reviewer dispatches needed. Changes are purely cosmetic (job name clarifications).
* ci: add preview state janitor and recreate path
* ci: simplify preview janitor and var resolution
* ci: address preview janitor review feedback
* ci: gate preview auth on SpiceDB deployment readiness
* ci: tighten preview bootstrap retry budget
* ci: retry preview recreate after Railway delete
* ci: clarify skipped preview workflow jobs
Summary
Harden preview environment lifecycle management so stale preview state is repaired automatically, individual PR previews can be force-recreated when they drift, and preview auth bootstrap fails faster and more transparently when SpiceDB is not actually ready yet.
Changes
- Adds a `Preview Janitor` workflow that cleans stale Railway `pr-*` environments (and can also run stale Vercel preview env-var cleanup): it lists `pr-*` environments, maps them back to GitHub PR state, deletes merged/closed PR envs, and verifies deletion
- Adds a `recreate_preview` workflow input so a PR can force teardown and reprovision of its Railway preview environment when reused preview state has drifted
- Hardens `recreate_preview` against Railway's delete/create name-reuse race by retrying create when Railway still reports the just-deleted env name as existing
- Gates preview auth bootstrap on the `spicedb` service deployment being ready, not just env existence / proxy creation

Why
We were still seeing three classes of preview-state problems:
- Stale `pr-*` environments were left behind when PR close events were missed, with nothing cleaning them up afterwards.
- Reused `pr-*` environments could drift and require manual cleanup or full reprovision.
- `Bootstrap Preview Auth` could still race SpiceDB startup even after Railway provisioning had succeeded, leading to `UNAVAILABLE`, `ECONNRESET`, and `Failed to connect` failures during the schema probe.

This PR addresses all three.
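The janitor's stale-env decision pass can be sketched as a small shell helper. This is a minimal sketch, not the actual script: `pr_number_from_env_name` and `janitor_action` are illustrative names, and the real workflow wires the decision to Railway deletion and GitHub PR lookups.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the janitor's core decision logic.
set -euo pipefail

DRY_RUN="${DRY_RUN:-true}"   # manual dispatch defaults to dry-run

# Map a preview env name like "pr-2901" to its PR number.
# Prints nothing and returns 1 for names that don't match the pattern.
pr_number_from_env_name() {
  local name="$1"
  [[ "$name" =~ ^pr-([0-9]+)$ ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

# Given an env name and the PR state GitHub reports for it, print the
# action the janitor would take. Only MERGED/CLOSED PRs are stale.
janitor_action() {
  local env_name="$1" pr_state="$2"
  case "$pr_state" in
    MERGED|CLOSED)
      if [ "$DRY_RUN" = "true" ]; then
        echo "would delete $env_name"
      else
        echo "delete $env_name"
      fi
      ;;
    *) echo "keep $env_name" ;;
  esac
}
```

In the real workflow this decision would sit inside a loop over the Railway environment list, followed by a delete-and-verify step for each stale env.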
`recreate_preview` gives a first-class recovery path for one PR.

It also closes one Railway-specific edge case in the recreate flow itself: after deleting a preview environment, Railway can briefly reject a fresh create with `An environment with that name already exists` even while the environment lookup returns absent. The recreate path now treats that as a bounded retry case instead of failing the run.

Confirmed Stale Railway Envs
Before cleanup, the Railway project still had these stale preview environments for merged PRs:
- `pr-2901`
- `pr-2902`
- `pr-2906`
- `pr-2907`
- `pr-2908`
- `pr-2909`

I confirmed each of those PRs is closed/merged in GitHub.
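The merged/closed spot-check can be scripted with the GitHub CLI. A sketch, assuming an authenticated `gh`; `state_is_terminal` and `check_prs` are illustrative names, not part of the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pure helper so the merged/closed decision is testable without `gh`.
state_is_terminal() {
  case "$1" in MERGED|CLOSED) return 0 ;; *) return 1 ;; esac
}

# Ask GitHub for each PR's state; `gh pr view --json state --jq .state`
# prints OPEN, CLOSED, or MERGED.
check_prs() {
  local pr state
  for pr in "$@"; do
    state="$(gh pr view "$pr" --json state --jq .state)"
    if state_is_terminal "$state"; then
      echo "PR #$pr is $state; env pr-$pr is safe to delete"
    else
      echo "PR #$pr is $state; keep env pr-$pr"
    fi
  done
}

# Example invocation (commented out; needs network + auth):
# check_prs 2901 2902 2906 2907 2908 2909
```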
After the live cleanup, a follow-up janitor dry-run reported `19` Railway preview environments, all tied to open PRs, with `0` stale environments left to delete.

What Caused Those Misses
Those stale environments all came from a transition window before the earlier preview hardening had landed on `main`:

- 2026-03-30T21:31:52Z
- 2026-03-31T02:14:47Z

All six stale PRs above were merged before those fixes were on `main`. Their close-event `Preview Environments` workflow did fire, but the run failed in the old `Teardown Tier 1 (Railway)` path immediately after merge, which left the Railway env behind.

This PR accounts for that class of miss in two ways:

- The scheduled janitor now deletes any stale env that a failed or missed close-event teardown leaves behind.
- The teardown path itself is hardened (existence check before teardown, tolerance for mid-teardown disappearance, deletion verification), so the close-event run fails less often in the first place.
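The teardown hardening in this PR (existence check, mid-teardown disappearance handling, deletion verification) can be sketched like this, with the Railway lookup/delete calls stubbed behind `env_exists`/`env_delete` — both illustrative placeholders, not the real script's API:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stubs: replace with real Railway lookup/delete calls in the actual script.
env_exists() { [ -e "/tmp/preview-env-$1" ]; }
env_delete() { rm -f "/tmp/preview-env-$1"; }

teardown_env() {
  local name="$1"
  if ! env_exists "$name"; then
    echo "env $name already absent; nothing to tear down"
    return 0
  fi
  # A concurrent close event may delete the env mid-teardown, so a
  # failed delete is only fatal if the env still exists afterwards.
  env_delete "$name" || true
  if env_exists "$name"; then
    echo "env $name still present after delete" >&2
    return 1
  fi
  echo "env $name deleted and verified"
}
```

The key property is idempotence: re-running teardown against an already-gone env succeeds instead of failing the workflow.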
SpiceDB Readiness Fix
The remaining preview flake we were still seeing was not Railway provisioning itself. Recent failures on other PRs showed this pattern:
- `wait-for-spicedb.ts` timed out with repeated `UNAVAILABLE`, `ECONNRESET`, or `Failed to connect` errors

The problem was that env existence and TCP proxy readiness were not strong enough readiness signals. SpiceDB could still be deploying even though the proxy existed.
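Transient failures of that shape can be retried in the spirit of the shared `run_with_transient_spicedb_retry` helper: rerun the command only when its output matches the known transient error classes. A sketch (`retry_transient` and its parameters are illustrative, and the real helper uses exponential backoff with jitter rather than a fixed sleep):

```shell
#!/usr/bin/env bash
set -euo pipefail

retry_transient() {
  local max_attempts="$1"; shift
  local attempt=1 out
  while :; do
    if out="$("$@" 2>&1)"; then
      printf '%s\n' "$out"
      return 0
    fi
    # Only retry the error classes seen during SpiceDB startup races.
    if ! grep -qE 'UNAVAILABLE|ECONNRESET|Failed to connect' <<<"$out"; then
      printf '%s\n' "$out" >&2
      return 1
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$out" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1   # placeholder; real helper backs off with jitter
  done
}
```

Non-transient failures still fail fast, which keeps the retry budget from hiding real misconfiguration.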
This PR now:
- Waits for the `spicedb` service instance to report a ready deployment state first
- Tightens the bootstrap retry budget so genuine failures surface faster

What We Verified
- `bash -n` passes for the changed preview shell scripts
- `git diff --check` passes
- Ran the janitor against the `agents` Railway project in dry-run mode before cleanup
- The live cleanup run finished `SUCCESS`
- Exercised the recreate path against a real `pr-*` env, and the patched recreate logic recovered successfully
- `Provision Tier 1 (Railway)`, `Bootstrap Preview Auth`, `Inject Vercel Preview Env Vars`, and `Smoke Test Preview URLs` all passed on this PR's preview run
- A `workflow_dispatch` run with `recreate_preview=true` also passed end to end on the updated head: `23861566947`

Test Plan
- `bash -n .github/scripts/preview/*.sh` for the changed preview scripts
- Validate `.github/workflows/preview-environments.yml` and `.github/workflows/preview-janitor.yml`
- `git diff --check`
- Dry-run `cleanup-stale-railway-envs.sh` against the Railway project
- Force-recreate a real `pr-*` env
- Green `Preview Environments` run on this PR after the final SpiceDB readiness changes
- `workflow_dispatch` run with `recreate_preview=true` completes on the updated head
- Enable the scheduled `Preview Janitor` workflow in GitHub after merge

Notes
`pnpm check` is still failing in this workspace because `@inkeep/agents-api` test boot requires `INKEEP_AGENTS_MANAGE_DATABASE_URL` and `INKEEP_AGENTS_RUN_DATABASE_URL` to be set. That failure is repo/test-environment baseline and unrelated to the preview-shell/workflow changes in this PR.
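For anyone hitting that baseline failure locally, a minimal workaround sketch is to export the two database URLs before rerunning the checks. The connection strings below are placeholders, not real endpoints; point them at whatever Postgres instances your setup uses.

```shell
# Placeholder connection strings; substitute your own databases.
export INKEEP_AGENTS_MANAGE_DATABASE_URL="postgres://user:pass@localhost:5432/inkeep_manage"
export INKEEP_AGENTS_RUN_DATABASE_URL="postgres://user:pass@localhost:5432/inkeep_run"
# pnpm check   # should now get past the env-var boot requirement
```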