fix(ci): teardown Vercel preview env vars on PR close#2921
Conversation
Preview env var upserts accumulate branch-scoped entries across Vercel projects but never clean them up, eventually hitting Vercel's 64KB envs_size_too_large limit and blocking all preview deployments. - Add teardown-vercel-preview-env.sh to delete branch-scoped env vars for a closed PR's branch from both Vercel projects - Add teardown-vercel job to preview-environments.yml, runs in parallel with teardown-tier1 on PR close - Add cleanup-vercel-stale-env.sh one-time script to purge all existing branch-scoped preview env vars (safe — open PRs re-create on next push) - Add cleanup_stale_env workflow_dispatch input to trigger cleanup from the Actions UI Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
|
TL;DR — Branch-scoped Vercel preview env vars were never cleaned up when PRs closed, eventually hitting Vercel's 64 KB env size limit and blocking all preview deployments. This PR adds automatic per-PR teardown on close and a local bulk cleanup script to purge accumulated stale vars, backed by a shared paginated safety layer that refuses to delete any env var not exclusively targeting the Key changes
Summary | 5 files | 12 commits | base: Automatic env var teardown on PR close
The job mirrors the existing
Local bulk cleanup script with open-PR awareness
The script uses the
Shared paginated safety layer in
|
There was a problem hiding this comment.
Clean, well-structured fix for the envs_size_too_large blocker. The root cause analysis is solid and the two-pronged approach (per-PR teardown + one-time bulk cleanup) is the right call. One issue worth addressing around API pagination in the cleanup script.
Claude Opus | 𝕏
| --connect-timeout 10 \ | ||
| --max-time 60 \ | ||
| -H "Authorization: Bearer ${VERCEL_TOKEN}" \ | ||
| "https://api.vercel.com/v10/projects/${project_id}/env?teamId=${VERCEL_ORG_ID}" \ |
There was a problem hiding this comment.
The Vercel GET /v10/projects/{id}/env endpoint is paginated (default limit 20). Since this script's entire purpose is to purge all accumulated branch-scoped env vars — potentially hundreds — a single un-paginated request will silently miss everything beyond the first page.
Consider either passing ?limit=100 (the Vercel max), or looping with pagination until .pagination.next is absent. Given that this script exists precisely because env vars accumulated past reasonable limits, pagination handling seems important for the cleanup to actually work.
The per-PR teardown script has the same gap in theory, but in practice a single branch creates ~15 vars per project (within one page), so it's lower risk there.
| fi | ||
| done <<< "${ids}" | ||
|
|
||
| echo "Deleted ${count} env var(s) for project ${project_id} branch ${branch}." |
There was a problem hiding this comment.
This reports Deleted ${count} but count is the number found, not the number successfully deleted. If any deletion fails (line 54-61 warns but continues), the final message overstates. The cleanup script handles this correctly with separate deleted/failed counters — consider the same pattern here for consistency.
| echo "Deleted ${count} env var(s) for project ${project_id} branch ${branch}." | |
| echo "Deleted env var(s) for project ${project_id} branch ${branch} (${count} found)." |
| VERCEL_MANAGE_UI_PROJECT_ID \ | ||
| PR_BRANCH | ||
|
|
||
| if ! [[ "${PR_BRANCH}" =~ ^[A-Za-z0-9._/-]+$ ]]; then |
There was a problem hiding this comment.
Nit: this regex doesn't allow + in branch names (e.g., dependabot/npm_and_yarn/lodash-4.17.21+build). Not critical if your branch naming convention excludes it, but worth noting.
| name: Cleanup Stale Vercel Preview Env Vars | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 10 | ||
| if: inputs.cleanup_stale_env == true |
There was a problem hiding this comment.
cleanup-vercel-stale-env has no needs: [compute-context] dependency, which is correct since it only uses inputs.*. But it also has no guard against running on pull_request events — inputs.cleanup_stale_env will be null/empty on PR triggers, which evaluates falsy in the if expression, so this is safe. Just confirming this is intentional and correct.
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
This PR correctly addresses the root cause of envs_size_too_large errors by adding cleanup mechanisms for branch-scoped Vercel preview env vars. The implementation is well-structured and follows established patterns from existing preview scripts.
💭 Consider (2) 💭
💭 1) cleanup-vercel-stale-env.sh:76 / teardown-vercel-preview-env.sh:60 Capture API errors for debugging
Issue: Both scripts suppress curl DELETE output (>/dev/null 2>&1), so when deletions fail the warning message doesn't include the actual Vercel API error response.
Why: Makes it harder to diagnose issues like rate limiting, auth problems, or API changes. The teardown script runs on every PR close, so silent failures could accumulate.
Fix: Capture stderr to a temp file and include the API error in warning output, similar to how the list operation already captures and reports errors:
local err_file
err_file="$(mktemp)"
if ! curl ... -X DELETE ... 2>"${err_file}"; then
echo "Warning: failed to delete ${env_key}: $(cat "${err_file}")" >&2
rm -f "${err_file}"
fiRefs:
💭 2) cleanup-vercel-stale-env.sh Document exit behavior on partial failures
Issue: The cleanup script continues after individual env var deletion failures and always exits 0, even when some deletions fail.
Why: For a one-time cleanup tool this is reasonable (partial progress is better than failing early), but operators should be aware that the workflow job will succeed even with partial failures.
Fix: Add a comment documenting this intentional behavior, or optionally return non-zero when failed > 0 if strict failure semantics are preferred.
Refs:
✅ APPROVE
Summary: Clean fix for a real production issue. The scripts follow established patterns (sourcing common.sh, using set -euo pipefail, proper input validation, timeouts), and the workflow conditions are correctly scoped. Ship it! 🚀
Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
4 | 0 | 2 | 0 | 0 | 0 | 2 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 4 | 0 | 2 | 0 | 0 | 0 | 2 |
Note: 2 INFO-level findings from devops reviewer were discarded (informational notes, not actionable issues).
…vars Add vercel_list_preview_only_env_vars helper to common.sh that aborts if any branch-scoped env var targets production or development. Both teardown and cleanup scripts now use this helper and additionally filter with (.target | sort) == ["preview"] in jq selects. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
This re-review covers the delta since the last automated review: commit 3ee8363 ("fix(ci): safeguard Vercel env teardown to only delete preview-target vars").
Changes Reviewed
The delta adds a good safety improvement:
-
New shared helper
vercel_list_preview_only_env_vars()incommon.sh:- Centralizes env var listing logic
- Safety check: Refuses to proceed if any branch-scoped vars target production or development (not just preview)
- Returns JSON for callers to filter
-
Script refactoring:
- Both cleanup and teardown scripts now use the shared helper
- Additional filter ensures only
target == ["preview"]vars are deleted
The safety check is intentionally conservative — it protects against accidentally deleting vars that might affect production or development environments, even if they're branch-scoped.
🕐 Pending Recommendations (1)
- 🟡
teardown-vercel-preview-env.sh:56Deletion count reports attempted vs actual deletions (pullfrog)
✅ APPROVE
Summary: The delta adds a well-designed safety layer to prevent accidental deletion of non-preview env vars. The conservative "fail-safe" approach is appropriate for infrastructure automation. The only outstanding item is the minor count accuracy issue raised in the prior review. Ship it! 🚀
Discarded (1)
| Location | Issue | Reason Discarded |
|---|---|---|
common.sh:417-419 |
Safety check might reject multi-target preview vars (e.g., ["development", "preview"]) |
Intentionally conservative behavior — the safety check's purpose is to prevent accidental deletion of vars that affect non-preview environments. Failing safe is correct here. |
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
2 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
- Paginate vercel_list_preview_only_env_vars with limit=100 and cursor loop so cleanup handles hundreds of accumulated env vars - Track actual deleted count in teardown script (was reporting found count regardless of failures) - Allow + in branch name regex for dependabot-style branches Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Only deletes Vercel preview env vars for branches with no open PR. Branches with open PRs are skipped and reported. Requires gh CLI. Supports DRY_RUN=true to preview what would be deleted. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Low
This re-review covers the delta since the last automated review (commit 3ee8363). The 3 new commits address prior feedback well:
- ✅ Pagination:
common.shnow correctly paginates Vercel API with cursor-based iteration - ✅ Delete count: Teardown script now tracks actual successful deletions vs attempted
- ✅ Branch regex: Now includes
+and=characters for dependabot-style branches - ✅ Open PR check: Cleanup script now queries GitHub before deleting (smart addition)
- ✅ Local-only cleanup: Removed workflow dispatch input — bulk cleanup is now a local script only (good design choice)
🟠⚠️ Major (1) 🟠⚠️
🟠 1) cleanup-vercel-stale-env.sh:25-30 gh CLI failure silently treats branches as stale
Issue: The branch_has_open_pr() function uses || echo "0" to handle gh pr list failures, which means any failure (network issues, rate limiting, expired auth) returns 0, causing the function to return false (no open PR). This classifies the branch as "stale" when it might actually have an open PR.
Why: If gh fails for all branches (e.g., auth expired), every branch would be incorrectly classified as stale, leading to deletion of env vars for branches that actually have open PRs. While this script is now local-only (not CI), it's a one-time cleanup tool where silent failure modes are particularly risky.
Fix: Distinguish between "zero PRs found" (legitimate) and "command failed" (should abort). Options:
branch_has_open_pr() {
local branch="$1"
local output=""
local exit_code=0
output="$(gh pr list --head "${branch}" --state open --json number --jq 'length' 2>&1)" || exit_code=$?
if [ "${exit_code}" -ne 0 ]; then
echo "Error: gh pr list failed for branch ${branch}: ${output}" >&2
return 2 # Distinct exit code for "failed to check"
fi
[ "${output}" -gt 0 ]
}Then in the caller, treat exit code 2 as "skip this branch" or "abort entirely" rather than "no open PR".
Alternatively, run gh auth status once at script start to fail fast if auth is invalid.
Refs:
💭 Consider (1) 💭
💭 1) cleanup-vercel-stale-env.sh:71-73 Empty array expansion edge case in older bash
Issue: With set -u, ${open_branches[@]} on an empty array can cause "unbound variable" errors in bash < 4.4.
Why: The printf is inside an if block checking array length, so it's guarded. This is a very edge-case issue since GitHub Actions runners and most dev machines use recent bash. Since this is a local-only script, the risk is minimal.
Fix: If you want to be defensive, use "${open_branches[@]+"${open_branches[@]}"}" pattern, but this is optional.
Refs:
💡 APPROVE WITH SUGGESTIONS
Summary: The delta addresses all prior feedback cleanly. The pagination implementation is correct, delete counts are now accurate, and removing the workflow dispatch was the right call. The one remaining issue (silent gh failure) is worth fixing since this is a cleanup tool where silent failure modes can cause unintended deletions. That said, it's a local-only script with DRY_RUN support, so the overall risk is manageable. Ship it once the gh error handling is tightened! 🚀
Discarded (2)
| Location | Issue | Reason Discarded |
|---|---|---|
common.sh:409-435 |
Pagination implementation is correct | INFO-level positive observation, not an issue |
preview-environments.yml |
Removing cleanup_stale_env dispatch is correct design |
INFO-level positive observation — local-only for bulk ops is appropriate |
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
4 | 1 | 1 | 0 | 0 | 0 | 2 |
| Total | 4 | 1 | 1 | 0 | 0 | 0 | 2 |
Note: Prior pullfrog feedback (delete count accuracy) was addressed in commit 04e8f4786 — no longer pending.
Checks macOS and Linux Vercel CLI auth paths so you only need to provide the three ID env vars when running locally. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Low
This re-review covers the delta since the last automated review (commit 5bcdce3b9d9c). The new commit (f712904bb) is an empty CI re-trigger commit with no code changes.
Delta Analysis
✅ No code changes — The single commit since the last review (ci: re-trigger CI (flaky SpiceDB E2E test)) contains no file modifications. This is a CI re-trigger for a flaky E2E test.
🕐 Pending Recommendations (1)
- 🟠
cleanup-vercel-stale-env.sh:60-65ghCLI failure silently treats branches as stale — raised in prior review, still unaddressed
The branch_has_open_pr() function uses || echo "0" which treats any gh CLI failure (network, rate limiting, expired auth) as "no open PR", potentially causing deletion of env vars for branches that actually have open PRs.
Risk context: This is a local-only cleanup script (not CI), so the blast radius is limited to manual operator runs. The DRY_RUN mode also provides a safety check before committing to deletions.
💡 APPROVE WITH SUGGESTIONS
Summary: No code changes in this delta — just a CI re-trigger commit. The implementation from prior commits remains solid. The one pending issue (silent gh CLI failure handling) is worth addressing but not blocking given the script is local-only with DRY_RUN support. Ship it! 🚀
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| Total | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Note: Delta-scoped review — no reviewers dispatched as delta contains no code changes.
Ito Test Report ✅11 test cases ran. 11 passed. All 11 test cases passed, confirming the preview-environment teardown workflow is reliable and safe in non-production, deterministic runs with simulated GitHub/Vercel interactions and no live destructive calls. Key findings were that teardown executes correctly for internal PR close events (including exact branch matching with special characters, pagination beyond 100 vars, and mixed per-project states), no-ops cleanly when nothing matches, remains gated off for synchronize/external-fork/manual-non-close events, and enforces safeguards through idempotent repeat runs, non-preview target refusal, and early rejection of malicious branch input. ✅ Passed (11)Commit: Tell us how we did: Give Ito Feedback |
The CI workflow starts SpiceDB but never writes the schema, causing intermittent failures when tests try to create projects: "FAILED_PRECONDITION: object definition 'project' not found" Install zed CLI and write the schema after DB migrations, before tests. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Low
This re-review covers the delta since the last automated review (commit f712904bb). The new commit (e2bd0e36e) adds a SpiceDB schema write step before E2E tests.
Delta Analysis
✅ Clean, correct change — The step:
- Installs the
zedCLI and writes the SpiceDB schema before E2E tests - Placement is correct (after migrations, before E2E tests)
- Token matches the SpiceDB container's preshared key
- Resolves the flaky E2E test issue where schema wasn't initialized
💭 Consider (1) 💭
💭 1) .github/workflows/ci.yml:380 Unpinned zed version
Issue: go install github.com/authzed/zed/cmd/zed@latest pulls the latest version on every CI run, which can cause non-deterministic behavior if zed introduces breaking changes.
Why: Minor reliability concern for CI reproducibility. However, this is consistent with the existing pattern — the authzed/spicedb Docker image is also unpinned (line 356).
Fix: Optionally pin to a specific version: go install github.com/authzed/zed/cmd/[email protected] (or current stable). Not blocking.
Refs:
🕐 Pending Recommendations (1)
- 🟠
cleanup-vercel-stale-env.sh:60-65ghCLI failure silently treats branches as stale — raised in prior review, still unaddressed
✅ APPROVE
Summary: The delta is a straightforward CI fix that correctly initializes the SpiceDB schema before E2E tests. The one Consider item (version pinning) is consistent with existing patterns in the workflow. The prior pending issue (gh CLI silent failure) remains from earlier reviews but is outside this delta's scope and affects a local-only script with DRY_RUN support. Ship it! 🚀
Discarded (2)
| Location | Issue | Reason Discarded |
|---|---|---|
ci.yml:380 |
Go toolchain dependency not cached | Optimization opportunity, not a defect. The current approach works correctly. |
ci.yml:378 |
Consider reusing existing TS schema write infra | Informational context. The CLI approach is appropriate for standalone schema application before E2E tests. |
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
3 | 0 | 1 | 0 | 0 | 1 | 2 |
| Total | 3 | 0 | 1 | 0 | 0 | 1 | 2 |
Note: Delta-scoped review — only reviewed changes since commit f712904bb.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
This re-review covers the delta since the last automated review (commit e2bd0e36ea98). The single commit (e0147135e) adds GOPATH/bin to PATH for zed CLI.
Delta Analysis
✅ Trivial, correct fix — The change adds:
export PATH="$(go env GOPATH)/bin:${PATH}"This ensures the zed binary installed by go install is available in the subsequent zed schema write command. Without this, the shell wouldn't find zed since go install places binaries in GOPATH/bin which isn't in PATH by default.
Correctness:
$(go env GOPATH)correctly retrieves the Go workspace path- Adding to beginning of PATH ensures the newly installed binary takes precedence
- The export happens in the same
run:block so it's available for the next command in the script
🕐 Pending Recommendations (1)
- 🟠
cleanup-vercel-stale-env.sh:60-65ghCLI failure silently treats branches as stale — raised in prior review, still unaddressed
The branch_has_open_pr() function uses || echo "0" which treats any gh CLI failure (network, rate limiting, expired auth) as "no open PR", potentially causing deletion of env vars for branches that actually have open PRs.
Risk context: This is a local-only cleanup script (not CI), so the blast radius is limited to manual operator runs. The DRY_RUN mode also provides a safety check before committing to deletions.
✅ APPROVE
Summary: The delta is a minimal, correct fix that ensures the zed CLI is available after go install. No new issues. The prior pending issue (gh CLI silent failure) remains from earlier reviews but affects a local-only script with DRY_RUN support — not blocking. Ship it! 🚀
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| Total | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Note: Delta-scoped review — no reviewers dispatched as delta is a single-line PATH fix with no complexity requiring specialist review.
Ito Test Report ✅12 test cases ran. 12 passed. All 12 test cases passed with no confirmed production defects, and the unified run consistently validated PR-close teardown behavior under deterministic non-production simulation/local fixtures. Key findings were that eligible closes removed exact-match branch-scoped preview env vars with deleted/total reporting across both projects, ineligible gating correctly skipped teardown and preserved data, no-op/pagination/special-character/race-idempotency/lookalike-collision paths behaved correctly, safety guards refused non-preview targets while partial delete failures remained visible, navigation and mobile observability evidence stayed actionable, and CI correctly orders migrations then SpiceDB schema write (localhost:50051 with dev-secret-key) before create-agents E2E. ✅ Passed (12)Commit: Tell us how we did: Give Ito Feedback |
* fix(ci): teardown Vercel preview env vars on PR close Preview env var upserts accumulate branch-scoped entries across Vercel projects but never clean them up, eventually hitting Vercel's 64KB envs_size_too_large limit and blocking all preview deployments. - Add teardown-vercel-preview-env.sh to delete branch-scoped env vars for a closed PR's branch from both Vercel projects - Add teardown-vercel job to preview-environments.yml, runs in parallel with teardown-tier1 on PR close - Add cleanup-vercel-stale-env.sh one-time script to purge all existing branch-scoped preview env vars (safe — open PRs re-create on next push) - Add cleanup_stale_env workflow_dispatch input to trigger cleanup from the Actions UI Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): safeguard Vercel env teardown to only delete preview-target vars Add vercel_list_preview_only_env_vars helper to common.sh that aborts if any branch-scoped env var targets production or development. Both teardown and cleanup scripts now use this helper and additionally filter with (.target | sort) == ["preview"] in jq selects. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): address PR review — paginate Vercel API, fix delete count - Paginate vercel_list_preview_only_env_vars with limit=100 and cursor loop so cleanup handles hundreds of accumulated env vars - Track actual deleted count in teardown script (was reporting found count regardless of failures) - Allow + in branch name regex for dependabot-style branches Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): remove bulk cleanup from workflow, keep as local script only Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): cleanup script checks GitHub for open PRs before deleting Only deletes Vercel preview env vars for branches with no open PR. Branches with open PRs are skipped and reported. Requires gh CLI. Supports DRY_RUN=true to preview what would be deleted. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): auto-detect VERCEL_TOKEN from local Vercel CLI auth Checks macOS and Linux Vercel CLI auth paths so you only need to provide the three ID env vars when running locally. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): use temp files in Vercel env listing to avoid arg length limit The API project had 1,391+ env vars which exceeded shell argument limits when passed via jq --argjson. Switch to temp files for pagination accumulation. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): add retry with backoff for rate limiting, document cleanup script - Retry each DELETE up to 3 times with exponential backoff on 429s - Add header comment explaining the script's purpose and when to use it Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * ci: re-trigger CI (flaky SpiceDB E2E test) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): write SpiceDB schema before running E2E integration tests The CI workflow starts SpiceDB but never writes the schema, causing intermittent failures when tests try to create projects: "FAILED_PRECONDITION: object definition 'project' not found" Install zed CLI and write the schema after DB migrations, before tests. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): add GOPATH/bin to PATH for zed CLI Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Summary
envs_size_too_largeerror blocking all preview deployments (e.g. PR #2913 failure)teardown-verceljob that runs on PR close alongsideteardown-tier1, deleting the closed PR's branch-scoped env varscleanup-vercel-stale-env.shscript to purge stale env vars from closed PRs (with retry/backoff for rate limiting, open PR detection, and dry run support)Test plan
cleanup-vercel-stale-env.shlocally — verify stale env vars are deleted and open PR branches are skippedenvs_size_too_largeis resolvedteardown-verceljob runs and removes branch-scoped env vars🤖 Generated with Claude Code