Conversation
Adds a cleanup endpoint and data-access function to find and fail workflows that have been suspended longer than a configurable timeout (default 30 minutes). Prevents indefinite accumulation of orphaned approval requests. Closes PRD-6453
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
🦋 Changeset detectedLatest commit: f39bf1a The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
TL;DR — When an SSE stream drops while a durable tool-approval request is in flight, the workflow now survives the failure. Approval state is persisted into workflow execution metadata and exposed via a new polling endpoint so the UI can recover without relying on the original stream. Key changes
Summary | 7 files | 4 commits | base: Durable approval state in workflow metadata
The
Pending-approvals polling endpoint
The handler fetches the conversation (404 if missing), enforces end-user scoping by comparing
Graceful SSE failure handling
|
There was a problem hiding this comment.
Severity: High — The core data-access and test code are solid, but the route has a cross-tenant authorization gap that needs to be resolved before merging, and a minor consistency issue with the OpenAPIHono instantiation.
Claude Opus | 𝕏
There was a problem hiding this comment.
PR Review Summary
(8) Total Issues | Risk: High
🔴❗ Critical (2) ❗🔴
🔴 1) workflowExecutions.ts:68-83 Cross-tenant data exposure in getStaleWorkflowExecutions
Issue: The getStaleWorkflowExecutions function queries workflow executions across ALL tenants without filtering by tenantId. This is the only data-access function in this file that lacks tenant scoping — all sibling functions (getWorkflowExecution, getWorkflowExecutionByConversation, updateWorkflowExecutionStatus) correctly filter by tenantId and projectId.
Why: In a multi-tenant system, this allows any authenticated API key holder to discover and modify workflow executions belonging to other tenants. The route handler calls this function, gets back workflows from ALL tenants, then iterates and updates them using each execution's own tenantId — effectively allowing Tenant A's API key to fail Tenant B's workflows.
Fix: Add tenantId as a required parameter and include it in the WHERE clause. See inline comment for the exact code change.
Refs:
🔴 2) workflowCleanup.ts:74-82 Handler ignores authenticated execution context
Issue: The route handler never accesses c.get('executionContext') to retrieve the authenticated tenant/project context, despite being protected by runApiKeyAuth middleware. Every other route handler in /run/api/ extracts and uses this context.
Why: The authentication middleware gates access to the endpoint but provides no authorization — the caller can act on any tenant's data. The handler should extract tenantId from the context and pass it to the data-access function.
Fix: Add context extraction at the start of the handler. See inline comment for the pattern used by sibling handlers.
Refs:
Inline Comments:
- 🔴 Critical:
workflowExecutions.ts:68-83Missing tenant isolation in query (with fix) - 🔴 Critical:
workflowCleanup.ts:74-82Handler ignores execution context
🟠⚠️ Major (3) 🟠⚠️
🟠 1) workflowCleanup.ts:87-109 Partial failures silently succeed
Issue: The batch cleanup catches individual update failures (lines 102-107) and logs them, but returns 200 OK with only { cleanedUp } count. Callers have no visibility into failure rates.
Why: A cron job seeing cleanedUp: 10 might assume everything worked when 90% of the operation failed. This hides systemic issues like database problems or permission errors affecting specific tenants.
Fix: Use Promise.allSettled and return { cleanedUp, failed, total }. Update the response schema accordingly.
Refs:
🟠 2) workflowCleanup.ts:87 Unbounded concurrent database writes
Issue: Promise.all() fires up to 1000 parallel UPDATE queries without concurrency limiting. This can overwhelm the database connection pool.
Why: At scale, this pattern causes connection pool exhaustion, query timeouts, and degrades service for concurrent requests.
Fix: Use p-limit or process in chunks. See inline comment for example.
🟠 3) workflowCleanup.ts Missing HTTP endpoint integration tests
Issue: The route handler POST /workflows/cleanup-stale has no integration tests. The existing tests only cover the data-access layer functions.
Why: Without endpoint tests, request validation, authorization, and response format could regress undetected. Authorization bypass could allow DoS by repeatedly failing all suspended workflows.
Fix: Add integration tests for: (1) successful cleanup with correct count, (2) unauthenticated requests return 401, (3) invalid body returns 400, (4) empty result returns { cleanedUp: 0 }.
Inline Comments:
- 🟠 Major:
workflowCleanup.ts:87-109Partial failures return 200 OK without details - 🟠 Major:
workflowCleanup.ts:87Unbounded concurrent database writes
🟡 Minor (3) 🟡
🟡 1) workflowExecutions.ts Missing database index for cleanup query
Issue: The query filters on status = 'suspended' and updatedAt <= staleBefore, then orders by updatedAt. No index exists for (status, updatedAt).
Why: Without an index, this query performs a sequential scan. As the table grows, cleanup jobs become slower and may cause lock contention.
Fix: Add a partial index: CREATE INDEX workflow_executions_stale_cleanup_idx ON workflow_executions (updated_at) WHERE status = 'suspended';
🟡 2) workflowCleanup.ts:79-82 Database query error not explicitly handled
Issue: The getStaleWorkflowExecutions call has no try/catch. Database errors bubble up as unhandled exceptions despite a 500 response schema being defined.
Fix: Wrap in try/catch and return proper 500 response, or document that global middleware handles this.
🟡 3) workflowCleanup.ts:102-107 Error logging lacks tenant context
Issue: Error log includes executionId but omits tenantId and projectId, making incident investigation harder.
Fix: Include tenant context in the log object. See inline comment.
Inline Comments:
- 🟡 Minor:
workflowCleanup.ts:102-107Error logging lacks tenant correlation
🚫 REQUEST CHANGES
Summary: This PR has a critical security issue — the cleanup endpoint queries and modifies workflows across ALL tenants without tenant scoping. The data-access function getStaleWorkflowExecutions is the only one in its file that doesn't filter by tenantId, and the route handler doesn't extract tenant context from authentication. This must be fixed before merging. Additionally, consider adding concurrency limits and surfacing failure counts in the response.
Discarded (4)
| Location | Issue | Reason Discarded |
|---|---|---|
workflowCleanup.test.ts:128-138 |
Empty result test doesn't verify anything meaningful | Other tests already cover staleBefore filtering adequately |
workflowCleanup.ts:102-107 |
Catch block doesn't distinguish error types | Low confidence — depends on what errors updateWorkflowExecutionStatus actually throws; may return null for not-found |
workflowCleanup.ts:79-109 |
Race condition between concurrent cleanup calls | While technically valid, the operation is inherently idempotent (setting failed status again is harmless) and the UPDATE filters by id |
workflowCleanup.ts |
Missing observability/tracing | Nice-to-have but not blocking for this maintenance endpoint |
Reviewers (5)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
2 | 0 | 0 | 0 | 2 | 0 | 0 |
pr-review-security-iam |
4 | 2 | 0 | 0 | 0 | 0 | 2 |
pr-review-errors |
3 | 1 | 0 | 0 | 1 | 0 | 1 |
pr-review-tests |
3 | 1 | 0 | 0 | 0 | 0 | 2 |
pr-review-sre |
5 | 2 | 0 | 0 | 2 | 0 | 1 |
| Total | 17 | 6 | 0 | 0 | 5 | 0 | 6 |
Preview URLsUse these stable preview aliases for testing this PR:
These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find. Raw Vercel deployment URLs
|
Addresses review feedback: - Add tenantId to getStaleWorkflowExecutions query (critical: prevents cross-tenant data exposure) - Extract executionContext in route handler for proper authorization - Use Promise.allSettled to surface failure counts in response - Add concurrency limit to avoid connection pool exhaustion - Include tenant context in error logs - Add tenant isolation test case - Type OpenAPIHono with AppVariables for consistency Closes PRD-6453
|
Review feedback status: No action required. The single review thread on
|
|
Verified all review feedback has been addressed in commit Fixes confirmed:
|
|
No action needed. The single review thread on
|
|
All 8 review threads have been addressed by the PR author in commit Fixes confirmed:
|
|
All 5 review threads from the original review are already resolved by the PR author in commit
|
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Medium
Re-review scoped to delta — Prior critical issues (cross-tenant data exposure, missing failure counts, missing tenant context in logs) were correctly addressed in commit
2341408b9.
🟠⚠️ Major (1) 🟠⚠️
Inline Comments:
- 🟠 Major:
workflowCleanup.ts:105-127Staggered delay doesn't effectively limit concurrent DB connections
💭 Consider (1) 💭
💭 1) workflowCleanup.ts:97-101 No mutex for concurrent cleanup invocations
Issue: The cleanup endpoint has no guard against concurrent invocations from cron job retries or load balancer duplicates.
Why: Both requests would query the same stale workflows and attempt to update them, causing unnecessary database contention. However, since the UPDATE is idempotent (setting status: 'failed' twice has no adverse effect) and limit bounds the blast radius, this is a lower-severity concern.
Fix: Consider documenting that callers should ensure mutual exclusion, or add a distributed lock in a future iteration if concurrent invocations become problematic in production.
💡 APPROVE WITH SUGGESTIONS
Summary: The critical security issues from the prior review were correctly fixed — tenant scoping is now properly enforced at both the DAL and handler layers, and the response includes failure metrics. The remaining issue is the concurrency limiting pattern which doesn't truly bound concurrent database connections. This should be addressed before heavy production use, but doesn't block the PR. Nice work on the comprehensive tenant isolation test!
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-sre |
2 | 0 | 1 | 0 | 1 | 0 | 0 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-tests |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 2 | 0 | 1 | 0 | 1 | 0 | 0 |
Ito Test Report ✅13 test cases ran. 13 passed. Across 13 executed cases for POST /run/api/workflows/cleanup-stale, all passed (13/13, 0 failures), and code-first review found no plausible production defects while confirming core behavior: successful cleanup with correct aggregate counters, default parameter parity, zero-work handling, invalid-limit 400 validation, oldest-first selection with limit, exact stale-boundary inclusion, immediate rerun idempotency, metadata key preservation plus timeout markers, and strict tenant isolation. Adversarial and UX checks also held—missing/malformed auth and cross-tenant header forgery were rejected without unauthorized mutations (typically 404 Project not found in local non-production fallback), a 20-request concurrent flood produced only 200s with no 5xx and left no residual stale work, and docs/mobile back-forward deep-link navigation triggered cleanup only on intentional execute actions, with results gathered under deterministic local fixtures and test-only non-production bypasses where noted. ✅ Passed (13)Commit: Tell us how we did: Give Ito Feedback |
|
Review submitted. Leaping into action...
|
|
Addressed review feedback from @amikofalvy — updated Changes in 53aedc3:
|
amikofalvy
left a comment
There was a problem hiding this comment.
I have open question about why we need this endpoint and who calls it?
Moves the stale workflow cleanup from a run-domain API endpoint to a proper Vercel cron handler at /api/cron/cleanup-stale-workflows, following the existing cleanupStreamChunks pattern: - noAuth() + CRON_SECRET bearer validation - System-level (cross-tenant) cleanup - Runs every 30 minutes via vercel.json cron schedule - Removes the tenant-scoped run domain route
There was a problem hiding this comment.
PR Review Summary
(2) Total Issues | Risk: Medium
Re-review scoped to delta — Changes since commit
2341408b9: Converted to Vercel cron job pattern (addressing @amikofalvy's question about endpoint caller), removed tenant scoping from DAL (now correctly cross-tenant for system cron), usedscopes: TenantScopeConfigpattern (though import now unused).
🟠⚠️ Major (2) 🟠⚠️
Inline Comments:
- 🟠 Major:
workflowExecutions.ts:5UnusedTenantScopeConfigimport left over from refactoring - 🟠 Major:
cleanupStaleWorkflows.ts:69-89Staggered delay doesn't effectively limit concurrent DB connections (persists from prior review)
💭 Consider (1) 💭
💭 1) cleanupStaleWorkflows.ts:113-116 Missing correlation ID for cleanup runs
Issue: The completion log lacks a correlation ID, making it harder to trace cleanup operations across logs.
Why: For a cross-tenant cron processing up to 100 workflows every 30 minutes, operators cannot correlate log entries with other system events without a run ID.
Fix: Generate a cleanupRunId at handler start and include in all logs:
const cleanupRunId = crypto.randomUUID();
logger.info({ cleanupRunId, cleanedUp, failed, total }, 'Stale workflow cleanup completed via cron');Refs: Low priority — follows existing cleanupStreamChunks pattern which also lacks this.
✅ Security Review Passed
The cron auth pattern (noAuth() + CRON_SECRET bearer auth + constantTimeEqual) correctly matches the established cleanupStreamChunks.ts pattern. Cross-tenant query is now intentional and appropriate for a background cleanup job. The updateWorkflowExecutionStatus calls are still scoped to each execution's tenant/project.
💡 APPROVE WITH SUGGESTIONS
Summary: Good progress! The architectural shift to a Vercel cron job correctly resolves the prior cross-tenant security concerns — cross-tenant cleanup is now intentional for a privileged system cron. The cron auth pattern matches the established peer exactly. Two items to address: (1) remove the unused TenantScopeConfig import, and (2) the concurrency limiting pattern still doesn't truly bound concurrent DB connections (same issue from prior review). Neither blocks merge, but the concurrency pattern should be fixed before heavy production use.
Discarded (1)
| Location | Issue | Reason Discarded |
|---|---|---|
workflowCleanup.test.ts |
Removed tenant isolation test | Correctly removed — tenant isolation test no longer applies since cross-tenant is now by design for cron |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-sre |
2 | 0 | 1 | 0 | 1 | 0 | 0 |
pr-review-security-iam |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-standards |
1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Total | 3 | 0 | 1 | 0 | 2 | 0 | 0 |
Ito Test Report ✅14 test cases ran. 14 passed. The unified run passed completely with 14/14 test cases successful and 0 failures, indicating no code-supported product defects for the cleanup-stale-workflows endpoint under local deterministic test conditions. Key findings were that authorization is consistently fail-closed (missing/malformed/adversarial/session/mobile direct-hit requests return 401/431 without mutation), while valid bearer requests correctly enforce stale-only cleanup logic (inclusive 30-minute boundary, non-suspended exclusion, metadata-preserving timeout merge), process oldest-first in bounded batches of 100 with isolated partial-failure accounting, remain idempotent across repeats, and converge safely under concurrent load without 5xx errors. ✅ Passed (14)Commit: Tell us how we did: Give Ito Feedback |
|
A matching internal PR is ready in inkeep/agents-private#83 for canonical review and merge.
This comment will be updated as the bridge state changes. |
There was a problem hiding this comment.
PR Review Summary
(2) Total Issues | Risk: Low
Re-review scoped to delta — The code has not substantively changed since the last automated review (
2d5c46912344). Prior critical issues (cross-tenant data exposure, missing failure counts) were correctly addressed. The remaining prior findings (unused import, concurrency pattern) persist but are acknowledged by the author.
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
workflowCleanup.test.ts:61Incomplete assertion for staleBefore filter test
💭 Consider (1) 💭
💭 1) workflowCleanup.test.ts:139-179 Missing edge case test for update on non-existent workflow
Issue: The tests don't cover the case where updateWorkflowExecutionStatus is called on a workflow that doesn't exist (returns null). The handler counts any non-rejected promise as success.
Why: If a workflow is deleted between getStaleWorkflowExecutions and updateWorkflowExecutionStatus (race condition), the update returns null but is counted as "cleaned up". This could lead to misleading metrics, though not data corruption.
Fix: Add a test verifying updateWorkflowExecutionStatus returns null for non-existent workflow IDs. Consider whether the handler should distinguish between "updated" and "already deleted" cases in its success count.
🕐 Pending Recommendations (2)
- 🟠
workflowExecutions.ts:5UnusedTenantScopeConfigimport left over from refactoring — will fail lint - 🟠
cleanupStaleWorkflows.ts:69-89Staggered delay doesn't effectively limit concurrent DB connections (author acknowledged: sufficient for cron use case)
✅ Security Review Passed
The cron auth pattern (noAuth() + CRON_SECRET bearer auth + constantTimeEqual) correctly matches the established cleanupStreamChunks.ts pattern. Cross-tenant query is intentional and appropriate for a background cleanup job. The updateWorkflowExecutionStatus calls are properly scoped to each execution's tenant/project.
✅ APPROVE
Summary: This PR is ready to merge. The implementation correctly follows established patterns for Vercel cron jobs and addresses all critical issues from prior reviews. The remaining pending recommendations are minor (unused import, concurrency pattern acknowledged as acceptable for this use case). The test coverage gaps identified are minor edge cases that don't block the feature. Nice implementation of the stale workflow cleanup! 🎉
Discarded (0)
No findings discarded in this review cycle.
Reviewers (4)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-sre |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-tests |
2 | 0 | 1 | 0 | 1 | 0 | 0 |
pr-review-appsec |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 2 | 0 | 1 | 0 | 1 | 0 | 0 |
Note: Standards, SRE, and AppSec reviewers returned no new findings — all previously identified issues remain in Pending Recommendations.
| }); | ||
|
|
||
| const ids = result.map((r) => r.id); | ||
| expect(ids).toContain(staleExecution.id); |
There was a problem hiding this comment.
🟡 Minor: Incomplete assertion for staleBefore filter
Issue: This test creates both a stale and recent execution but only asserts that the stale execution IS in the results. It does not assert that the recent execution is NOT in the results.
Why: If the filter logic were broken and returned all suspended workflows regardless of time, this test would still pass. The assertion is incomplete for validating the staleBefore filter behavior.
Fix: Store the return value of the second createExecution call and add:
const recentExecution = await createExecution({
status: 'suspended',
updatedAt: recentTime,
});
// ... existing assertions ...
expect(ids).not.toContain(recentExecution.id);
Ito Test Report ✅14 test cases ran. 14 passed. The unified local non-production run passed all 14/14 test cases with zero failures, confirming the cleanup endpoint’s behavior matched code expectations throughout. Key findings were that auth hardening consistently returned 401 for missing/malformed/incorrect credentials and 200 only for the valid bearer token, cleanup logic correctly targeted only stale suspended workflows (adding timeout metadata while preserving existing metadata, respecting 100-item oldest-first limits, and remaining idempotent), and resilience checks showed isolated partial-failure reporting plus safe convergence under repeated, concurrent, method/path confusion, deep-link/history, and desktop/mobile access scenarios without unintended mutations. ✅ Passed (14)Commit: Tell us how we did: Give Ito Feedback |
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Low
Re-review scoped to delta — The commits since last automated review (
fbaff7c00..HEAD) are merge commits from main only. The PR code has not substantively changed since the previous APPROVE. All critical issues from earlier reviews have been addressed.
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
workflowExecutions.ts:5UnusedTenantScopeConfigimport will fail lint
🕐 Pending Recommendations (2)
- 🟡
workflowExecutions.ts:5UnusedTenantScopeConfigimport — addressed via inline comment above - 🟠
cleanupStaleWorkflows.ts:69-89Staggered delay doesn't effectively limit concurrent DB connections — author acknowledged as acceptable for cron use case
✅ Implementation Review
The implementation correctly follows established patterns:
| Aspect | Status |
|---|---|
| Cron auth pattern | ✅ Matches cleanupStreamChunks.ts (noAuth() + CRON_SECRET + constantTimeEqual) |
| Cross-tenant query | ✅ Intentional and appropriate for background cleanup job |
| Error handling | ✅ Promise.allSettled with cleanedUp/failed/total response |
| Tenant context in logs | ✅ Includes tenantId, projectId in error logs |
| OpenAPIHono typing | ✅ Correctly typed with AppVariables |
| Cron schedule | ✅ Registered in vercel.json (every 30 minutes) |
✅ APPROVE
Summary: This PR is ready to merge. The stale suspended workflow cleanup implementation correctly follows established patterns for Vercel cron jobs. All critical issues from prior reviews have been addressed. The only remaining item is removing the unused TenantScopeConfig import (minor, will fail lint). Nice work on the durable approvals cleanup! 🎉
Discarded (0)
No findings discarded in this review cycle.
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator |
1 | 0 | 0 | 0 | 1 | 2 | 0 |
| Total | 1 | 0 | 0 | 0 | 1 | 2 | 0 |
Note: This is a re-review with minimal delta (merge commits only). Sub-reviewers not dispatched as code unchanged.
| import type { AgentsRunDatabaseClient } from '../../db/runtime/runtime-client'; | ||
| import { workflowExecutions } from '../../db/runtime/runtime-schema'; | ||
| import type { WorkflowExecutionInsert, WorkflowExecutionSelect } from '../../types/index'; | ||
| import type { TenantScopeConfig } from '../../types/utility'; |
There was a problem hiding this comment.
🟡 Minor: Unused import
Issue: TenantScopeConfig is imported but not used anywhere in this file.
Why: This will fail Biome lint checks (noUnusedImports rule).
Fix: (1-click apply)
| import type { TenantScopeConfig } from '../../types/utility'; | |
| import type { WorkflowExecutionInsert, WorkflowExecutionSelect } from '../../types/index'; |
Refs:
Ito Test Report ✅10 test cases ran. 10 passed. All 10 test cases passed in a local non-production run, confirming the cleanup-stale-workflows cron route behaved correctly under functional, security, edge, and integration scenarios with no observed regressions. Key findings were that only stale suspended workflows (including boundary-time candidates) were transitioned to failed while fresh/completed rows stayed unchanged, metadata merges preserved existing keys while adding timeout markers, auth was strictly enforced with consistent 401s for missing/malformed bearer tokens, processing was capped to the oldest 100 items, parallel calls and partial row-level failures were handled safely with accurate accounting, and downstream execution/pending-approval endpoints reflected the cleanup transitions (including cross-tenant effects from the current global query). ✅ Passed (10)Commit: Tell us how we did: Give Ito Feedback |
|
This pull request has been automatically marked as stale because it has not had recent activity. If this PR is still relevant:
Thank you for your contributions! |
Summary
getStaleWorkflowExecutionsdata-access function to query suspended workflows older than a configurable thresholdPOST /run/api/workflows/cleanup-staleendpoint that finds stale suspended workflows and marks them as failed withapproval_timeoutmetadataTest plan
getStaleWorkflowExecutions(filtering by status, staleBefore, limit, ordering)updateWorkflowExecutionStatuscorrectly applies timeout metadatapnpm typecheckpassespnpm formatpassesCloses PRD-6453