Skip to content

feat: add stale suspended workflow cleanup for durable approvals#3063

Open
anubra266 wants to merge 7 commits intomainfrom
prd-6453
Open

feat: add stale suspended workflow cleanup for durable approvals#3063
anubra266 wants to merge 7 commits intomainfrom
prd-6453

Conversation

@anubra266
Copy link
Copy Markdown
Contributor

Summary

  • Adds getStaleWorkflowExecutions data-access function to query suspended workflows older than a configurable threshold
  • Adds POST /run/api/workflows/cleanup-stale endpoint that finds stale suspended workflows and marks them as failed with approval_timeout metadata
  • Includes integration tests for the data-access function and status update behavior

Test plan

  • Unit tests pass for getStaleWorkflowExecutions (filtering by status, staleBefore, limit, ordering)
  • Unit test verifies updateWorkflowExecutionStatus correctly applies timeout metadata
  • pnpm typecheck passes
  • pnpm format passes
  • Pre-commit hooks pass (biome, openapi snapshot, tests)

Closes PRD-6453

Adds a cleanup endpoint and data-access function to find and fail
workflows that have been suspended longer than a configurable timeout
(default 30 minutes). Prevents indefinite accumulation of orphaned
approval requests.

Closes PRD-6453
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Apr 13, 2026 11:28pm
agents-docs Ready Ready Preview, Comment Apr 13, 2026 11:28pm
agents-manage-ui Ready Ready Preview, Comment Apr 13, 2026 11:28pm

Request Review

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Apr 7, 2026

🦋 Changeset detected

Latest commit: f39bf1a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-core Patch
@inkeep/agents-api Patch
@inkeep/agents-manage-ui Patch
@inkeep/agents-cli Patch
@inkeep/agents-sdk Patch
@inkeep/agents-work-apps Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch
@inkeep/agents-email Patch
@inkeep/agents-mcp Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 7, 2026

TL;DR — When an SSE stream drops while a durable tool-approval request is in flight, the workflow now survives the failure. Approval state is persisted into workflow execution metadata and exposed via a new polling endpoint so the UI can recover without relying on the original stream.

Key changes

  • Persist pendingToolApproval in workflow execution metadata — when suspending for tool approval, the workflow writes toolCallId, toolName, args, and isDelegated into the execution's metadata so the approval request outlives any single SSE connection
  • Add GET /conversations/{id}/pending-approvals endpoint — a polling route that reads the latest suspended workflow execution for a conversation and returns the pending approval details (or hasPending: false), with end-user scoping to prevent cross-user reads
  • Gracefully handle SSE write failures in waitForToolApproval — wraps writeToolApprovalRequest in a try/catch so a broken stream logs a warning instead of crashing the workflow
  • Clear pendingToolApproval on resume — sets pendingToolApproval: null in metadata when the workflow transitions back to running, preventing stale approval data from lingering

Summary | 7 files | 4 commits | base: mainprd-6454


Durable approval state in workflow metadata

Before: Approval request details existed only in the SSE stream and in-memory context — if the stream dropped, the client had no way to discover the pending approval.
After: markWorkflowSuspendedStep writes a pendingToolApproval object into the workflow execution metadata, and markWorkflowResumingStep clears it on resume.

The pendingToolApproval shape (toolCallId, toolName, args, isDelegated) is written just before the Restate hook is created. In agentExecution.ts, the hookToolCallId computation is hoisted above the suspend call so it can be included in the metadata. On resume, metadata is patched with pendingToolApproval: null to avoid stale reads.

agentExecutionSteps.ts · agentExecution.ts


Pending-approvals polling endpoint

Before: No API existed to query whether a conversation had an outstanding approval request.
After: GET /run/v1/conversations/{conversationId}/pending-approvals returns { hasPending, approval? } by checking the most recent workflow execution's status and metadata.

The handler fetches the conversation (404 if missing), enforces end-user scoping by comparing conversation.userId against the caller's endUserId (following the same pattern as the /stream route), looks up the latest workflow execution via getWorkflowExecutionByConversation, and validates the pendingToolApproval metadata with Zod. The response includes workflowRunId and updatedAt so the client has enough context to submit the approval decision. Seven integration tests cover the matrix of workflow states.

Why updatedAt instead of suspendedAt? The field maps directly to the workflow execution row's updatedAt column. An earlier iteration used suspendedAt, but that implied a dedicated timestamp that doesn't exist — updatedAt is more honest about the source.

conversations.ts · pending-approvals.test.ts


Graceful SSE failure handling

Before: A failed writeToolApprovalRequest call would propagate the error up through the workflow, potentially aborting the approval flow.
After: The call is wrapped in a try/catch that logs a warning and continues — the approval is already persisted in metadata and recoverable via the polling endpoint.

tool-approval.ts

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: High — The core data-access and test code are solid, but the route has a cross-tenant authorization gap that needs to be resolved before merging, and a minor consistency issue with the OpenAPIHono instantiation.

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow run | Using Claude Opus𝕏

Comment thread packages/agents-core/src/data-access/runtime/workflowExecutions.ts
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Comment thread agents-api/src/__tests__/run/data/workflowCleanup.test.ts Outdated
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(8) Total Issues | Risk: High

🔴❗ Critical (2) ❗🔴

🔴 1) workflowExecutions.ts:68-83 Cross-tenant data exposure in getStaleWorkflowExecutions

Issue: The getStaleWorkflowExecutions function queries workflow executions across ALL tenants without filtering by tenantId. This is the only data-access function in this file that lacks tenant scoping — all sibling functions (getWorkflowExecution, getWorkflowExecutionByConversation, updateWorkflowExecutionStatus) correctly filter by tenantId and projectId.

Why: In a multi-tenant system, this allows any authenticated API key holder to discover and modify workflow executions belonging to other tenants. The route handler calls this function, gets back workflows from ALL tenants, then iterates and updates them using each execution's own tenantId — effectively allowing Tenant A's API key to fail Tenant B's workflows.

Fix: Add tenantId as a required parameter and include it in the WHERE clause. See inline comment for the exact code change.

Refs:

🔴 2) workflowCleanup.ts:74-82 Handler ignores authenticated execution context

Issue: The route handler never accesses c.get('executionContext') to retrieve the authenticated tenant/project context, despite being protected by runApiKeyAuth middleware. Every other route handler in /run/api/ extracts and uses this context.

Why: The authentication middleware gates access to the endpoint but provides no authorization — the caller can act on any tenant's data. The handler should extract tenantId from the context and pass it to the data-access function.

Fix: Add context extraction at the start of the handler. See inline comment for the pattern used by sibling handlers.

Refs:

Inline Comments:

  • 🔴 Critical: workflowExecutions.ts:68-83 Missing tenant isolation in query (with fix)
  • 🔴 Critical: workflowCleanup.ts:74-82 Handler ignores execution context

🟠⚠️ Major (3) 🟠⚠️

🟠 1) workflowCleanup.ts:87-109 Partial failures silently succeed

Issue: The batch cleanup catches individual update failures (lines 102-107) and logs them, but returns 200 OK with only { cleanedUp } count. Callers have no visibility into failure rates.

Why: A cron job seeing cleanedUp: 10 might assume everything worked when 90% of the operation failed. This hides systemic issues like database problems or permission errors affecting specific tenants.

Fix: Use Promise.allSettled and return { cleanedUp, failed, total }. Update the response schema accordingly.

Refs:

🟠 2) workflowCleanup.ts:87 Unbounded concurrent database writes

Issue: Promise.all() fires up to 1000 parallel UPDATE queries without concurrency limiting. This can overwhelm the database connection pool.

Why: At scale, this pattern causes connection pool exhaustion, query timeouts, and degrades service for concurrent requests.

Fix: Use p-limit or process in chunks. See inline comment for example.

🟠 3) workflowCleanup.ts Missing HTTP endpoint integration tests

Issue: The route handler POST /workflows/cleanup-stale has no integration tests. The existing tests only cover the data-access layer functions.

Why: Without endpoint tests, request validation, authorization, and response format could regress undetected. Authorization bypass could allow DoS by repeatedly failing all suspended workflows.

Fix: Add integration tests for: (1) successful cleanup with correct count, (2) unauthenticated requests return 401, (3) invalid body returns 400, (4) empty result returns { cleanedUp: 0 }.

Inline Comments:

  • 🟠 Major: workflowCleanup.ts:87-109 Partial failures return 200 OK without details
  • 🟠 Major: workflowCleanup.ts:87 Unbounded concurrent database writes

🟡 Minor (3) 🟡

🟡 1) workflowExecutions.ts Missing database index for cleanup query

Issue: The query filters on status = 'suspended' and updatedAt <= staleBefore, then orders by updatedAt. No index exists for (status, updatedAt).

Why: Without an index, this query performs a sequential scan. As the table grows, cleanup jobs become slower and may cause lock contention.

Fix: Add a partial index: CREATE INDEX workflow_executions_stale_cleanup_idx ON workflow_executions (updated_at) WHERE status = 'suspended';

🟡 2) workflowCleanup.ts:79-82 Database query error not explicitly handled

Issue: The getStaleWorkflowExecutions call has no try/catch. Database errors bubble up as unhandled exceptions despite a 500 response schema being defined.

Fix: Wrap in try/catch and return proper 500 response, or document that global middleware handles this.

🟡 3) workflowCleanup.ts:102-107 Error logging lacks tenant context

Issue: Error log includes executionId but omits tenantId and projectId, making incident investigation harder.

Fix: Include tenant context in the log object. See inline comment.

Inline Comments:

  • 🟡 Minor: workflowCleanup.ts:102-107 Error logging lacks tenant correlation

🚫 REQUEST CHANGES

Summary: This PR has a critical security issue — the cleanup endpoint queries and modifies workflows across ALL tenants without tenant scoping. The data-access function getStaleWorkflowExecutions is the only one in its file that doesn't filter by tenantId, and the route handler doesn't extract tenant context from authentication. This must be fixed before merging. Additionally, consider adding concurrency limits and surfacing failure counts in the response.

Discarded (4)
Location Issue Reason Discarded
workflowCleanup.test.ts:128-138 Empty result test doesn't verify anything meaningful Other tests already cover staleBefore filtering adequately
workflowCleanup.ts:102-107 Catch block doesn't distinguish error types Low confidence — depends on what errors updateWorkflowExecutionStatus actually throws; may return null for not-found
workflowCleanup.ts:79-109 Race condition between concurrent cleanup calls While technically valid, the operation is inherently idempotent (setting failed status again is harmless) and the UPDATE filters by id
workflowCleanup.ts Missing observability/tracing Nice-to-have but not blocking for this maintenance endpoint
Reviewers (5)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 2 0 0 0 2 0 0
pr-review-security-iam 4 2 0 0 0 0 2
pr-review-errors 3 1 0 0 1 0 1
pr-review-tests 3 1 0 0 0 0 2
pr-review-sre 5 2 0 0 2 0 1
Total 17 6 0 0 5 0 6

Comment thread packages/agents-core/src/data-access/runtime/workflowExecutions.ts
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
@github-actions github-actions Bot deleted a comment from claude Bot Apr 7, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 7, 2026

Preview URLs

Use these stable preview aliases for testing this PR:

These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find.

Raw Vercel deployment URLs

Addresses review feedback:
- Add tenantId to getStaleWorkflowExecutions query (critical: prevents
  cross-tenant data exposure)
- Extract executionContext in route handler for proper authorization
- Use Promise.allSettled to surface failure counts in response
- Add concurrency limit to avoid connection pool exhaustion
- Include tenant context in error logs
- Add tenant isolation test case
- Type OpenAPIHono with AppVariables for consistency

Closes PRD-6453
@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 7, 2026

Review feedback status: No action required.

The single review thread on workflowExecutions.ts:88 (tenant isolation concern) is already resolved. The author addressed it in commit 2341408 by adding tenantId as a required parameter to getStaleWorkflowExecutions and including it in the WHERE clause. The fix looks correct.

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 7, 2026

Verified all review feedback has been addressed in commit 2341408. All 10 threads are resolved — no further action needed.

Fixes confirmed:

  • getStaleWorkflowExecutions now requires and filters by tenantId
  • Handler extracts executionContext and passes tenantId to data-access
  • OpenAPIHono typed with AppVariables
  • Promise.allSettled with { cleanedUp, failed, total } response
  • Concurrency limiting (batches of 10)
  • Error logs include tenantId and projectId
  • Tenant isolation test with cross-tenant assertions

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 7, 2026

No action needed. The single review thread on workflowCleanup.ts:9 (typing OpenAPIHono with AppVariables) was already fixed by the PR author in commit 2341408 and the thread is resolved.

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 7, 2026

All 8 review threads have been addressed by the PR author in commit 2341408 and are already resolved. No further action needed.

Fixes confirmed:

  • Cross-tenant data exposure: tenantId scoping added to getStaleWorkflowExecutions
  • Handler context extraction: executionContext now used for tenant scoping
  • OpenAPIHono typed with AppVariables
  • Promise.allSettled with { cleanedUp, failed, total } response
  • Concurrency limiting (batches of 10)
  • Error logs include tenantId and projectId
  • Tenant isolation test added

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 7, 2026

All 5 review threads from the original review are already resolved by the PR author in commit 2341408. The triggering review comment confirmed the test replacement for tenant isolation. No further code changes are needed.

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: Medium

Re-review scoped to delta — Prior critical issues (cross-tenant data exposure, missing failure counts, missing tenant context in logs) were correctly addressed in commit 2341408b9.

🟠⚠️ Major (1) 🟠⚠️

Inline Comments:

  • 🟠 Major: workflowCleanup.ts:105-127 Staggered delay doesn't effectively limit concurrent DB connections

💭 Consider (1) 💭

💭 1) workflowCleanup.ts:97-101 No mutex for concurrent cleanup invocations

Issue: The cleanup endpoint has no guard against concurrent invocations from cron job retries or load balancer duplicates.

Why: Both requests would query the same stale workflows and attempt to update them, causing unnecessary database contention. However, since the UPDATE is idempotent (setting status: 'failed' twice has no adverse effect) and limit bounds the blast radius, this is a lower-severity concern.

Fix: Consider documenting that callers should ensure mutual exclusion, or add a distributed lock in a future iteration if concurrent invocations become problematic in production.


💡 APPROVE WITH SUGGESTIONS

Summary: The critical security issues from the prior review were correctly fixed — tenant scoping is now properly enforced at both the DAL and handler layers, and the response includes failure metrics. The remaining issue is the concurrency limiting pattern which doesn't truly bound concurrent database connections. This should be addressed before heavy production use, but doesn't block the PR. Nice work on the comprehensive tenant isolation test!

Reviewers (3)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-sre 2 0 1 0 1 0 0
pr-review-standards 0 0 0 0 0 0 0
pr-review-tests 0 0 0 0 0 0 0
Total 2 0 1 0 1 0 0

Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
@github-actions github-actions Bot deleted a comment from claude Bot Apr 7, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Apr 8, 2026

Ito Test Report ✅

13 test cases ran. 13 passed.

Across 13 executed cases for POST /run/api/workflows/cleanup-stale, all passed (13/13, 0 failures), and code-first review found no plausible production defects while confirming core behavior: successful cleanup with correct aggregate counters, default parameter parity, zero-work handling, invalid-limit 400 validation, oldest-first selection with limit, exact stale-boundary inclusion, immediate rerun idempotency, metadata key preservation plus timeout markers, and strict tenant isolation. Adversarial and UX checks also held—missing/malformed auth and cross-tenant header forgery were rejected without unauthorized mutations (typically 404 Project not found in local non-production fallback), a 20-request concurrent flood produced only 200s with no 5xx and left no residual stale work, and docs/mobile back-forward deep-link navigation triggered cleanup only on intentional execute actions, with results gathered under deterministic local fixtures and test-only non-production bypasses where noted.

✅ Passed (13)
Category Summary Screenshot
Adversarial Local dev fallback returned 404 Project not found for missing and malformed bearer requests, matching non-production auth caveat behavior. ADV-1
Adversarial 20 concurrent cleanup requests completed with 200 responses and no 5xx, then a follow-up cleanup reported zero remaining work. ADV-2
Adversarial Forged tenant/project headers did not mutate unauthorized tenant data; request returned 404 and seeded rows remained suspended. ADV-3
Adversarial Docs cleanup flow remained stable across desktop and mobile back/forward navigation, with only intentional execute actions observed in the verification run. ADV-4
Edge Stale boundary logic includes records at exactly staleBefore and excludes newer records. EDGE-1
Edge Empty stale-set scenario returned 200 with cleanedUp=0, failed=0, total=0. EDGE-2
Edge Invalid limit inputs (0, 1001, and string) were rejected with HTTP 400 validation errors. EDGE-4
Edge Immediate repeated cleanup calls are idempotent after stale items are already processed. EDGE-5
Logic Cleanup selects stale suspended executions oldest-first when limit=1 across sequential runs. LOGIC-1
Logic Re-execution confirmed cleanup preserves existing metadata keys and adds timeout markers for stale workflows in the scoped tenant. LOGIC-2
Logic Re-execution confirmed cleanup remains tenant-isolated, with no cross-tenant mutation before reciprocal scoped execution. LOGIC-3
Happy-path Cleanup returned 200 with expected aggregate counters and expected stale/fresh status outcomes after local fixture gating was addressed. ROUTE-1
Happy-path Empty-body request matched explicit default behavior and OpenAPI defaults for timeoutMinutes and limit. ROUTE-2

Commit: 2341408

View Full Run


Tell us how we did: Give Ito Feedback

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 8, 2026

Review submitted. Leaping into action...

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 8, 2026

Addressed review feedback from @amikofalvy — updated getStaleWorkflowExecutions to use the scopes: TenantScopeConfig interface pattern, matching the convention used across all other data-access functions.

Changes in 53aedc3:

  • workflowExecutions.ts: Changed tenantId: stringscopes: TenantScopeConfig, imported TenantScopeConfig from ../../types/utility
  • workflowCleanup.ts (route): Updated call site to pass scopes: { tenantId }
  • workflowCleanup.test.ts: Updated all 6 test call sites to use the new scopes parameter shape

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

Comment thread agents-api/src/domains/run/routes/workflowCleanup.ts Outdated
Copy link
Copy Markdown
Collaborator

@amikofalvy amikofalvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have open question about why we need this endpoint and who calls it?

anubra266 and others added 2 commits April 7, 2026 18:07
Moves the stale workflow cleanup from a run-domain API endpoint to a
proper Vercel cron handler at /api/cron/cleanup-stale-workflows,
following the existing cleanupStreamChunks pattern:
- noAuth() + CRON_SECRET bearer validation
- System-level (cross-tenant) cleanup
- Runs every 30 minutes via vercel.json cron schedule
- Removes the tenant-scoped run domain route
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(2) Total Issues | Risk: Medium

Re-review scoped to delta — Changes since commit 2341408b9: Converted to Vercel cron job pattern (addressing @amikofalvy's question about endpoint caller), removed tenant scoping from DAL (now correctly cross-tenant for system cron), used scopes: TenantScopeConfig pattern (though import now unused).

🟠⚠️ Major (2) 🟠⚠️

Inline Comments:

  • 🟠 Major: workflowExecutions.ts:5 Unused TenantScopeConfig import left over from refactoring
  • 🟠 Major: cleanupStaleWorkflows.ts:69-89 Staggered delay doesn't effectively limit concurrent DB connections (persists from prior review)

💭 Consider (1) 💭

💭 1) cleanupStaleWorkflows.ts:113-116 Missing correlation ID for cleanup runs

Issue: The completion log lacks a correlation ID, making it harder to trace cleanup operations across logs.

Why: For a cross-tenant cron processing up to 100 workflows every 30 minutes, operators cannot correlate log entries with other system events without a run ID.

Fix: Generate a cleanupRunId at handler start and include in all logs:

const cleanupRunId = crypto.randomUUID();
logger.info({ cleanupRunId, cleanedUp, failed, total }, 'Stale workflow cleanup completed via cron');

Refs: Low priority — follows existing cleanupStreamChunks pattern which also lacks this.


✅ Security Review Passed

The cron auth pattern (noAuth() + CRON_SECRET bearer auth + constantTimeEqual) correctly matches the established cleanupStreamChunks.ts pattern. Cross-tenant query is now intentional and appropriate for a background cleanup job. The updateWorkflowExecutionStatus calls are still scoped to each execution's tenant/project.


💡 APPROVE WITH SUGGESTIONS

Summary: Good progress! The architectural shift to a Vercel cron job correctly resolves the prior cross-tenant security concerns — cross-tenant cleanup is now intentional for a privileged system cron. The cron auth pattern matches the established peer exactly. Two items to address: (1) remove the unused TenantScopeConfig import, and (2) the concurrency limiting pattern still doesn't truly bound concurrent DB connections (same issue from prior review). Neither blocks merge, but the concurrency pattern should be fixed before heavy production use.

Discarded (1)
Location Issue Reason Discarded
workflowCleanup.test.ts Removed tenant isolation test Correctly removed — tenant isolation test no longer applies since cross-tenant is now by design for cron
Reviewers (3)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-sre 2 0 1 0 1 0 0
pr-review-security-iam 0 0 0 0 0 0 0
pr-review-standards 1 0 0 0 1 0 0
Total 3 0 1 0 2 0 0

Comment thread packages/agents-core/src/data-access/runtime/workflowExecutions.ts
Comment thread agents-api/src/routes/cleanupStaleWorkflows.ts
@github-actions github-actions Bot deleted a comment from claude Bot Apr 8, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Apr 8, 2026

Ito Test Report ✅

14 test cases ran. 14 passed.

The unified run passed completely with 14/14 test cases successful and 0 failures, indicating no code-supported product defects for the cleanup-stale-workflows endpoint under local deterministic test conditions. Key findings were that authorization is consistently fail-closed (missing/malformed/adversarial/session/mobile direct-hit requests return 401/431 without mutation), while valid bearer requests correctly enforce stale-only cleanup logic (inclusive 30-minute boundary, non-suspended exclusion, metadata-preserving timeout merge), process oldest-first in bounded batches of 100 with isolated partial-failure accounting, remain idempotent across repeats, and converge safely under concurrent load without 5xx errors.

✅ Passed (14)
Category Summary Screenshot
Adversarial After reseeding 30 stale rows, 10 parallel authorized cleanup requests produced no 5xx and converged to stable terminal state (all non-fault stale rows failed, no oscillation observed). ADV-1
Adversarial Adversarial bearer payloads failed closed (401/431) with no leakage and no unauthorized cleanup mutation. ADV-2
Adversarial With an active Manage UI session, direct API navigation without bearer auth still returned 401 Unauthorized. ADV-3
Adversarial In mobile viewport, direct-hit/back/reload loops on the cleanup endpoint stayed unauthorized and normal UI navigation remained usable. ADV-4
Edge Authorized cleanup endpoint returned HTTP 200 with ok=true and cleanedUp=0 when stale suspended workflow count was zero before and after execution. EDGE-1
Edge Injected forceFail row remained suspended while other eligible stale rows transitioned to failed; response counters tracked partial failure without aborting the batch. EDGE-2
Edge Immediate repeat cleanup processed remaining backlog, and subsequent call reached cleanedUp=0 with only the intentionally failing fault row left pending failure attempts. EDGE-3
Logic Boundary behavior validated as expected after controlled re-check; no product defect confirmed. LOGIC-1
Logic Non-suspended rows remained unchanged after cleanup, confirming status filtering. LOGIC-2
Logic With 121 stale seeded rows, first authorized cleanup processed exactly 100 rows (cleanedUp=99, failed=1 due to injected fault), matching oldest-first bounded batch behavior. LOGIC-3
Logic Existing metadata was preserved while timeout fields were appended on stale-row conversion. LOGIC-4
Happy-path Authorized cleanup on seeded data returned 200 with counters and only stale suspended workflow was transitioned to failed while fresh suspended workflow remained unchanged. ROUTE-1
Happy-path No-header request returned 401 Unauthorized and seeded stale marker row remained suspended (no mutation). ROUTE-2
Happy-path All malformed Authorization formats returned 401 Unauthorized and did not mutate seeded workflow row. ROUTE-3

Commit: 5d5ad5a

View Full Run


Tell us how we did: Give Ito Feedback

@anubra266 anubra266 requested a review from amikofalvy April 9, 2026 16:57
@github-actions
Copy link
Copy Markdown
Contributor

A matching internal PR is ready in inkeep/agents-private#83 for canonical review and merge.

  • Original author attribution is preserved as @anubra266
  • The internal PR is the authoritative merge surface
  • The public repo will pick up the merged change through the normal mirror sync

This comment will be updated as the bridge state changes.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(2) Total Issues | Risk: Low

Re-review scoped to delta — The code has not substantively changed since the last automated review (2d5c46912344). Prior critical issues (cross-tenant data exposure, missing failure counts) were correctly addressed. The remaining prior findings (unused import, concurrency pattern) persist but are acknowledged by the author.

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: workflowCleanup.test.ts:61 Incomplete assertion for staleBefore filter test

💭 Consider (1) 💭

💭 1) workflowCleanup.test.ts:139-179 Missing edge case test for update on non-existent workflow

Issue: The tests don't cover the case where updateWorkflowExecutionStatus is called on a workflow that doesn't exist (returns null). The handler counts any non-rejected promise as success.

Why: If a workflow is deleted between getStaleWorkflowExecutions and updateWorkflowExecutionStatus (race condition), the update returns null but is counted as "cleaned up". This could lead to misleading metrics, though not data corruption.

Fix: Add a test verifying updateWorkflowExecutionStatus returns null for non-existent workflow IDs. Consider whether the handler should distinguish between "updated" and "already deleted" cases in its success count.

🕐 Pending Recommendations (2)


✅ Security Review Passed

The cron auth pattern (noAuth() + CRON_SECRET bearer auth + constantTimeEqual) correctly matches the established cleanupStreamChunks.ts pattern. Cross-tenant query is intentional and appropriate for a background cleanup job. The updateWorkflowExecutionStatus calls are properly scoped to each execution's tenant/project.


✅ APPROVE

Summary: This PR is ready to merge. The implementation correctly follows established patterns for Vercel cron jobs and addresses all critical issues from prior reviews. The remaining pending recommendations are minor (unused import, concurrency pattern acknowledged as acceptable for this use case). The test coverage gaps identified are minor edge cases that don't block the feature. Nice implementation of the stale workflow cleanup! 🎉

Discarded (0)

No findings discarded in this review cycle.

Reviewers (4)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 0 0 0 0 0 0 0
pr-review-sre 0 0 0 0 0 0 0
pr-review-tests 2 0 1 0 1 0 0
pr-review-appsec 0 0 0 0 0 0 0
Total 2 0 1 0 1 0 0

Note: Standards, SRE, and AppSec reviewers returned no new findings — all previously identified issues remain in Pending Recommendations.

});

const ids = result.map((r) => r.id);
expect(ids).toContain(staleExecution.id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Incomplete assertion for staleBefore filter

Issue: This test creates both a stale and recent execution but only asserts that the stale execution IS in the results. It does not assert that the recent execution is NOT in the results.

Why: If the filter logic were broken and returned all suspended workflows regardless of time, this test would still pass. The assertion is incomplete for validating the staleBefore filter behavior.

Fix: Store the return value of the second createExecution call and add:

const recentExecution = await createExecution({
  status: 'suspended',
  updatedAt: recentTime,
});
// ... existing assertions ...
expect(ids).not.toContain(recentExecution.id);

@itoqa
Copy link
Copy Markdown

itoqa Bot commented Apr 14, 2026

Ito Test Report ✅

14 test cases ran. 14 passed.

The unified local non-production run passed all 14/14 test cases with zero failures, confirming the cleanup endpoint’s behavior matched code expectations throughout. Key findings were that auth hardening consistently returned 401 for missing/malformed/incorrect credentials and 200 only for the valid bearer token, cleanup logic correctly targeted only stale suspended workflows (adding timeout metadata while preserving existing metadata, respecting 100-item oldest-first limits, and remaining idempotent), and resilience checks showed isolated partial-failure reporting plus safe convergence under repeated, concurrent, method/path confusion, deep-link/history, and desktop/mobile access scenarios without unintended mutations.

✅ Passed (14)
Category Summary Screenshot
Adversarial 20 parallel authorized cleanup calls all returned 200 and converged to failed stale rows with no suspended leftovers. ADV-1
Adversarial Invalid methods/wrong path stayed non-mutating, unauthorized deep-link stayed JSON 401, and authorized desktop/mobile calls returned valid cleanup JSON. ADV-3
Adversarial Deep-link history loops did not duplicate side effects: first authorized call cleaned eligible stale rows and subsequent looped calls reported no additional cleanup. ADV-4
Edge Existing metadata keys were preserved while timeout fields were appended. EDGE-1
Edge Authorized cleanup reported isolated partial failure (failed=1) while other stale rows completed and concurrent requests still converged. EDGE-2
Edge Repeated successful runs converged deterministically (100, then 20, then 0) with no reprocessing after convergence. EDGE-3
Logic Authorized cleanup mutated only stale suspended rows and appended timeout metadata. LOGIC-1
Logic Stale non-suspended records remained unchanged after cleanup. LOGIC-2
Logic Recently updated suspended record remained suspended and unmodified. LOGIC-3
Logic First cleanup processed the oldest 100 stale suspended rows, second processed the remaining 20, and third was idempotent with zero additional cleanup. LOGIC-4
Mobile Mobile viewport preserved API semantics: unauthorized access stayed JSON unauthorized and authorized access returned parseable cleanup JSON with one-time mutation. MOBILE-1
Happy-path Missing Authorization was rejected (401), while a valid bearer token reached the cleanup handler and returned a successful no-op response. ROUTE-1
Happy-path Malformed and incorrect bearer tokens are securely rejected (401 from route logic; malformed newline payload rejected upstream as bad request). ROUTE-2
Happy-path Authorized cleanup succeeded with a no-op result when no stale workflows were eligible, and unauthorized probes still returned 401. ROUTE-3

Commit: fbaff7c

View Full Run


Tell us how we did: Give Ito Feedback

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: Low

Re-review scoped to delta — The commits since last automated review (fbaff7c00..HEAD) are merge commits from main only. The PR code has not substantively changed since the previous APPROVE. All critical issues from earlier reviews have been addressed.

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: workflowExecutions.ts:5 Unused TenantScopeConfig import will fail lint

🕐 Pending Recommendations (2)


✅ Implementation Review

The implementation correctly follows established patterns:

Aspect Status
Cron auth pattern ✅ Matches cleanupStreamChunks.ts (noAuth() + CRON_SECRET + constantTimeEqual)
Cross-tenant query ✅ Intentional and appropriate for background cleanup job
Error handling Promise.allSettled with cleanedUp/failed/total response
Tenant context in logs ✅ Includes tenantId, projectId in error logs
OpenAPIHono typing ✅ Correctly typed with AppVariables
Cron schedule ✅ Registered in vercel.json (every 30 minutes)

✅ APPROVE

Summary: This PR is ready to merge. The stale suspended workflow cleanup implementation correctly follows established patterns for Vercel cron jobs. All critical issues from prior reviews have been addressed. The only remaining item is removing the unused TenantScopeConfig import (minor, will fail lint). Nice work on the durable approvals cleanup! 🎉

Discarded (0)

No findings discarded in this review cycle.

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator 1 0 0 0 1 2 0
Total 1 0 0 0 1 2 0

Note: This is a re-review with minimal delta (merge commits only). Sub-reviewers not dispatched as code unchanged.

import type { AgentsRunDatabaseClient } from '../../db/runtime/runtime-client';
import { workflowExecutions } from '../../db/runtime/runtime-schema';
import type { WorkflowExecutionInsert, WorkflowExecutionSelect } from '../../types/index';
import type { TenantScopeConfig } from '../../types/utility';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Unused import

Issue: TenantScopeConfig is imported but not used anywhere in this file.

Why: This will fail Biome lint checks (noUnusedImports rule).

Fix: (1-click apply)

Suggested change
import type { TenantScopeConfig } from '../../types/utility';
import type { WorkflowExecutionInsert, WorkflowExecutionSelect } from '../../types/index';

Refs:

@github-actions github-actions Bot deleted a comment from claude Bot Apr 15, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Apr 15, 2026

Ito Test Report ✅

10 test cases ran. 10 passed.

All 10 test cases passed in a local non-production run, confirming the cleanup-stale-workflows cron route behaved correctly under functional, security, edge, and integration scenarios with no observed regressions. Key findings were that only stale suspended workflows (including boundary-time candidates) were transitioned to failed while fresh/completed rows stayed unchanged, metadata merges preserved existing keys while adding timeout markers, auth was strictly enforced with consistent 401s for missing/malformed bearer tokens, processing was capped to the oldest 100 items, parallel calls and partial row-level failures were handled safely with accurate accounting, and downstream execution/pending-approval endpoints reflected the cleanup transitions (including cross-tenant effects from the current global query).

✅ Passed (10)
Category Summary Screenshot
Adversarial Anonymous request to cleanup endpoint returned HTTP 401 with Unauthorized body on localhost non-production target. ADV-1
Adversarial All malformed/incorrect Authorization bearer variants were denied with HTTP 401 and no 5xx responses. ADV-2
Adversarial Five parallel cleanup calls completed without state corruption; all 20 rows ended in failed state. ADV-3
Adversarial Cross-tenant stale executions transitioned from suspended to failed after one authorized cleanup call. ADV-4
Edge Cleanup processed exactly 100 oldest stale rows and left rows 101-130 suspended. EDGE-2
Logic Cleanup preserved existing metadata keys while adding timeout failure metadata on stale suspended execution. LOGIC-1
Logic Injected one row-level failure and confirmed cleanup still updated other rows with failed=1 reported. LOGIC-2
Happy-path Authorized cleanup correctly transitioned stale and boundary suspended workflows while leaving non-target rows unchanged. ROUTE-1
Happy-path Execution endpoint reflected suspended -> failed transition and updatedAt advanced after cleanup. ROUTE-2
Happy-path Pending approvals endpoint changed from hasPending=true to hasPending=false after cleanup. ROUTE-3

Commit: f39bf1a

View Full Run


Tell us how we did: Give Ito Feedback

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been automatically marked as stale because it has not had recent activity.
It will be closed in 7 days if no further activity occurs.

If this PR is still relevant:

  • Rebase it on the latest main branch
  • Add a comment explaining its current status
  • Request a review if it's ready

Thank you for your contributions!

@github-actions github-actions Bot added the stale label Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants