Update triggers to use daisy chaining#2612
Conversation
🦋 Changeset detectedLatest commit: 7385402 The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
The daisy-chaining approach is a sound fix for the timing drift problem — replacing a long-lived loop with per-iteration workflow runs eliminates accumulated scheduling error. The adoption logic for crash recovery between parent/child is well thought through. One bug to fix around cancelled-vs-failed status, and a minor robustness note.
| if (lastError) { | ||
| await markFailedStep({ | ||
| tenantId, | ||
| projectId, | ||
| agentId, | ||
| scheduledTriggerId, | ||
| invocationId: invocation.id, | ||
| }); | ||
| } |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| if (resolvedRef) { | ||
| await withRef(manageDbPool, resolvedRef, async (db) => { | ||
| const workflow = await getScheduledWorkflowByTriggerId(db)({ | ||
| scopes, | ||
| scheduledTriggerId: params.scheduledTriggerId, | ||
| }); | ||
| if (workflow) { | ||
| await updateScheduledWorkflowRunId(db)({ | ||
| scopes, | ||
| scheduledWorkflowId: workflow.id, | ||
| workflowRunId: run.runId, | ||
| status: 'running', | ||
| }); | ||
| } | ||
| }); | ||
| } |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| "@inkeep/agents-api": patch | ||
| --- | ||
|
|
||
| daisy chain trigger |
There was a problem hiding this comment.
Per repo conventions, changeset messages should start with a verb in sentence case and describe the user-facing impact. Suggestion: "Fix scheduled trigger timing drift by replacing long-running loop with per-iteration workflow chaining"
| daisy chain trigger | |
| Fix scheduled trigger timing drift by replacing long-running loop with per-iteration workflow chaining |
|
Replaces the long-lived
|
There was a problem hiding this comment.
PR Review Summary
(7) Total Issues | Risk: Medium
🟠⚠️ Major (4) 🟠⚠️
🟠 1) scheduledTriggerRunner.ts:78-103 TOCTOU race between start() and DB update
Issue: The startNextIterationStep function performs a non-atomic start-then-update pattern: it calls start() to spawn the child workflow, then separately updates the DB with the child's runId. There is a window where the parent can crash after spawning the child but before persisting the child's ID.
Why: If the parent crashes in this window, the DB retains the parent's runId. The adoption logic handles this specific case, but the overall pattern introduces subtle crash-recovery semantics that depend on the workflow system's delivery guarantees. Additionally, if start() itself partially fails (workflow created but run.runId not returned), a ghost workflow could run without being tracked.
Fix: The pattern is acceptable given the adoption mechanism, but consider:
- Adding a compare-and-swap pattern in
updateScheduledWorkflowRunId(only update if currentworkflowRunIdequalsparentRunId) for stricter guarantees - Documenting the crash-recovery invariants and failure windows in a code comment
Refs:
🟠 2) scheduledTriggerRunner.ts:400-407 & 258-266 Missing error handling for startNextIterationStep
Issue: At both the end-of-workflow chaining path (line 400) and the cancelled-invocation chaining path (line 258), startNextIterationStep() is called without error handling. If this call fails (workflow service unavailable, DB error), the workflow terminates or returns misleading status.
Why:
- At line 400: If chaining fails, the function never reaches the
return { status: 'chained' }. The trigger chain silently breaks with no indication in logs. - At line 258: Returns
{ status: 'cancelled' }even if chaining failed, masking the chain break.
Fix: Wrap both calls in try-catch:
try {
await startNextIterationStep({...});
} catch (err) {
await logStep('Failed to chain to next iteration', {
scheduledTriggerId,
error: err instanceof Error ? err.message : String(err),
});
throw err; // Let workflow framework handle retry
}Refs:
🟠 3) scope: system Missing trace context propagation across daisy-chained runs
Issue: When start(scheduledTriggerRunnerWorkflow, [newPayload]) spawns a new workflow run, there is no explicit trace context propagation. Each chained run will start a new trace, fragmenting the observability for a trigger's execution history.
Why: For a cron trigger running every 30 minutes, after a day you'd have ~48 disconnected traces. Debugging issues across long-running triggers becomes difficult since traces are per-iteration rather than connected. The current logging includes parentRunId and childRunId but these are not added to trace context or span attributes.
Fix: Propagate a root trace ID or correlation ID through the payload and include it as a span attribute:
// In ScheduledTriggerRunnerPayload:
rootTraceId?: string | null;
// In startNextIterationStep:
const newPayload: ScheduledTriggerRunnerPayload = {
// ...existing fields
rootTraceId: params.rootTraceId ?? metadata.traceId, // Carry forward or use current
};Consider adding span links to connect parent→child traces for distributed tracing visualization.
Refs:
🟠 4) scope: system Critical test coverage gaps for new functionality
Issue: The PR introduces significant new behavior with zero test coverage:
startNextIterationStepfunction (lines 59-115) — the core chaining mechanism- Adoption logic in
checkTriggerEnabledStep(lines 204-232) — crash recovery - Cancelled invocation chaining (lines 257-266)
- Pre-chain enabled check (lines 384-398)
Why: These code paths handle critical reliability scenarios (crash recovery, chain continuation). Bugs here cause scheduled triggers to silently stop working with no error visibility. The adoption logic is especially subtle — incorrect behavior would cause child workflows spawned by crashed parents to be marked "superseded" and stop immediately.
Fix: Add tests for:
startNextIterationStep— verify payload propagation, DB update, behavior whenresolvedRefis null- Adoption path — test when
workflow.workflowRunId === parentRunId, verify DB update occurs - Superseded detection — test when
workflowRunIddiffers and noparentRunIdmatches - Cancelled cron invocation chains to next iteration; one-time does not
Refs:
Inline Comments:
- 🟠 Major:
.changeset/young-pans-see.md:5Changeset message doesn't follow AGENTS.md style - 🟠 Major:
scheduledTriggerRunner.ts:88Silent failure when resolvedRef is null - 🟠 Major:
scheduledTriggerRunner.ts:94Silent failure when workflow record not found - 🟠 Major:
scheduledTriggerSteps.ts:210-217Adoption DB update lacks error handling
💭 Consider (3) 💭
💭 1) scheduledTriggerRunner.ts:384-398 Pre-chain enabled check may be redundant
Issue: The preChainCheck verifies the trigger is still enabled before chaining, but the next iteration immediately calls checkTriggerEnabledStep as its first action. Two sequential enabled checks occur.
Why: The extra DB query adds latency to every successful cron iteration. If the cost of starting a workflow that immediately stops is negligible, the pre-chain check could be removed.
Fix: Evaluate whether the workflow start cost justifies the extra check. If not, remove the pre-chain check.
💭 2) scheduledTriggerSteps.ts:209-225 Adoption logic lacks observability metrics
Issue: The adoption pattern logs when it occurs but doesn't emit a counter metric.
Why: Frequent adoptions could indicate workflow framework instability. Without metrics, operators cannot monitor the health of the daisy-chaining mechanism or alert on elevated adoption rates.
Fix: Emit a counter metric (e.g., scheduled_trigger.workflow_adoption_count) when adoption occurs.
Inline Comments:
- 💭 Consider:
scheduledTriggerRunner.ts:1-11Restore explanatory comment about why daisy-chaining is used
💡 APPROVE WITH SUGGESTIONS
Summary: The architectural approach (daisy-chaining to fix accumulated replay delay) is sound and well-reasoned. However, several error handling gaps could cause the trigger chain to silently break in production failure scenarios. The adoption logic for crash recovery is thoughtful but needs error handling and test coverage. Consider addressing the silent failure paths in startNextIterationStep and adding tests for the new crash recovery code paths.
Discarded (5)
| Location | Issue | Reason Discarded |
|---|---|---|
scheduledTriggerRunner.ts:363-364 |
Retry jitter uses fixed 30% range without exponential backoff | Pre-existing pattern, not introduced by this PR |
system |
Daisy-chain pattern distinct from existing workflows | Informational only, appropriate divergence for use case |
scheduledTriggerSteps.ts:206 |
Orphan recovery interaction edge case | Too speculative, low confidence without production data |
scheduledTriggerRunner.ts:41 |
chainLogger missing scheduledTriggerId context | Very minor, current per-call logging is adequate |
scheduledTriggerRunner.ts:43-50 |
Payload field validation edge cases | Low severity, existing cron-parser handles invalid timestamps |
Reviewers (6)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-tests |
6 | 1 | 0 | 0 | 0 | 0 | 1 |
pr-review-sre |
7 | 2 | 1 | 0 | 0 | 0 | 2 |
pr-review-errors |
5 | 1 | 0 | 0 | 3 | 0 | 0 |
pr-review-architecture |
3 | 1 | 0 | 0 | 0 | 0 | 2 |
pr-review-precision |
3 | 0 | 1 | 0 | 1 | 0 | 0 |
pr-review-standards |
1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Total | 25 | 5 | 2 | 0 | 5 | 0 | 5 |
Note: Several findings were merged across reviewers (e.g., changeset quality flagged by both standards and precision reviewers, race condition flagged by both architecture and SRE reviewers).
| "@inkeep/agents-api": patch | ||
| --- | ||
|
|
||
| daisy chain trigger |
There was a problem hiding this comment.
🟠 MAJOR: Changeset message doesn't follow AGENTS.md style requirements
Issue: The message "daisy chain trigger" lacks sentence case, doesn't start with an action verb, and doesn't explain the user impact. Per AGENTS.md, changelog messages should be specific about what changed and why it matters to consumers.
Why: Users reading the changelog won't understand this fixes a production issue where cron triggers experienced growing execution delays (~11s initially → 2+ minutes after ~17 hours).
Fix:
| daisy chain trigger | |
| --- | |
| "@inkeep/agents-api": patch | |
| --- | |
| Fix accumulated latency in long-running scheduled triggers by using daisy-chain workflow pattern |
Refs:
| const ref = getProjectScopedRef(params.tenantId, params.projectId, 'main'); | ||
| const resolvedRef = await resolveRef(manageDbClient)(ref); | ||
|
|
||
| if (resolvedRef) { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| scopes, | ||
| scheduledTriggerId: params.scheduledTriggerId, | ||
| }); | ||
| if (workflow) { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| await withRef(manageDbPool, resolvedRef, async (db) => { | ||
| await updateScheduledWorkflowRunId(db)({ | ||
| scopes, | ||
| scheduledWorkflowId: workflow.id, | ||
| workflowRunId: params.runnerId, | ||
| status: 'running', | ||
| }); | ||
| }); |
There was a problem hiding this comment.
🟠 MAJOR: Adoption DB update lacks error handling
Issue: The updateScheduledWorkflowRunId call can fail (network error, DB constraint violation, connection pool exhaustion), but exceptions propagate unhandled.
Why: If adoption fails, the child workflow stops with 'superseded' on retry because the DB still shows the parent's runId. The cron trigger stops executing entirely until manually restarted. This defeats the purpose of the crash recovery logic.
Fix: Wrap in try-catch with appropriate handling:
if (params.parentRunId && workflow.workflowRunId === params.parentRunId) {
try {
await withRef(manageDbPool, resolvedRef, async (db) => {
await updateScheduledWorkflowRunId(db)({
scopes,
scheduledWorkflowId: workflow.id,
workflowRunId: params.runnerId,
status: 'running',
});
});
logger.info(
{ scheduledTriggerId: params.scheduledTriggerId, parentRunId: params.parentRunId, newRunnerId: params.runnerId },
'Child workflow adopted workflowRunId from parent'
);
} catch (err) {
logger.error(
{ scheduledTriggerId: params.scheduledTriggerId, parentRunId: params.parentRunId, error: err },
'Failed to adopt workflowRunId - workflow may be superseded incorrectly'
);
throw err; // Let workflow framework retry
}
}Refs:
| /** | ||
| * Workflow for running scheduled triggers. | ||
| * Workflow for running scheduled triggers using daisy-chaining. | ||
| * | ||
| * This workflow: | ||
| * 1. Gets or creates the next pending invocation | ||
| * 2. Sleeps until its scheduled time | ||
| * 3. Checks if the trigger is still enabled | ||
| * Each run executes a single iteration: | ||
| * 1. Checks if the trigger is still enabled (with adoption for chained runs) | ||
| * 2. Gets or creates the next pending invocation | ||
| * 3. Sleeps until its scheduled time | ||
| * 4. Executes the agent with retries | ||
| * 5. For cron triggers, loops back to step 1 | ||
| * 5. For cron triggers, starts a fresh workflow for the next iteration (daisy-chain) | ||
| * | ||
| */ |
There was a problem hiding this comment.
💭 Consider: Restore explanatory comment about why daisy-chaining is used
Issue: The original commit included a comment explaining why daisy-chaining solves the accumulated delay problem (bounded event log, avoiding replay delay). This was removed, leaving only the "what" without the "why".
Why: Future maintainers may not understand why daisy-chaining was chosen over the simpler loop approach. The deleted comment provided critical context about the root cause and solution mechanism.
Fix: Add explanation to the header comment:
| /** | |
| * Workflow for running scheduled triggers. | |
| * Workflow for running scheduled triggers using daisy-chaining. | |
| * | |
| * This workflow: | |
| * 1. Gets or creates the next pending invocation | |
| * 2. Sleeps until its scheduled time | |
| * 3. Checks if the trigger is still enabled | |
| * Each run executes a single iteration: | |
| * 1. Checks if the trigger is still enabled (with adoption for chained runs) | |
| * 2. Gets or creates the next pending invocation | |
| * 3. Sleeps until its scheduled time | |
| * 4. Executes the agent with retries | |
| * 5. For cron triggers, loops back to step 1 | |
| * 5. For cron triggers, starts a fresh workflow for the next iteration (daisy-chain) | |
| * | |
| */ | |
| /** | |
| * Workflow for running scheduled triggers using daisy-chaining. | |
| * | |
| * Why daisy-chaining instead of a while loop? | |
| * - Durable workflows maintain an event log for replay/recovery | |
| * - With a loop, each iteration adds events; replay cost grows linearly | |
| * - After ~17 hours of 30-min cron, delay grew from ~11s to ~2+ minutes | |
| * - Daisy-chaining keeps each run's event log bounded (~15 events) | |
| * | |
| * Each run executes a single iteration: | |
| * 1. Checks if the trigger is still enabled (with adoption for chained runs) | |
| * 2. Gets or creates the next pending invocation | |
| * 3. Sleeps until its scheduled time | |
| * 4. Executes the agent with retries | |
| * 5. For cron triggers, starts a fresh workflow for the next iteration (daisy-chain) | |
| * | |
| */ |
Refs:
There was a problem hiding this comment.
PR Review Summary
(2) Total Issues | Risk: Low
Delta Review Context
This is a delta review scoped to 1 commit (d604a5f4d comments) since the last automated review. The delta consists of ~77 insertions addressing prior review feedback on error handling and logging.
✅ Prior Feedback Addressed
The following issues from the prior review have been correctly addressed in this delta:
| Prior Issue | Resolution |
|---|---|
🟠 Major #2: Missing error handling for startNextIterationStep |
✅ Added try-catch with logging at both chaining paths (lines 278-294 and 430-445) |
🟠 Inline: Silent failure when resolvedRef is null |
✅ Added warning log (lines 111-122) |
| 🟠 Inline: Silent failure when workflow record not found | ✅ Added warning log (lines 101-108) |
| 🟠 Inline: Adoption DB update lacks error handling | ✅ Added try-catch with error logging (scheduledTriggerSteps.ts lines 210-230) |
🔴 pullfrog bug: lastError not cleared on cancellation |
✅ Added lastError = null when cancellation detected (line 318) |
| 🔧 Consistency: Logger name mismatch | ✅ Renamed chainLogger → logger |
All error handling implementations follow the correct pattern: try-catch → log with context → re-throw for workflow framework retry.
🕐 Pending Recommendations (2)
- 🟠
.changeset/young-pans-see.md:5Changeset message "daisy chain trigger" doesn't follow AGENTS.md style (raised by both pullfrog and prior claude review) - 💭
scheduledTriggerRunner.ts:11Consider restoring explanatory comment about why daisy-chaining is used
Prior Issues Outside Delta Scope
The following issues from the prior review remain unaddressed but were not expected in this delta (they require additional implementation beyond error handling fixes):
| Prior Issue | Status |
|---|---|
| 🟠 Major #3: Missing trace context propagation across daisy-chained runs | Open - Consider adding rootTraceId to payload |
| 🟠 Major #4: Critical test coverage gaps for new functionality | Open - No tests added for new crash recovery logic |
| 🟠 Major #1: TOCTOU race documentation | Open - The race is acceptable given adoption; documenting invariants is optional |
| 💭 Consider #1: Pre-chain enabled check redundancy | Open - Optional performance consideration |
| 💭 Consider #2: Adoption observability metrics | Open - Optional observability improvement |
✅ APPROVE
Summary: The delta correctly addresses all the error handling and logging gaps identified in the prior review. The implementations are consistent and follow best practices (try-catch → log → re-throw). The pullfrog bug with lastError not being cleared on cancellation has been fixed.
The remaining pending items (changeset message style, explanatory comment) are minor documentation concerns that don't block approval. The larger architectural items (test coverage, trace propagation) were flagged in the original review and remain relevant for follow-up but are not blocking for this fix.
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator (delta) |
6 | 0 | 0 | 0 | 0 | 2 | 0 |
| Total | 6 | 0 | 0 | 0 | 0 | 2 | 0 |
Note: Delta review focused on verifying prior feedback was correctly addressed. All 6 prior issues examined were resolved correctly.
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
Delta Review Context
This is a delta review scoped to changes since the last automated review (commit d604a5f4d → 73854023f). The delta consists solely of:
73854023f Merge branch 'main' into update-triggers
The PR's actual changes are unchanged — the three files modified by this PR (scheduledTriggerRunner.ts, scheduledTriggerSteps.ts, and the changeset) have no diffs since the prior review. The merge from main brings in unrelated changes (context compression improvements, Nango updates, GenericKeyValueInput fix) that do not affect the scheduled trigger workflow.
🕐 Pending Recommendations (2)
These items were raised in prior reviews and remain open — they are minor documentation concerns that don't block approval:
- 🟠
.changeset/young-pans-see.md:5Changeset message "daisy chain trigger" doesn't follow AGENTS.md style (raised by both pullfrog and prior claude review) - 💭
scheduledTriggerRunner.ts:11Consider restoring explanatory comment about why daisy-chaining is used
✅ APPROVE
Summary: The delta since the prior APPROVED review consists only of a merge from main with no impact on the PR's scope. The daisy-chaining implementation correctly addresses the accumulated latency bug, and all prior error handling feedback was addressed in earlier commits. The two pending documentation items (changeset message style, explanatory comment) are nice-to-haves but don't block merging this fix.
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator (delta) |
0 | 0 | 0 | 0 | 0 | 2 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
Note: No sub-reviewers dispatched — delta consists only of a merge from main with no changes to the PR's actual scope.
|
No documentation updates required. The daisy-chaining implementation is an internal infrastructure improvement that fixes timing drift in cron scheduled triggers. This change does not affect any public APIs, SDK methods, or user-facing behavior—triggers will simply run more accurately on schedule. |
Scheduled triggers using cron expressions experience accumulated start delay that grows linearly over time. A trigger scheduled every 30 minutes shows ~11s delay initially, degrading to ~2+ minutes after ~17 hours of operation.