
fix workflow system so scheduled triggers run on latest code#2706

Closed
shagun-singh-inkeep wants to merge 23 commits into main from
feat/manage-table-cron-dispatcher-v2

Conversation

@shagun-singh-inkeep
Collaborator

No description provided.

@changeset-bot

changeset-bot Bot commented Mar 16, 2026

🦋 Changeset detected

Latest commit: 35bb793

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-core Patch
@inkeep/agents-api Patch
@inkeep/agents-manage-ui Patch
@inkeep/agents-cli Patch
@inkeep/agents-sdk Patch
@inkeep/agents-work-apps Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch
@inkeep/agents-email Patch
@inkeep/agents-mcp Patch



@vercel

vercel Bot commented Mar 16, 2026

The latest updates on your projects.

Project | Deployment | Actions | Updated (UTC)
agents-api | Ready | Preview, Comment | Mar 24, 2026 9:14pm
agents-docs | Ready | Preview, Comment | Mar 24, 2026 9:14pm
agents-manage-ui | Error | (none) | Mar 24, 2026 9:14pm


@pullfrog
Contributor

pullfrog Bot commented Mar 16, 2026

TL;DR — Replaces the per-trigger daisy-chaining workflow architecture with a single centralized scheduler workflow that polls every 60 seconds, dispatches due triggers as independent one-shot workflows, and stores next_run_at directly on the scheduled_triggers manage table. A post-deploy CI step restarts the scheduler on the latest Vercel deployment so scheduled triggers always run current code.

Key changes

  • schedulerWorkflow — New long-lived singleton workflow that ticks every 60 s, queries all projects for due triggers, and dispatches one-shot execution workflows.
  • triggerDispatcher — New centralized dispatcher that scans next_run_at across all project branches, atomically advances the timestamp (with rollback on failure), and starts scheduledTriggerRunnerWorkflow per trigger.
  • next_run_at on scheduled_triggers — New column on the manage schema that replaces the old scheduled_workflows lookup; routes compute it on create/update.
  • scheduler_state table — New runtime singleton table tracking the current scheduler workflow run ID and deployment ID for supersession detection.
  • scheduledTriggerRunnerWorkflow simplification — Reduced from a daisy-chaining loop (sleep → execute → chain) to a stateless one-shot (check → create invocation → execute with retries → done).
  • restartScheduler deploy hook — New POST /api/deploy/restart-scheduler endpoint called by CI after Vercel deploy to start a fresh scheduler workflow on the new deployment.
  • ScheduledTriggerService gutted — Removed onTriggerCreated, onTriggerDeleted, startScheduledTriggerWorkflow, restartScheduledTriggerWorkflow, and signalStopScheduledTriggerWorkflow; only onTriggerUpdated (for invocation cancellation on re-enable/reschedule) survives.
  • Route handlers — agentFull, projectFull, and scheduledTriggers routes no longer call onTriggerCreated/onTriggerDeleted; create/update routes set nextRunAt inline via computeNextRunAt.
  • Data reconciliation — check function simplified from workflow-status verification to a nextRunAt-presence check on enabled triggers; onCreated/onDeleted handlers removed.

Summary | 34 files | 9 commits | base: main ← feat/manage-table-cron-dispatcher-v2


Centralized scheduler workflow and trigger dispatcher

Before: Each scheduled trigger got its own long-lived workflow that slept until the next cron tick, executed the agent, then daisy-chained to a new workflow for the next iteration. Workflow liveness was tracked in a scheduled_workflows manage table, with adoption and supersession logic.
After: A single schedulerWorkflow runs as a global singleton, ticks every 60 seconds, and delegates to dispatchDueTriggers() which queries next_run_at <= now() across all project branches and starts independent one-shot scheduledTriggerRunnerWorkflow instances.

The dispatcher atomically advances next_run_at before starting the workflow. If the workflow start fails, it rolls back the timestamp. One-time triggers (runAt without cronExpression) set enabled = false and next_run_at = null after dispatch.

How does supersession work on deploy? The scheduler registers its workflow run ID in a `scheduler_state` singleton row. On each tick it checks whether it's still the active scheduler. After a Vercel deploy, CI calls `POST /api/deploy/restart-scheduler` which starts a new scheduler workflow and updates `scheduler_state` — the old scheduler sees the mismatch on its next tick and exits gracefully.
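The per-tick supersession check described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: `SchedulerState`, `GetState`, and `isActiveScheduler` are hypothetical names standing in for the real `scheduler_state` DAL and workflow context.

```typescript
// Hypothetical sketch of the per-tick supersession check. The state
// accessor stands in for the real scheduler_state singleton table.
interface SchedulerState {
  currentRunId: string;
}

type GetState = () => Promise<SchedulerState | null>;

// Returns true when this run is still the registered scheduler; false
// means a newer deploy superseded us and this loop should exit gracefully.
async function isActiveScheduler(myRunId: string, getState: GetState): Promise<boolean> {
  const state = await getState();
  return state?.currentRunId === myRunId;
}

// Example: after a deploy, run-B registered itself, so run-A stands down.
async function demo() {
  const getState: GetState = async () => ({ currentRunId: 'run-B' });
  console.log(await isActiveScheduler('run-A', getState)); // false — superseded
  console.log(await isActiveScheduler('run-B', getState)); // true — still active
}
demo();
```

The key property is that the old scheduler never needs a signal: it discovers supersession passively on its next 60-second tick.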

schedulerWorkflow.ts · triggerDispatcher.ts · SchedulerService.ts · schedulerSteps.ts


next_run_at column and manage-table driven scheduling

Before: Scheduling state lived in a separate scheduled_workflows table linking triggers to workflow run IDs. The workflow itself computed the next execution time internally via calculateNextExecutionStep and slept until then.
After: next_run_at is a first-class timestamptz column on scheduled_triggers. Route handlers compute it on create/update using computeNextRunAt(), and the dispatcher advances it after each dispatch.

The computeNextRunAt utility handles both cron and one-time triggers, accepting an optional lastScheduledFor to base the next tick relative to the previous scheduled time rather than wall-clock time.
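The contract described above can be sketched as below. This is a dependency-free approximation, not the PR's implementation: the real utility parses cron expressions itself, whereas this sketch delegates the cron branch to an injected `nextCron` callback, and all names are assumptions.

```typescript
// Illustrative sketch of the computeNextRunAt contract. The cron branch
// is delegated to an injected callback so the sketch stays self-contained.
interface NextRunInput {
  cronExpression?: string | null;
  cronTimezone?: string | null;
  runAt?: string | null;            // one-time triggers
  lastScheduledFor?: string | null; // base the next cron tick on the prior slot
}

type NextCron = (expr: string, tz: string | undefined, from: Date) => Date;

function computeNextRunAt(input: NextRunInput, nextCron?: NextCron): string | null {
  if (input.cronExpression && nextCron) {
    // Prefer the previous scheduled time over wall-clock "now" so slow
    // dispatch cycles don't drift the schedule.
    const from = input.lastScheduledFor ? new Date(input.lastScheduledFor) : new Date();
    return nextCron(input.cronExpression, input.cronTimezone ?? undefined, from).toISOString();
  }
  if (input.runAt) {
    // One-time trigger: runs once at runAt, only if still in the future.
    const runAt = new Date(input.runAt);
    return runAt.getTime() > Date.now() ? runAt.toISOString() : null;
  }
  return null; // nothing to schedule
}

// A one-time trigger whose runAt has passed yields null.
console.log(computeNextRunAt({ runAt: '2020-01-01T00:00:00.000Z' })); // null
```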

Migration | DB | Change
0013_lumpy_apocalypse.sql | manage (Doltgres) | ALTER TABLE scheduled_triggers ADD COLUMN next_run_at timestamptz
0023_broad_sharon_ventura.sql | runtime (Postgres) | CREATE TABLE scheduler_state (singleton)

manage-schema.ts · computeNextRunAt.ts · scheduledTriggers.ts (DAL)


One-shot scheduledTriggerRunnerWorkflow

Before: The runner was a long-lived daisy-chaining workflow: check trigger → compute next time → create invocation → sleep → post-sleep re-check → execute → chain to next iteration. It handled adoption, supersession, and cancellation mid-sleep.
After: The runner is a stateless one-shot: check trigger enabled → create idempotent invocation → retry loop with cancellation checks → mark completed or failed → exit.

All scheduling concerns (when to run, what's due) are now the dispatcher's responsibility. The runner only needs a TriggerPayload with scheduledFor and focuses purely on execution.
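The invocation-level idempotency the one-shot runner relies on can be sketched as below. The key format `sched_{triggerId}_{scheduledFor}` matches what the review later describes; the in-memory `Map` is a stand-in for the real invocation store, and the function names are assumptions.

```typescript
// Sketch of the idempotent invocation guard: a re-dispatch of the same
// trigger/slot (e.g. after a failed next_run_at advance) is a no-op.
const invocations = new Map<string, { triggerId: string; scheduledFor: string }>();

function invocationKey(triggerId: string, scheduledFor: string): string {
  return `sched_${triggerId}_${scheduledFor}`;
}

// Returns true if this call created the invocation, false if a duplicate
// dispatch already created it for the same scheduled slot.
function createInvocationIdempotent(triggerId: string, scheduledFor: string): boolean {
  const key = invocationKey(triggerId, scheduledFor);
  if (invocations.has(key)) return false;
  invocations.set(key, { triggerId, scheduledFor });
  return true;
}

console.log(createInvocationIdempotent('trig-1', '2026-03-16T09:00:00Z')); // true
console.log(createInvocationIdempotent('trig-1', '2026-03-16T09:00:00Z')); // false (duplicate)
```

Because the key includes `scheduledFor`, the same trigger can still execute at its next slot; only same-slot duplicates are suppressed.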

scheduledTriggerRunner.ts · scheduledTriggerSteps.ts


Deploy restart endpoint and CI integration

Before: No mechanism to move scheduled trigger workflows to a new deployment; they continued running on the old instance until they naturally chained.
After: POST /api/deploy/restart-scheduler starts a fresh scheduler workflow, superseding the old one. The Vercel production workflow calls this after promote + deploy.

The endpoint uses constant-time comparison of INKEEP_AGENTS_RUN_API_BYPASS_SECRET for auth and is registered with noAuth() / security: [] in the OpenAPI spec.

restartScheduler.ts · vercel-production.yml

Pullfrog | View workflow run | Using Claude Code | Triggered by Pullfrog


@pullfrog pullfrog Bot left a comment


Significant architectural improvement — moving from per-trigger daisy-chaining workflows to a centralized scheduler with next_run_at on the manage table is a cleaner model. The supersession mechanism and one-shot runner design are well thought out.

There are a few issues to address before merging, roughly in priority order:

  1. Reconciliation check is broken — listEnabledScheduledTriggers only selects { id, name }, so (t as any).nextRunAt is always undefined and every enabled trigger will be flagged as missing.
  2. agentFull and projectFull create paths don't set nextRunAt — triggers created through these bulk routes will sit dormant until reconciliation detects them.
  3. Crash between advance-and-dispatch loses one-time triggers — if the process dies after advanceScheduledTriggerNextRunAt commits but before start(workflow) executes, one-time triggers are permanently disabled with no execution.
  4. as any casts — nextRunAt is omitted from ScheduledTriggerInsertSchema but passed at the call site, causing multiple as any casts. Clean fix: make nextRunAt an accepted (optional) field in the insert schema.
  5. Security nits on the deploy endpoint — timing-safe comparison leaks secret length; error responses expose err.message.



const missingWorkflows = enabledTriggers
  .filter((t) => !workflowsByTriggerId.has(t.id))
  .filter((t) => !(t as any).nextRunAt)

Bug: listEnabledScheduledTriggers (in audit-queries.ts) only selects { id, name }, so nextRunAt is never present on the returned objects. This means !(t as any).nextRunAt is always true, and every enabled trigger will be reported as missing.

Fix: add nextRunAt: scheduledTriggers.nextRunAt to the select() in listEnabledScheduledTriggers, then remove this as any cast.

Comment on lines +27 to +30
orphanedWorkflows: [],
staleWorkflows: [],
deadWorkflows: [],
verificationFailures: [],

These four fields (orphanedWorkflows, staleWorkflows, deadWorkflows, verificationFailures) are now hardcoded empty arrays. Consider updating ScheduledTriggerAuditResult to remove or mark them optional — returning dead-letter fields that can never be populated adds noise.

Comment on lines +397 to +398
nextRunAt,
} as any);

The as any cast is needed because nextRunAt is omitted from ScheduledTriggerInsertSchemaBase. Since the create route now always computes and passes nextRunAt, the insert type should accept it.

Fix: remove nextRunAt from the .omit() in ScheduledTriggerInsertSchemaBase (or add it as .optional()) so the DAL function accepts it without a cast.

Comment on lines +565 to +582
const mergedEnabled = body.enabled !== undefined ? body.enabled : existing.enabled;
const enabledChanged = body.enabled !== undefined && body.enabled !== existing.enabled;

let nextRunAt: string | null | undefined;
if (!mergedEnabled) {
  nextRunAt = null;
} else if (scheduleChanged || enabledChanged) {
  const mergedCron =
    body.cronExpression !== undefined ? body.cronExpression : existing.cronExpression;
  const mergedTimezone =
    body.cronTimezone !== undefined ? body.cronTimezone : existing.cronTimezone;
  const mergedRunAt = body.runAt !== undefined ? body.runAt : existing.runAt;
  nextRunAt = computeNextRunAt({
    cronExpression: mergedCron,
    cronTimezone: mergedTimezone,
    runAt: mergedRunAt,
  });
}

nextRunAt is only recomputed when scheduleChanged || enabledChanged. If only payload or messageTemplate changes, nextRunAt is left unchanged — that's correct.

However, the enabled → disabled transition sets nextRunAt = null here, but onTriggerUpdated no longer cancels pending invocations for that case (the old Case 2 was removed from ScheduledTriggerService.ts). Already-queued pending or running invocations will continue executing even though the user disabled the trigger. Consider adding cancelPendingInvocationsForTrigger to the enabled→disabled path, either here or in onTriggerUpdated.

Comment on lines +78 to +89
await withRef(
  manageDbPool,
  resolvedRef,
  (db) =>
    advanceScheduledTriggerNextRunAt(db)({
      scopes: { tenantId, projectId, agentId },
      scheduledTriggerId,
      nextRunAt,
      enabled: isOneTime ? false : undefined,
    }),
  { commit: true, commitMessage: `Advance next_run_at for trigger ${scheduledTriggerId}` }
);

Risk: crash-window between advance and dispatch. If the process dies after this advanceScheduledTriggerNextRunAt commit but before start(scheduledTriggerRunnerWorkflow) on line 100, nextRunAt is already advanced but no workflow was started. The rollback on line 102 only runs if start() throws, not on a process crash.

  • Cron triggers: miss one execution (next tick computes a new nextRunAt) — acceptable.
  • One-time triggers: permanently disabled (enabled=false, nextRunAt=null) with no execution — data loss.

Consider reversing the order: start the workflow first (idempotent via createInvocationIdempotentStep), then advance nextRunAt. If advance fails, the trigger is picked up again next tick; the idempotency key prevents double-execution.
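The suggested start-then-advance ordering can be sketched as below. `startWorkflow` and `advanceNextRunAt` are hypothetical stand-ins for the PR's real steps; the point is the ordering and the failure handling, not the exact API.

```typescript
// Sketch of the suggested ordering: start the idempotent workflow first,
// then advance next_run_at. A crash between the two leaves next_run_at
// unchanged, so the next tick re-dispatches and idempotency deduplicates.
type Trigger = { id: string; scheduledFor: string };

async function dispatchOne(
  trigger: Trigger,
  startWorkflow: (t: Trigger) => Promise<void>,
  advanceNextRunAt: (id: string) => Promise<void>
): Promise<'dispatched' | 'failed'> {
  try {
    // 1. Start the one-shot workflow first.
    await startWorkflow(trigger);
  } catch {
    return 'failed'; // nothing advanced; next tick retries cleanly
  }
  try {
    // 2. Only then advance next_run_at (or disable a one-time trigger).
    await advanceNextRunAt(trigger.id);
  } catch (err) {
    // Workflow already started: log and rely on invocation-level
    // idempotency when the next tick re-dispatches this slot.
    console.error('advance failed after start', trigger.id, err);
  }
  return 'dispatched';
}
```

Under this ordering the worst case for cron triggers is a duplicate dispatch (suppressed by the idempotency key) rather than a silently lost one-time execution.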

Comment on lines +18 to +21
await upsertSchedulerState(runDbClient)({
  currentRunId: run.runId,
  deploymentId: getDeploymentId(),
});

Minor race: this upsertSchedulerState call is redundant with registerSchedulerStep inside the workflow. Both write currentRunId to the same singleton row. If the deploy endpoint is called twice in quick succession (e.g. CI retry), you can get:

  1. Call 1: start() → run C
  2. Call 2: start() → run D
  3. Call 1: upsertSchedulerState(C)
  4. Call 2: upsertSchedulerState(D)
  5. Workflow C: registerSchedulerStep(C) — overwrites D

Now both C and D think they're current. Consider removing this outer upsertSchedulerState and relying solely on registerSchedulerStep inside the workflow.

Comment on lines +11 to +15
function constantTimeEqual(a: string, b: string): boolean {
  const bufA = Buffer.from(a);
  const bufB = Buffer.from(b);
  if (bufA.length !== bufB.length) return false;
  return timingSafeEqual(bufA, bufB);
}

The early return false on length mismatch exits faster than timingSafeEqual, letting an attacker binary-search the secret's byte length via response timing. Low-severity in practice (network jitter dominates), but cheap to fix:

import { createHash, timingSafeEqual } from 'node:crypto';

function constantTimeEqual(a: string, b: string): boolean {
  const hash = (s: string) => createHash('sha256').update(s).digest();
  return timingSafeEqual(hash(a), hash(b));
}

{ error: err instanceof Error ? err.message : String(err) },
'Failed to restart scheduler workflow'
);
return c.json({ error: err instanceof Error ? err.message : 'Internal error' }, 500);

err.message is returned to the caller, potentially leaking internal details (DB connection strings, file paths). Since the caller is CI, return a generic message and rely on the server-side log (line 53) for debugging:

return c.json({ error: 'Failed to restart scheduler workflow' }, 500);

Comment on lines +228 to +232
const rows = await db.execute(
sql`SELECT id, tenant_id, project_id, agent_id,
cron_expression, cron_timezone, run_at,
next_run_at, enabled
FROM scheduled_triggers AS OF ${sql.raw(`'${branchName}'`)}

branchName is interpolated unescaped into the query via sql.raw. It's constructed from tenantId/projectId via getProjectScopedRef (simple concatenation, no sanitization). The values come from the runtime DB so they're trusted internal data — not a regression since this pattern exists elsewhere in the dolt module — but worth hardening. Consider a shared helper that validates branch names for AS OF clauses (e.g., reject values containing ').
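The shared helper suggested above could look something like this. It is a hypothetical sketch: the allowed character set is an assumption about what `getProjectScopedRef` produces and would need to match the real branch-naming scheme.

```typescript
// Hypothetical validator for branch names interpolated into AS OF clauses.
// Rejects anything outside a conservative allow-list, so a quote can never
// escape the quoted literal built by sql.raw.
function assertSafeBranchName(branchName: string): string {
  // Allowed set is an assumption: alphanumerics, dot, underscore, slash, dash.
  if (!/^[A-Za-z0-9._\/-]+$/.test(branchName)) {
    throw new Error(`Unsafe branch name for AS OF clause: ${branchName}`);
  }
  return branchName;
}

console.log(assertSafeBranchName('tenant-1/project-2/main')); // passes through unchanged
try {
  assertSafeBranchName("main' OR '1'='1");
} catch {
  console.log('rejected'); // quote character is refused
}
```

Validating at the single choke point keeps the existing `sql.raw` call sites unchanged while closing the injection surface.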

}): Promise<DueScheduledTrigger[]> => {
  const allDue: DueScheduledTrigger[] = [];

  for (const project of params.projects) {

This loops over every project one-at-a-time with a separate SQL query per project. Fine for small deployments, but could become a bottleneck at scale (N round-trips to Doltgres). Consider adding a log/metric for the iteration count so you can detect when this becomes slow.


@claude claude Bot left a comment


PR Review Summary

(4) Total Issues | Risk: High

This is a delta review covering 3 commits since the last automated review. The delta addresses several prior issues but leaves critical blocking items unresolved.

✅ Issues Fixed in Delta

Prior Issue | Status | Evidence
Reconciliation check broken (nextRunAt missing in select) | ✅ Fixed | audit-queries.ts:18 now selects nextRunAt, as any cast removed
Crash between advance-and-dispatch loses one-time triggers | ✅ Fixed | triggerDispatcher.ts:84 now starts workflow before advancing
No test coverage for computeNextRunAt | ✅ Fixed | 136 lines of tests added
No test coverage for triggerDispatcher | ✅ Fixed | 240 lines of tests added
Timing-safe comparison leaks secret length | ✅ Fixed | restartScheduler.ts:11-14 now uses SHA256 hash comparison
Error response exposes err.message | ✅ Fixed | restartScheduler.ts:54 now returns generic error

🔴❗ Critical (1) ❗🔴

🔴 1) 0013_lumpy_apocalypse.sql:1 Missing data migration for existing enabled triggers

Issue: The migration adds a nullable next_run_at column but does NOT backfill existing enabled triggers. All currently-enabled triggers will have next_run_at = NULL after migration.

Why: The scheduler workflow at findDueScheduledTriggersAcrossProjects only dispatches triggers where next_run_at IS NOT NULL AND next_run_at <= now(). Existing enabled triggers will silently stop running after deploy. This is a one-way door causing production outages for customers relying on scheduled triggers.

Fix: Add a data migration to backfill existing triggers:

-- After the ALTER TABLE, add:
UPDATE scheduled_triggers 
SET next_run_at = NOW() 
WHERE enabled = true AND next_run_at IS NULL;

Or implement startup reconciliation that calls computeNextRunAt for any enabled trigger with NULL next_run_at.
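The startup-reconciliation alternative could be sketched as below. This is an illustration only: `listTriggers`, `saveNextRunAt`, and the injected `computeNextRunAt` are hypothetical stand-ins for the PR's real DAL functions and utility.

```typescript
// Hypothetical startup backfill: any enabled trigger the migration left
// with a NULL next_run_at gets a freshly computed one, so it resumes
// running without a manual update.
interface TriggerRow {
  id: string;
  enabled: boolean;
  nextRunAt: string | null;
}

async function backfillMissingNextRunAt(
  listTriggers: () => Promise<TriggerRow[]>,
  computeNextRunAt: (t: TriggerRow) => string | null,
  saveNextRunAt: (id: string, nextRunAt: string | null) => Promise<void>
): Promise<number> {
  let backfilled = 0;
  for (const t of await listTriggers()) {
    // Only enabled triggers that are currently unscheduled need repair.
    if (!t.enabled || t.nextRunAt !== null) continue;
    await saveNextRunAt(t.id, computeNextRunAt(t));
    backfilled++;
  }
  return backfilled;
}
```

Compared with the `UPDATE ... SET next_run_at = NOW()` one-liner, this honors each trigger's actual cron/runAt schedule instead of firing everything immediately after deploy.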


🟠⚠️ Major (2) 🟠⚠️

🟠 1) agentFull.ts + projectFull.ts Bulk routes don't compute nextRunAt for new triggers

Issue: Triggers created via createFullAgentServerSide and createFullProjectServerSide (the PUT/PATCH bulk routes) call upsertScheduledTrigger without computing nextRunAt. These triggers will have nextRunAt = NULL and won't be dispatched.

Why: SDK push commands and bulk imports use these routes. Triggers will appear enabled in the UI but will never execute until manually updated via the individual trigger PATCH endpoint.

Fix: Compute nextRunAt before upserting in the DAL functions, following the pattern at scheduledTriggers.ts:374-380:

const nextRunAt = enabled
  ? computeNextRunAt({ cronExpression, cronTimezone, runAt })
  : null;


🟠 2) triggerDispatcher.ts:44 Unbounded concurrent dispatches

Issue: All due triggers are dispatched in parallel via Promise.allSettled with no concurrency limit. If many triggers become due simultaneously (scheduler outage recovery, popular cron times), this could spawn hundreds of concurrent workflow starts.

Why: Risks connection pool exhaustion, workflow engine overload, and cascading failures during recovery scenarios.

Fix: Add concurrency limiting:

import pLimit from 'p-limit';
const limit = pLimit(10);
const results = await Promise.allSettled(
  dueTriggers.map((trigger) => limit(() => dispatchSingleTrigger(trigger)))
);


🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: triggerDispatcher.ts:104 Error log missing correlation context

💭 Consider (2) 💭

Inline Comments:

  • 💭 Consider: computeNextRunAt.test.ts:104-126 DST tests use weak assertions
  • 💭 Consider: triggerDispatcher.test.ts:188 Missing test for advance-failure scenario

🕐 Pending Recommendations (5)

Prior issues from pullfrog and earlier claude review that remain unresolved:


🚫 REQUEST CHANGES

Summary: Good progress on the delta — 6 of 9 prior issues have been addressed, including critical fixes to the reconciliation check, dispatch ordering, test coverage, and security. However, the most critical issue remains unresolved: the data migration that will cause all existing scheduled triggers to silently stop working after deploy. This must be addressed before merge.

Secondary priority: the bulk routes (agentFull/projectFull) still don't compute nextRunAt, which will cause SDK-created triggers to be dormant.

Discarded (3)
Location | Issue | Reason Discarded
triggerDispatcher.ts:108 | Success log emitted even when advance fails | Acceptable — log clearly says "dispatched" which is accurate; advance failure is logged separately
computeNextRunAt.test.ts | No test for invalid cron expressions | Low priority — invalid crons should be caught at trigger creation validation
triggerDispatcher.test.ts | No test for scheduledFor fallback | Low priority — simple fallback unlikely to regress
Reviewers (4)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0
pr-review-tests | 5 | 0 | 1 | 0 | 1 | 0 | 2
pr-review-sre | 4 | 1 | 0 | 0 | 1 | 1 | 1
pr-review-breaking-changes | 3 | 2 | 0 | 0 | 0 | 2 | 0
Total | 12 | 3 | 1 | 0 | 2 | 3 | 3

Note: Many findings were deduplicated with prior reviews (pullfrog, earlier claude). Delta-focused review intentionally narrow.

logger.error(
  { scheduledTriggerId, err },
  'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'
);

🟡 Minor: Misleading "idempotent" comment

Issue: The comment claims "next tick will retry (idempotent)" but the idempotency guarantee is at the workflow layer (via idempotencyKey = sched_{triggerId}_{scheduledFor} in scheduledTriggerRunner.ts), not the dispatcher layer. If advance fails, the next tick will start another workflow instance that must detect the duplicate via createInvocationIdempotentStep.

Why: During incident triage, this comment could mislead engineers into thinking the dispatcher itself prevents duplicates, when it actually relies on downstream workflow-level deduplication.

Fix: Clarify the comment:

Suggested change:

logger.error(
  { scheduledTriggerId, err },
  'Failed to advance next_run_at after workflow start; next tick will re-dispatch (workflow has invocation-level idempotency)'
);


} catch (err) {
logger.error(
{ scheduledTriggerId, err },
'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'

🟡 Minor: Error log missing context for incident correlation

Issue: The error log includes scheduledTriggerId and err but lacks tenantId, projectId, agentId, and scheduledFor timestamp. During an incident with multiple failing triggers, correlating these logs to specific customers will be difficult.

Why: All these identifiers are already in scope — including them costs nothing and significantly improves debuggability.

Fix:

Suggested change:

logger.error(
  { scheduledTriggerId, tenantId, projectId, agentId, scheduledFor: trigger.nextRunAt, err },
  'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'
);


} catch (err) {
logger.error(
{ scheduledTriggerId, err },
'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'

🟡 Minor: Error log missing correlation context for incident debugging

Issue: This log includes scheduledTriggerId and err but is missing tenantId, projectId, agentId, and scheduledFor — all of which are available in scope on line 64.

Why: During an incident with multiple failing triggers, operators would need to manually correlate trigger IDs back to tenant/project context. Including all identifiers makes debugging significantly easier and costs nothing.

Fix:

Suggested change:

logger.error(
  { scheduledTriggerId, tenantId, projectId, agentId, scheduledFor: trigger.nextRunAt, err },
  'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'
);


return Promise.resolve();
});
mockWithRef.mockImplementation(async (_pool, _ref, fn, _opts) => {
return fn('mock-branch-db');

🟡 Minor: Missing test for advance-failure-after-workflow-start scenario

Issue: This test covers workflow start failure, but there's no test for when advanceScheduledTriggerNextRunAt fails after workflow start succeeds. In triggerDispatcher.ts:88-106, if withRef throws, the code logs an error but still returns 'dispatched'.

Why: Without this test, it's unclear whether the current behavior (count as dispatched, log error) is intentional or accidental. The next scheduler tick will dispatch the same trigger again, relying on workflow-level idempotency. This test would document the expected behavior and catch regressions.

Fix: Add a test case:

it('counts as dispatched even when advance fails (workflow already started)', async () => {
  mockStart.mockResolvedValue(undefined);
  mockWithRef.mockRejectedValue(new Error('db write failed'));
  
  const result = await dispatchDueTriggers();
  
  expect(result).toEqual({ dispatched: 1 });
  expect(mockStart).toHaveBeenCalledTimes(1);
});


Comment on lines +104 to +126
it('handles DST spring-forward transition', () => {
  const result = computeNextRunAt({
    cronExpression: '30 2 * * *',
    cronTimezone: 'America/New_York',
    lastScheduledFor: '2026-03-07T07:30:00.000Z',
  });

  expect(result).toBeDefined();
  const nextDate = new Date(result!);
  expect(nextDate.getTime()).toBeGreaterThan(new Date('2026-03-07T07:30:00.000Z').getTime());
});

it('handles DST fall-back transition', () => {
  const result = computeNextRunAt({
    cronExpression: '30 1 * * *',
    cronTimezone: 'America/New_York',
    lastScheduledFor: '2026-10-31T05:30:00.000Z',
  });

  expect(result).toBeDefined();
  const nextDate = new Date(result!);
  expect(nextDate.getTime()).toBeGreaterThan(new Date('2026-10-31T05:30:00.000Z').getTime());
});

💭 Consider: DST tests use weak assertions

Issue: These DST tests only assert that the result is defined and later than the input. They don't verify the exact expected next run time, meaning bugs in DST handling (e.g., skipping to the wrong day, firing twice during fall-back) would pass.

Why: DST bugs in scheduler systems are notoriously hard to debug in production. Spring-forward and fall-back transitions can cause triggers to fire at unexpected times or not at all.

Fix: Strengthen assertions to verify exact expected times. For example, for the spring-forward test (March 8, 2026 in America/New_York):

// 2:30 AM doesn't exist during spring-forward, so next valid occurrence
// after 2026-03-07T07:30:00Z should be 2026-03-09T07:30:00Z
expect(result).toBe('2026-03-09T07:30:00.000Z');


@pullfrog
Contributor

pullfrog Bot commented Mar 24, 2026

TL;DR — Replaces the per-trigger daisy-chaining workflow model with a single centralized scheduler workflow that polls the runtime DB every 60 seconds and dispatches one-shot workflows for due triggers. This moves the scheduled_triggers table from the manage DB (DoltgreSQL) to the runtime DB (Postgres) so triggers always run against the latest code, and adds a next_run_at column and a ref column so triggers can target specific branches.

Key changes

  • Centralized scheduler workflow — A single long-lived schedulerWorkflow ticks every 60 seconds, queries all due triggers across projects, and dispatches independent one-shot runner workflows for each
  • Scheduled triggers moved to runtime DB — The scheduled_triggers table is dropped from manage-schema (DoltgreSQL) and recreated in runtime-schema (Postgres) with new next_run_at and ref columns
  • One-shot trigger runner — scheduledTriggerRunnerWorkflow is simplified from a daisy-chaining loop to a single-invocation executor that receives scheduledFor and ref from the dispatcher
  • Deploy restart endpoint — New POST /api/deploy/restart-scheduler route lets CI restart the scheduler on the latest deployment, with a matching Vercel production workflow step
  • Branch-aware triggers — Triggers store a ref that controls which branch's agent config is used at execution time; branch deletion cascades to clean up associated triggers
  • Removed per-trigger workflow management — Eliminates scheduled_workflows table, ScheduledWorkflow entity, startScheduledTriggerWorkflow/signalStopScheduledTriggerWorkflow/restartScheduledTriggerWorkflow, and the reconciliation handler for scheduled triggers

Summary | 60 files | 23 commits | base: main ← feat/manage-table-cron-dispatcher-v2


Scheduler architecture overhaul

Before: Each scheduled trigger had its own long-running workflow that daisy-chained iterations, tracked state via a scheduled_workflows table in DoltgreSQL, and required complex adoption/supersession logic when workflows crashed or deployments changed.
After: A single schedulerWorkflow runs as a long-lived loop, ticking every 60 seconds. It queries scheduled_triggers.next_run_at in the runtime DB, dispatches independent one-shot scheduledTriggerRunnerWorkflow instances for due triggers, and advances next_run_at after dispatch.

The scheduler registers itself in a new scheduler_state singleton table. On each tick it checks whether it is still the active scheduler — if a newer instance has started (e.g., after a deploy), the old one gracefully exits. For Vercel deployments, a post-deploy CI step calls the restart endpoint to ensure the scheduler runs on the latest code. For postgres-world/local, the scheduler starts on boot after orphan recovery.

schedulerWorkflow.ts · schedulerSteps.ts · SchedulerService.ts · schedulerState.ts


Trigger dispatcher and one-shot runner

Before: Each trigger runner workflow computed its own next execution time, slept until then, executed, and daisy-chained to a new workflow — requiring complex state management for adoption, supersession, and cancelled-invocation recovery.
After: dispatchDueTriggers() queries all due triggers in a single cross-project query, starts a one-shot workflow per trigger with Promise.allSettled, and advances next_run_at (or disables for one-time triggers). The runner is a simple linear flow: check enabled → create invocation → execute with retries → mark completed or failed.

The dispatcher computes next_run_at via the new shared computeNextRunAt utility, which is also used by the CRUD routes when creating/updating triggers.

triggerDispatcher.ts · scheduledTriggerRunner.ts · compute-next-run-at.ts


Schema migration: manage DB → runtime DB

Before: scheduled_triggers and scheduled_workflows lived in the manage DB (DoltgreSQL), requiring branch-scoped withRef queries and AS OF semantics for every read/write.
After: scheduled_triggers moves to the runtime DB (Postgres) with new columns next_run_at and ref. scheduled_workflows is dropped entirely. All CRUD routes now use runDbClient directly instead of c.get('db') with branch scoping.

The manage DB migration (0013) drops the old tables. The runtime DB migration (0025) creates the new scheduled_triggers table with indexes on (enabled, next_run_at) for the dispatcher query, (tenant_id, project_id, agent_id) for scoped lookups, and (ref) for branch deletion cleanup. A new scheduler_state singleton table is also created.

manage-schema.ts · runtime-schema.ts · 0025_long_shard.sql · 0013_married_rage.sql


Branch-aware trigger execution

Before: Triggers always resolved against the main branch via getProjectScopedRef(tenantId, projectId, 'main').
After: Triggers store a ref column. At execution time, executeScheduledTriggerStep resolves the trigger's ref (defaulting to main if null), so triggers can run against specific branch configurations.

When a branch is deleted, deleteScheduledTriggersByRef cleans up all triggers targeting that branch in the runtime DB. The dispatcher passes ref through to each one-shot workflow payload.
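The ref handling can be illustrated with a small sketch. The names mirror the description above but the in-memory modeling is an assumption; the real cleanup is a DELETE against the runtime DB.

```typescript
// Illustrative stored-trigger shape (an assumption).
type StoredTrigger = { id: string; ref: string | null };

// A null ref falls back to the main branch at execution time.
function resolveRef(trigger: StoredTrigger): string {
  return trigger.ref ?? "main";
}

// Modeled here as filtering a list; in the real system this is a
// runtime-DB delete keyed on the (ref) index.
function deleteScheduledTriggersByRef(
  triggers: StoredTrigger[],
  deletedRef: string,
): StoredTrigger[] {
  return triggers.filter((t) => t.ref !== deletedRef);
}
```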

scheduledTriggerSteps.ts · branches.ts · runtime/scheduledTriggers.ts


Deploy restart endpoint and CI integration

Before: No mechanism to ensure the scheduler moves to a new deployment after a Vercel deploy.
After: A new POST /api/deploy/restart-scheduler endpoint (authed via INKEEP_AGENTS_RUN_API_BYPASS_SECRET) starts a fresh scheduler workflow. The Vercel production workflow adds a restart-scheduler job that calls this endpoint after deploy + promote.

The endpoint uses constant-time comparison for the bearer token and the noAuth() permission (public route with manual auth). On postgres-world/local, the scheduler starts automatically on server boot.
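A minimal sketch of the constant-time token check, assuming Node's crypto.timingSafeEqual and standard Bearer parsing; the route wiring, env lookup, and noAuth() registration are omitted.

```typescript
import { timingSafeEqual } from "node:crypto";

// Compares a "Bearer <token>" header against the expected secret without
// short-circuiting on the first mismatched byte.
function isAuthorized(authHeader: string | undefined, secret: string): boolean {
  if (!authHeader?.startsWith("Bearer ")) return false;
  const token = Buffer.from(authHeader.slice("Bearer ".length));
  const expected = Buffer.from(secret);
  // timingSafeEqual throws on length mismatch, so lengths must be checked
  // first; this leaks only the token length, not its contents.
  if (token.length !== expected.length) return false;
  return timingSafeEqual(token, expected);
}
```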

restartScheduler.ts · createApp.ts · index.ts · vercel-production.yml


CRUD routes and service simplification

Before: ScheduledTriggerService exposed startScheduledTriggerWorkflow, signalStopScheduledTriggerWorkflow, restartScheduledTriggerWorkflow, onTriggerCreated, onTriggerDeleted, and onTriggerUpdated. Routes called these lifecycle hooks on every mutation and required branch-scoped manage DB connections.
After: The service is reduced to onTriggerUpdated (for cancelling stale pending invocations on re-enable or schedule change). CRUD routes write directly to runDbClient, computing nextRunAt inline at create/update time. No workflow lifecycle management on mutation.

ScheduledTriggerService.ts · scheduledTriggers.ts (routes) · manage/scheduledTriggers.ts (deleted)


Removed reconciliation and audit infrastructure

Before: A scheduled_triggers reconciliation handler checked for missing workflows, orphaned workflows, stale workflows, and dead workflows — repairing them by starting or stopping per-trigger workflows.
After: The entire scheduledTriggersHandlers reconciliation handler is deleted. The centralized scheduler inherently recovers from any state — if a trigger is due and enabled, it will be dispatched on the next tick.

scheduled-triggers.ts (deleted) · registry.ts


UI and cleanup updates

The scheduled triggers table component is rewritten from a DataTable/ColumnDef pattern to a direct Table component, adds a branch (ref) column with a GitBranch icon, and simplifies state management. Trigger cleanup for user deletion now deletes from the runtime DB before the manage DB branch-scoped cleanup.

project-scheduled-triggers-table.tsx · triggerCleanup.ts

Pullfrog | View workflow run | Triggered by Pullfrog (pullfrog.com)

@pullfrog pullfrog Bot (Contributor) commented Mar 24, 2026

TL;DR — Replaces the per-trigger daisy-chaining workflow model with a single centralized scheduler workflow that polls the runtime DB every 60 seconds and dispatches one-shot workflows for due triggers. This moves the scheduled_triggers table from the manage DB (DoltgreSQL) to the runtime DB (Postgres) so triggers always run against the latest code, and adds a next_run_at column and a ref column so triggers can target specific branches.

Key changes

  • Centralized scheduler workflow — A single long-lived schedulerWorkflow ticks every 60 seconds, queries all due triggers across projects, and dispatches independent one-shot runner workflows for each
  • Scheduled triggers moved to runtime DB — The scheduled_triggers table is dropped from manage-schema (DoltgreSQL) and recreated in runtime-schema (Postgres) with new next_run_at and ref columns
  • One-shot trigger runner — scheduledTriggerRunnerWorkflow is simplified from a daisy-chaining loop to a single-invocation executor that receives scheduledFor and ref from the dispatcher
  • Deploy restart endpoint — New POST /api/deploy/restart-scheduler route lets CI restart the scheduler on the latest deployment, with a matching Vercel production workflow step
  • Branch-aware triggers — Triggers store a ref that controls which branch's agent config is used at execution time; branch deletion cascades to clean up associated triggers
  • Removed per-trigger workflow management — Eliminates scheduled_workflows table, ScheduledWorkflow entity, startScheduledTriggerWorkflow/signalStopScheduledTriggerWorkflow/restartScheduledTriggerWorkflow, and the reconciliation handler for scheduled triggers


@github-actions github-actions Bot removed the stale label Mar 25, 2026
@github-actions github-actions Bot (Contributor) commented Apr 1, 2026

This pull request has been automatically marked as stale because it has not had recent activity.
It will be closed in 7 days if no further activity occurs.

If this PR is still relevant:

  • Rebase it on the latest main branch
  • Add a comment explaining its current status
  • Request a review if it's ready

Thank you for your contributions!

@github-actions github-actions Bot added the stale label Apr 1, 2026
@github-actions github-actions Bot (Contributor) commented Apr 9, 2026

This pull request has been automatically closed due to inactivity.

If you'd like to continue working on this, please:

  1. Create a new branch from the latest main
  2. Cherry-pick your commits or rebase your changes
  3. Open a new pull request

Thank you for your understanding!

@github-actions github-actions Bot closed this Apr 9, 2026
@github-actions github-actions Bot deleted the feat/manage-table-cron-dispatcher-v2 branch April 9, 2026 00:35