Skip to content

fix: Make OpenTelemetry startup idempotent#2684

Merged
mike-inkeep merged 9 commits intomainfrom
fix/hot_reload_metricreader_error
Mar 19, 2026
Merged

fix: Make OpenTelemetry startup idempotent#2684
mike-inkeep merged 9 commits intomainfrom
fix/hot_reload_metricreader_error

Conversation

@mike-inkeep
Copy link
Copy Markdown
Contributor

Vite hot reload frequently tries to re-run the otel SDK startup, which is not idempotent and leads to a 500 error on all subsequent requests. This requires restarting pnpm dev.
The error shows up as: 10:04:53 AM [vite] Internal server error: MetricReader can not be bound to a MeterProvider again.

This PR makes that operation idempotent by using a global var to keep track of whether otel was started up already.

@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Mar 19, 2026 8:20pm
agents-docs Ready Ready Preview, Comment Mar 19, 2026 8:20pm
agents-manage-ui Ready Ready Preview, Comment Mar 19, 2026 8:20pm

Request Review

@vercel vercel Bot temporarily deployed to Preview – agents-docs March 13, 2026 14:11 Inactive
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Mar 13, 2026

🦋 Changeset detected

Latest commit: 9c711d1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-api Patch
@inkeep/agents-manage-ui Patch
@inkeep/agents-cli Patch
@inkeep/agents-core Patch
@inkeep/agents-email Patch
@inkeep/agents-mcp Patch
@inkeep/agents-sdk Patch
@inkeep/agents-work-apps Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idempotency approach is sound for the Vite HMR use case, but there's a breaking change that needs addressing before merge.

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow runpullfrog.com𝕏

Comment thread agents-api/src/instrumentation.ts Outdated
});

export const defaultSDK = new NodeSDK({
const defaultSDK = new NodeSDK({
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the export from defaultSDK is a breaking change. create-agents-template/apps/agents-api/src/instrumentation.ts:19 does:

const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

@inkeep/agents-api/instrumentation is a declared subpath export in agents-api/package.json. The template (and any downstream consumer using that subpath) will break at runtime.

Either:

  1. Update the template in this PR to use startOpenTelemetrySDK(), or
  2. Continue exporting defaultSDK (possibly deprecated) alongside the new function.

Comment thread agents-api/src/instrumentation.ts Outdated
Comment on lines +93 to +110
const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted');

export function startOpenTelemetrySDK(): void {
if ((globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY]) {
return;
}
try {
defaultSDK.start();
(globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
} catch (error) {
const msg = error instanceof Error ? error.message : String(error);
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
(globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
return;
}
throw error;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new NodeSDK(...) constructor at line 85 still runs at module scope on every HMR re-evaluation — allocating fresh span processors, context managers, and instrumentations each time. The guard only prevents start() from firing twice, but the stale SDK objects from prior evaluations are leaked.

Consider moving the new NodeSDK(...) construction inside startOpenTelemetrySDK() behind the same guard, or caching the instance on globalThis so repeated module evaluations reuse it.

Comment thread agents-api/src/instrumentation.ts
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Breaking change: defaultSDK export removed but template still imports it

Issue: This PR removes the export keyword from defaultSDK (making it private), but create-agents-template still imports this export directly:

// create-agents-template/apps/agents-api/src/instrumentation.ts:19-20
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

Why: When users upgrade @inkeep/agents-api or scaffold a new project using the template, the code will fail at runtime with an import error. This is a breaking change that affects the template's compatibility with the package.

Fix: Update the template to use the new idempotent function:

// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();

Refs:


🟠 2) agents-api/src/instrumentation.ts:95-110 Missing test coverage for idempotency logic

Issue: The new startOpenTelemetrySDK() function has zero test coverage. It has three distinct code paths: (1) early return when already started, (2) successful SDK start with flag set, and (3) error handling for MetricReader binding errors — none of which are tested.

Why: Without tests, future refactoring could break the HMR safety mechanism, causing the original 500 error to resurface. The codebase already has a test pattern for this exact scenario in socket-mode.test.ts:

// packages/agents-work-apps/src/__tests__/slack/socket-mode.test.ts:121
it('should skip if client already exists (HMR safety)', async () => {
  (globalThis as Record<string, unknown>).__inkeep_slack_socket_mode_client__ = {};
  // ...
  expect(mockSocketModeStart).not.toHaveBeenCalled();
});

Fix: Add a test file at agents-api/src/__tests__/instrumentation.test.ts with cases for:

  1. should skip SDK start if already started (HMR safety) — verify early return
  2. should set global flag after successful start — verify flag is set
  3. should catch MetricReader binding error and set flag — verify error suppression
  4. should rethrow non-MetricReader errors — verify other errors propagate

Refs:

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: instrumentation.ts:104-107 Silent error suppression without logging

💭 Consider (2) 💭

💭 1) agents-api/src/instrumentation.ts:104 Brittle string matching for error detection

Issue: The error check msg.includes('MetricReader') && msg.includes('can not be bound') relies on specific error message text from @opentelemetry/sdk-node. If OpenTelemetry changes the error message wording in a future version, this check would fail.

Why: A dependency update could silently break the hot-reload fix. Since this is dev-only, it might not be caught until it disrupts developer workflow.

Fix: Add a version comment documenting the expected error format:

// Known error from @opentelemetry/sdk-node when SDK.start() is called multiple times.
// Verified against @opentelemetry/[email protected]. Update if error message changes.
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {

Refs:

💭 2) agents-api/src/instrumentation.ts:102-108 Catch block sets flag even when SDK may be in partial state

Issue: If the MetricReader error is thrown, it means defaultSDK.start() partially succeeded before failing. Setting the flag to true prevents future retries, but the SDK may be in an inconsistent state.

Why: In the HMR context this is fine because the previous SDK instance's tracers are still working. But it's worth noting that subsequent calls to shutdown or flush on this new instance would operate on a broken SDK.

Fix: This is acceptable as-is for the HMR use case. Consider adding a code comment documenting this edge case for future maintainers.


🚫 REQUEST CHANGES

Summary: The approach using Symbol.for() + globalThis for idempotency is correct and follows established patterns in this codebase (see Slack Socket Mode). However, two issues need addressing before merge: (1) the template breaking change — create-agents-template imports the now-private defaultSDK export, and (2) the fix lacks test coverage for a behavior-critical change. Both are straightforward to fix.

Discarded (1)
Location Issue Reason Discarded
agents-docs/content/guides/observability/langfuse-usage.mdx:50 Documentation shows old OTEL pattern Pre-existing issue, docs show user-maintained code example, LOW confidence
Reviewers (5)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-sre 3 1 0 0 1 0 1
pr-review-tests 2 1 1 0 0 0 0
pr-review-errors 2 0 1 0 1 0 0
pr-review-precision 2 1 1 0 0 0 0
pr-review-standards 0 0 0 0 0 0 0
Total 9 3 3 0 2 0 1

Note: Findings were deduplicated across reviewers — template breaking change was flagged by both pr-review-sre and pr-review-precision; silent error suppression was flagged by both pr-review-sre and pr-review-errors.

Comment thread agents-api/src/instrumentation.ts Outdated
Comment on lines +104 to +107
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
(globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
return;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Silent error suppression without logging

Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.

Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict.

Fix: Add a debug/info log to maintain observability:

Suggested change
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
(globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
return;
}
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization');
(globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true;
return;
}

Refs:

@github-actions github-actions Bot deleted a comment from claude Bot Mar 13, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 13, 2026

Ito Test Report ✅

19 test cases ran. 19 passed.

The verification run completed with stable API and UI behavior across baseline checks, stress scenarios, auth/session handling, navigation edge cases, and post-stress resilience validation. 🔍 Code review of the PR-scoped startup-idempotency change and observed runtime outcomes did not surface a code-supported defect under the exercised paths.

✅ Passed (19)
Test Case Summary Timestamp Screenshot
ROUTE-1 Six consecutive /health probes returned 204 with no 5xx responses. 0:00 ROUTE-1_0-00.png
ROUTE-2 The /ready endpoint returned 200 with expected healthy schema fields and no server error. 0:00 ROUTE-2_0-00.png
ROUTE-3 Root navigation bootstrapped into tenant projects successfully with no observed auth/bootstrap 500 responses. 0:58 ROUTE-3_0-58.png
ROUTE-4 Tenant route navigation remained functional across settings and projects views with no server error page during transitions. 1:59 ROUTE-4_1-59.png
LOGIC-1 Completed 20 hard reload cycles on authenticated tenant route and app stayed usable without persistent server-error state. 16:05 LOGIC-1_16-05.png
LOGIC-2 Ran 10-tab concurrent load and refresh sequence; all tabs remained on valid tenant projects page without 5xx behavior. 16:05 LOGIC-2_16-05.png
LOGIC-3 No 5xx occurred across repeated auth-session probes and a valid dev-session bootstrap returned 200 with session cookie issuance. 18:03 LOGIC-3_18-03.png
EDGE-1 Direct deep-link to tenant projects from a cleared state resolved successfully with controlled auth bootstrap and no 500 response. 2:13 EDGE-1_2-13.png
EDGE-2 Completed three back/back/forward/forward cycles across tenant routes; no stale 500/internal-error state appeared during traversal. 20:15 EDGE-2_20-15.png
EDGE-3 Mobile viewport bootstrap completed into tenant projects with visible primary UI and no observed startup-related 500 responses. 2:34 EDGE-3_2-34.png
EDGE-4 Repeated refresh interruptions during route activity recovered to stable tenant page each time with no persistent server-error loop. 16:05 EDGE-4_16-05.png
EDGE-5 Completed 30 open/close churn iterations and periodic health probes; API health remained 204 throughout and UI stayed functional. 16:05 EDGE-5_16-05.png
ADV-1 POST /api/auth/dev-session without Authorization returned controlled 401 response. 0:00 ADV-1_0-00.png
ADV-2 Ten rapid requests with incorrect bearer token consistently returned 401 without any 5xx. 0:00 ADV-2_0-00.png
ADV-3 Malformed and oversized Authorization header variants were rejected with controlled non-5xx statuses. 0:00 ADV-3_0-00.png
ADV-4 After deleting the auth cookie, protected-route fan-out completed with controlled route handling and no raw 500 error pages. 18:05 ADV-4_18-05.png
ADV-5 Encoded script/traversal path variants were handled in a controlled way (error/not-found style behavior) without 500 responses or raw script payload reflection. 20:31 ADV-5_20-31.png
ADV-6 High-concurrency mixed probes completed with only expected 200/204/401 statuses and no 5xx responses. 18:08 ADV-6_18-08.png
RESILIENCE-1 After a 2-minute post-stress interval, tenant UI routes remained functional, /health returned 204, /ready returned 200 with valid contract, and no delayed 500 regression was observed. 22:29 RESILIENCE-1_22-29.png
📋 View Recording

Screen Recording

@mike-inkeep
Copy link
Copy Markdown
Contributor Author

@claude add export back to defaultSDK

move new NodeSDK() into startOpenTelemetrySDK behind the same guard, or cache the instance on globalThis so repeated module evaluations reuse it

@inkeep-internal-ci
Copy link
Copy Markdown
Contributor

inkeep-internal-ci Bot commented Mar 16, 2026

Claude finished @mike-inkeep's task in 1m 43s —— View job


Changes implemented

  • Read current instrumentation.ts and template file
  • Add export back to defaultSDK (fixes breaking change for create-agents-template)
  • Move new NodeSDK() behind a globalThis guard via getOrCreateSDK() — repeated HMR module evaluations now reuse the cached instance instead of leaking fresh SDK objects
  • Commit and push

What changed in agents-api/src/instrumentation.ts:

  • Added OTEL_SDK_KEY symbol on globalThis to cache the NodeSDK instance
  • getOrCreateSDK() checks the cache first, only constructs a new NodeSDK if none exists
  • export const defaultSDK = getOrCreateSDK() — export is restored, template import is unbroken
  • startOpenTelemetrySDK() uses a shared OtelGlobal type alias (cleaner casts)
    ·

@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 16, 2026

Ito Test Report ✅

19 test cases ran. 19 passed.

✅ The local dev stability run stayed healthy across API route checks, auth bootstrap flows, multi-tab/mobile stress, and adversarial auth requests. Code-change logs only show non-product test-run configuration tweaks (dev server host/overlay indicators), and no test output provided a code-supported defect signal, so all executed cases are reported as passed.

✅ Passed (19)
Test Case Summary Timestamp Screenshot
ROUTE-1 Twenty sequential /health checks returned HTTP 204 with no HTTP 500 responses. 0:00 ROUTE-1_0-00.png
ROUTE-2 Readiness probes stayed healthy after route churn and returned HTTP 200 without HTTP 500. 0:00 ROUTE-2_0-00.png
ROUTE-3 Clean-context navigation to /default/projects rendered successfully with protected UI available. 1:08 ROUTE-3_1-08.png
ROUTE-4 OpenAPI remained parseable across repeated checks and docs route stayed reachable with no HTTP 500 status responses. 0:00 ROUTE-4_0-00.png
ROUTE-5 Ten rapid refreshes completed on /default/projects with stable page availability and no persistent failure pattern. 1:21 ROUTE-5_1-21.png
LOGIC-1 Repeated first-hit style root requests with periodic readiness/openapi probes stayed stable and did not show HTTP 500 responses. 0:00 LOGIC-1_0-00.png
LOGIC-2 Parallel bursts across core routes stayed responsive and did not produce HTTP 500 status responses. 0:00 LOGIC-2_0-00.png
LOGIC-3 Forced recovery probe completed with five healthy checks and successful return to protected UI; no poisoned state persisted. 7:25 LOGIC-3_7-25.png
EDGE-1 Thirty reloads with periodic readiness probes remained stable; all sampled health/readiness checks stayed valid. 1:38 EDGE-1_1-38.png
EDGE-2 Five-tab concurrent navigation and reload sequence completed with responsive API endpoints and no persistent crash state. 2:31 EDGE-2_2-31.png
EDGE-3 Five clean-cookie deep-link interruption cycles converged to rendered stats page without persistent 500 behavior. 2:47 EDGE-3_2-47.png
EDGE-4 Three history replay cycles across mixed routes completed and follow-up health/readiness checks remained healthy. 3:05 EDGE-4_3-05.png
ADV-1 All 10 unauthorized dev-session POST requests returned 401, and follow-up health/readiness checks stayed healthy (204/200) with no 500. 9:24 ADV-1_9-24.png
ADV-2 All malformed Authorization variants returned safe 401 responses, and service remained responsive afterward (health 204, ready 200) with no 500. 9:26 ADV-2_9-26.png
ADV-3 Executed 50 mixed requests plus protected-route reloads; service stayed responsive with no persistent 500 cascade. 7:23 ADV-3_7-23.png
ADV-4 Fresh-cookie protected-route probing completed without backend crash and service health stayed available. 3:21 ADV-4_3-21.png
RAPID-1 Thirty rapid route toggles plus periodic reloads completed with recoverable UI state and healthy backend probe. 3:42 RAPID-1_3-42.png
MOBILE-1 At 390x844, repeated refresh and navigation toggles remained operable with no persistent startup-failure cascade. 4:05 MOBILE-1_4-05.png
JOURNEY-1 Repeated mid-journey refresh interruptions converged to a stable rendered state with backend health intact. 4:15 JOURNEY-1_4-15.png
📋 View Recording

Screen Recording

@mike-inkeep
Copy link
Copy Markdown
Contributor Author

@claude re-review please

@mike-inkeep mike-inkeep force-pushed the fix/hot_reload_metricreader_error branch from d5f8ac1 to 2c8ea6c Compare March 17, 2026 16:38
@vercel vercel Bot temporarily deployed to Preview – agents-docs March 17, 2026 16:38 Inactive
Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idempotency guard for NodeSDK.start() is sound and correctly solves the Vite HMR crash. Two items worth addressing: (1) module-level singletons (defaultBatchProcessor, otlpExporter, etc.) are re-created on every reload but the cached SDK still references the originals — flushBatchProcessor() will silently flush the wrong instance; (2) the OtelGlobal type could be tighter to avoid the as NodeSDK cast.

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow runpullfrog.com𝕏

Comment thread agents-api/src/instrumentation.ts Outdated
const OTEL_SDK_KEY = Symbol.for('@inkeep/agents-api.otelSDK');
const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted');

type OtelGlobal = { [key: symbol]: boolean | NodeSDK };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OtelGlobal type allows assigning boolean to OTEL_SDK_KEY or NodeSDK to OTEL_STARTED_KEY without error. Consider using a stricter interface with the specific symbol constants as keys:

type OtelGlobal = {
  [OTEL_SDK_KEY]?: NodeSDK;
  [OTEL_STARTED_KEY]?: boolean;
};

This eliminates the as NodeSDK cast on line 101 too.

Comment thread agents-api/src/instrumentation.ts Outdated
Comment on lines +85 to +104
const OTEL_SDK_KEY = Symbol.for('@inkeep/agents-api.otelSDK');
const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted');

type OtelGlobal = { [key: symbol]: boolean | NodeSDK };

function getOrCreateSDK(): NodeSDK {
const g = globalThis as unknown as OtelGlobal;
if (!g[OTEL_SDK_KEY]) {
g[OTEL_SDK_KEY] = new NodeSDK({
resource: defaultResource,
contextManager: defaultContextManager,
textMapPropagator: defaultTextMapPropagator,
spanProcessors: defaultSpanProcessors,
instrumentations: defaultInstrumentations,
});
}
return g[OTEL_SDK_KEY] as NodeSDK;
}

export const defaultSDK = getOrCreateSDK();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SDK instance is cached on globalThis, but the module-level objects it was constructed with (defaultBatchProcessor, defaultContextManager, otlpExporter, etc.) are re-created on every Vite HMR re-evaluation since they're not behind a similar guard. This means the exported defaultBatchProcessor (used in flushBatchProcessor) will be a new instance unrelated to the one inside the cached SDK — so flushBatchProcessor() will silently flush nothing. Consider guarding those module-level singletons the same way, or deriving them from the cached SDK.

Comment thread agents-api/src/instrumentation.ts
@inkeep-internal-ci
Copy link
Copy Markdown
Contributor

inkeep-internal-ci Bot commented Mar 17, 2026

Claude finished @mike-inkeep's task in 1m 12s —— View job


Re-review of PR #2684

  • Read current state of changed files
  • Review changes against previous feedback
  • Post review findings

Overall: The follow-up commit addresses the two major issues from the first review (re-exporting defaultSDK and caching the NodeSDK instance on globalThis). The idempotency approach is solid. A few items remain:

Resolved from previous reviews

  • defaultSDK export restoredcreate-agents-template import is no longer broken. ✅
  • NodeSDK construction moved behind globalThis guardgetOrCreateSDK() prevents leaking fresh SDK objects on HMR. ✅

Remaining issues

1. create-agents-template still uses the old patterncreate-agents-template/apps/agents-api/src/instrumentation.ts:19-20 still calls defaultSDK.start() directly. While this won't break (since defaultSDK is still exported), it bypasses the idempotency guard — meaning template users will still hit the MetricReader error on HMR. The template should be updated to use startOpenTelemetrySDK():

// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();

Fix this →

2. Module-level singletons are still re-created on HMRdefaultBatchProcessor, otlpExporter, defaultContextManager, defaultSpanProcessors, etc. (lines 25–83) are re-instantiated on every Vite HMR re-evaluation since they're declared at module scope without a globalThis guard. The cached SDK (via getOrCreateSDK()) still references the original instances, so:

  • flushBatchProcessor() calls forceFlush() on a new defaultBatchProcessor that has no buffered spans — effectively a no-op after HMR
  • The old otlpExporter, defaultContextManager, etc. are leaked each reload

This is a dev-only issue (HMR doesn't happen in prod), but it means spans produced after a hot reload may silently fail to flush. Consider either guarding these singletons behind globalThis the same way as the SDK, or deriving defaultBatchProcessor from the cached SDK instance.

3. OtelGlobal type could be stricter (nit) — As pullfrog noted, the current type { [key: symbol]: boolean | NodeSDK } allows assigning the wrong type to either key. A tighter type would be:

type OtelGlobal = {
  [OTEL_SDK_KEY]?: NodeSDK;
  [OTEL_STARTED_KEY]?: boolean;
};

This also eliminates the as NodeSDK cast on line 101.

4. Silent error suppression in catch block (minor) — When the MetricReader error catch fires (line 116-118), there's no log output. Adding a logger.debug or logger.warn here would help distinguish clean idempotency from error-recovery idempotency, consistent with createSafeBatchProcessor() at line 37.

Summary

Item Severity Status
Template still uses defaultSDK.start() Medium Open — bypasses idempotency guard
Module-level singletons re-created on HMR Low Open — dev-only, causes silent flush no-ops
OtelGlobal type looseness Nit Open
Missing log in catch block Nit Open
defaultSDK export removed Major ✅ Fixed in 2c8ea6c
NodeSDK leaked on HMR Major ✅ Fixed in 2c8ea6c

Item 1 is the most impactful — template users won't benefit from this fix without that change.


Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(3) Total Issues | Risk: Low

🟠⚠️ Major (1) 🟠⚠️

🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Template still calls defaultSDK.start() directly — inconsistent idempotency

Issue: The template imports defaultSDK and calls .start() directly, bypassing the new idempotent startOpenTelemetrySDK() wrapper:

const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();

While commit 2c8ea6c0d correctly restored the defaultSDK export to avoid breaking this import, the template now has inconsistent behavior compared to agents-api/src/index.ts which uses the idempotent wrapper.

Why: The template represents a real usage pattern for users scaffolding new projects. If a template user runs Vite HMR, they will still experience the MetricReader binding error because defaultSDK.start() bypasses the idempotency guards in startOpenTelemetrySDK(). The fix was applied to the main entry point but not propagated to the template.

Fix: Update the template to use the idempotent wrapper:

const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();

Refs:

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: instrumentation.ts:116-118 Silent error suppression without logging

🕐 Pending Recommendations (1)


✅ What's Working Well

The second commit (2c8ea6c0d) successfully addressed the breaking change from the first review:

  1. defaultSDK export restored — Template imports no longer break
  2. SDK instance caching via OTEL_SDK_KEY symbol on globalThis — Fixes the HMR memory leak issue where new SDK objects were created on each module re-evaluation
  3. Dual-guard pattern (flag check + error catch) — Appropriate belt-and-suspenders approach: flag guard handles the common case efficiently, error catch handles edge cases like race conditions or external code calling defaultSDK.start() directly

💡 APPROVE WITH SUGGESTIONS

Summary: The core fix is sound — SDK instance caching and the idempotent startOpenTelemetrySDK() wrapper correctly address the HMR crash. The breaking change from the first review has been fixed. Two items remain from the previous review (test coverage, logging), and one new item surfaced (template not updated). None of these are blockers, but addressing the template inconsistency would ensure all consumers benefit from the fix.

Discarded (2)
Location Issue Reason Discarded
instrumentation.ts:25-79 Module-level OTEL objects still recreated on HMR These objects don't cause the crash — they're a minor resource leak during dev. The fix correctly targets the MetricReader binding error which is the stated goal.
instrumentation.ts:90-102 getOrCreateSDK() caching untested Lower priority — if caching breaks, the error catch in startOpenTelemetrySDK() still handles it. Defense-in-depth, not critical path.
Reviewers (4)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-precision 3 1 0 0 0 0 2
pr-review-tests 2 0 0 0 0 1 1
pr-review-sre 2 0 0 0 1 0 0
pr-review-standards 0 0 0 0 0 0 0
Total 7 1 0 0 1 1 3

Note: Test coverage finding routed to Pending Recommendations as it was raised in the previous review and remains unaddressed.

Comment thread agents-api/src/instrumentation.ts Outdated
Comment on lines +116 to +118
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
g[OTEL_STARTED_KEY] = true;
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Silent error suppression without logging

Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.

Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict. The socket-mode HMR pattern logs 'Socket Mode client already running (HMR reload detected), skipping' for visibility.

Fix: Add debug-level logging for observability:

Suggested change
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
g[OTEL_STARTED_KEY] = true;
return;
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {
logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization');
g[OTEL_STARTED_KEY] = true;
return;
}

Refs:

@github-actions github-actions Bot deleted a comment from claude Bot Mar 17, 2026
@mike-inkeep
Copy link
Copy Markdown
Contributor Author

@claude please address the following. After addressing create-agents-template, decide whether we can remove the export from defaultSDK if nothing needs it anymore

Remaining issues

  1. create-agents-template still uses the old pattern — create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 still calls defaultSDK.start() directly. While this won't break (since defaultSDK is still exported), it bypasses the idempotency guard — meaning template users will still hit the MetricReader error on HMR. The template should be updated to use startOpenTelemetrySDK():

// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();
Fix this →

  1. Module-level singletons are still re-created on HMR — defaultBatchProcessor, otlpExporter, defaultContextManager, defaultSpanProcessors, etc. (lines 25–83) are re-instantiated on every Vite HMR re-evaluation since they're declared at module scope without a globalThis guard. The cached SDK (via getOrCreateSDK()) still references the original instances, so:

flushBatchProcessor() calls forceFlush() on a new defaultBatchProcessor that has no buffered spans — effectively a no-op after HMR
The old otlpExporter, defaultContextManager, etc. are leaked each reload
This is a dev-only issue (HMR doesn't happen in prod), but it means spans produced after a hot reload may silently fail to flush. Consider either guarding these singletons behind globalThis the same way as the SDK, or deriving defaultBatchProcessor from the cached SDK instance.

  1. OtelGlobal type could be stricter (nit) — As pullfrog noted, the current type { [key: symbol]: boolean | NodeSDK } allows assigning the wrong type to either key. A tighter type would be:

type OtelGlobal = {
[OTEL_SDK_KEY]?: NodeSDK;
[OTEL_STARTED_KEY]?: boolean;
};
This also eliminates the as NodeSDK cast on line 101.

  1. Silent error suppression in catch block (minor) — When the MetricReader error catch fires (line 116-118), there's no log output. Adding a logger.debug or logger.warn here would help distinguish clean idempotency from error-recovery idempotency, consistent with createSafeBatchProcessor() at line 37.

@inkeep-internal-ci
Copy link
Copy Markdown
Contributor

inkeep-internal-ci Bot commented Mar 17, 2026

Claude finished @mike-inkeep's task in 9m 31s —— View job


Addressing remaining review items

  • Read current state of changed files
  • Fix 1: Update create-agents-template to use startOpenTelemetrySDK()
  • Fix 2: Guard module-level singletons behind globalThis
  • Fix 3: Make OtelGlobal type stricter (used typed OtelCache interface with single cache object)
  • Fix 4: Add logging to catch block (logger.debug)
  • Determine if defaultSDK export can be removed
  • Commit and push — committed locally but unable to push (see below)

⚠️ Push blocked: The remote branch has a newer commit (e4f5935) that isn't in my local checkout. My CI environment's --allowedTools configuration is blocking git fetch/git pull/git rebase commands, so I can't sync before pushing. The commit is ready locally.

To apply these changes, please re-trigger me on this PR or run manually:

git fetch origin && git rebase origin/fix/hot_reload_metricreader_error && git push

Changes implemented (local commit 825d3a0)

1. Template updated (create-agents-template/apps/agents-api/src/instrumentation.ts)

  • Replaced defaultSDK.start() with startOpenTelemetrySDK() so template users get HMR idempotency

2. All OTEL singletons cached on globalThis (agents-api/src/instrumentation.ts)

  • Consolidated all globalThis state into a single OtelCache interface behind one Symbol.for('@inkeep/agents-api.otelCache') key
  • OTLPTraceExporter, BatchSpanProcessor, AsyncLocalStorageContextManager, CompositePropagator, NodeSDK, and started flag all cached via getOrCreate*() helpers
  • flushBatchProcessor() now always references the same BatchSpanProcessor instance the SDK uses — no more silent no-op after HMR

3. Stricter typing

  • Replaced { [key: symbol]: boolean | NodeSDK } with a typed OtelCache interface — proper types per field, no as NodeSDK casts

4. Logging added to catch block

  • Added logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization') consistent with createSafeBatchProcessor() pattern

5. defaultSDK export kept

  • After updating the template, no external code imports defaultSDK — only startOpenTelemetrySDK and flushBatchProcessor are imported elsewhere. However, since defaultSDK is a declared subpath export in agents-api/package.json, removing it could break unknown external consumers. Kept the export — can be deprecated in a future PR.
    | Branch: fix/hot_reload_metricreader_error

Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idempotency guard on startOpenTelemetrySDK() is correct and the error-catch is appropriately narrow. One concern: the SDK singleton on globalThis may hold stale references to module-level objects that get recreated during HMR — see inline comment.

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow runpullfrog.com𝕏

Comment thread agents-api/src/instrumentation.ts Outdated
Comment on lines +90 to +102
function getOrCreateSDK(): NodeSDK {
const g = globalThis as unknown as OtelGlobal;
if (!g[OTEL_SDK_KEY]) {
g[OTEL_SDK_KEY] = new NodeSDK({
resource: defaultResource,
contextManager: defaultContextManager,
textMapPropagator: defaultTextMapPropagator,
spanProcessors: defaultSpanProcessors,
instrumentations: defaultInstrumentations,
});
}
return g[OTEL_SDK_KEY] as NodeSDK;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SDK instance is cached on globalThis, but defaultResource, defaultSpanProcessors, defaultContextManager, etc. are module-level constants that get recreated on each HMR re-evaluation. On the second load, getOrCreateSDK() returns the old cached SDK holding references to the first set of processors/context-manager, while the rest of the module exports the new ones.

This means flushBatchProcessor() (which calls defaultBatchProcessor.forceFlush() directly) would flush the new processor that the cached SDK doesn't actually use. Consider either (a) also caching the config objects on globalThis, or (b) deriving defaultBatchProcessor from the SDK instance rather than a separate module-level var.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(0) Total Issues | Risk: Low

✅ Delta Review: Simplified Approach is Sound

The last 2 commits (4b09d0f58, 62d658804) dramatically simplified the HMR idempotency fix:

What was removed (~149 lines):

  • Complex globalThis caching with 9 Symbol keys
  • Type-safe OtelGlobal interface
  • getOrCreate*() helpers for every singleton
  • Primary flag guard + fallback error catch

What replaced it (~17 lines):

export function startOpenTelemetrySDK(): void {
  try {
    defaultSDK.start();
  } catch (error) {
    const msg = error instanceof Error ? error.message : String(error);
    const allowDuplicateStartErrors =
      env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
    if (
      allowDuplicateStartErrors &&
      ((msg.includes('MetricReader') && msg.includes('can not be bound')) ||
        msg.includes('Attempted duplicate registration of API'))
    ) {
      logger.debug({}, 'OpenTelemetry SDK already started');
      return;
    }
    throw error;
  }
}

Why This Works

  1. Production safety preserved — The allowDuplicateStartErrors check ensures errors are only suppressed in development mode. Production startup failures propagate correctly.

  2. Narrow error matching — Only suppresses two specific known HMR-related OpenTelemetry errors. Any other error (config issues, connection failures, etc.) is re-thrown.

  3. Observability maintainedlogger.debug() call provides visibility when startup is skipped.

  4. Simplicity wins — The previous implementation was over-engineered for a dev-only problem. This approach is ~9x smaller and equally effective at fixing the HMR crash.

Known Tradeoff (Acknowledged)

The stale singleton concern (raised by pullfrog in prior reviews) — where flushBatchProcessor() may flush the wrong instance after HMR — was deliberately accepted by the PR author. The commit message explicitly states: "Simplify to just suppress the error since it's not an issue in prod, only local".

This is an acceptable tradeoff because:

  • HMR only occurs in development
  • Production telemetry is never affected
  • The fix achieves its stated goal (prevent 500 errors requiring dev server restart)

✅ APPROVE

Summary: The simplified approach correctly fixes the Vite HMR crash with minimal code. The delta removes unnecessary complexity while preserving production safety. The known dev-only tradeoff (flushBatchProcessor() may no-op after HMR) is explicitly acknowledged and acceptable. Ship it! 🚀

Discarded (1)
Location Issue Reason Discarded
instrumentation.ts:42,112-118 Stale singleton regression — flushBatchProcessor() may flush wrong instance after HMR Deliberately addressed by author in commit 4b09d0f58: "not an issue in prod, only local". Dev-only tradeoff explicitly accepted.
Reviewers (4)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 0 0 0 0 0 0 0
pr-review-sre 0 0 0 0 0 0 0
pr-review-precision 1 0 0 0 0 0 1
pr-review-errors 0 0 0 0 0 0 0
Total 1 0 0 0 0 0 1

Note: The precision reviewer's finding about stale singletons was discarded because the PR author explicitly acknowledged and accepted this tradeoff in the commit message.

@github-actions github-actions Bot deleted a comment from claude Bot Mar 17, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 17, 2026

Ito Test Report ✅

19 test cases ran. 19 passed.

✅ Verification found stable behavior across health/readiness checks, auth flows, navigation stress, and OTEL reload scenarios for the included cases. 🔍 Code review of the authentication/session and instrumentation paths did not surface a production-code defect for the non-passed stale-session case, so the included set contains only confirmed passes.

✅ Passed (19)
Test Case Summary Timestamp Screenshot
ROUTE-1 Baseline and three repeated /health checks all returned HTTP 204 with no server error content. 0:00 ROUTE-1_0-00.png
ROUTE-2 Baseline and repeated /ready checks returned HTTP 200 with expected readiness schema and no unhandled 500. 0:00 ROUTE-2_0-00.png
ROUTE-3 Root navigation resolved on localhost (authenticated route) with no API 500 responses observed in captured network evidence. 2:02 ROUTE-3_2-02.png
ROUTE-4 Login form fields and submit were interactive; invalid credentials produced a recoverable auth error with no white-screen or crash. 2:02 ROUTE-4_2-02.png
LOGIC-1 After three reload cycles, API health remained 204 and the UI consistently recovered to Projects without persistent failure. 3:29 LOGIC-1_3-29.png
LOGIC-2 Template instrumentation path stayed available across reload with /health=204 and /ready=200 before and after the cycle. 3:43 LOGIC-2_3-43.png
EDGE-1 Ten rapid refreshes during an active reload window all converged back to Projects and health checks stayed 204. 4:00 EDGE-1_4-00.png
EDGE-2 Deep-link tab refreshed during reload churn and recovered to a valid projects state without unrecoverable error page. 7:37 EDGE-2_7-37.png
EDGE-3 Repeated back/forward actions during reload pressure converged to valid UI routes without persistent failure loop. 8:02 EDGE-3_8-02.png
EDGE-4 With placeholder OTEL endpoint configured and reloaded, /health and /ready remained operational without persistent errors. 4:26 EDGE-4_4-26.png
EDGE-5 Concurrent polling and midpoint reload stayed stable; no persistent failing state observed after stabilization. 9:09 EDGE-5_9-09.png
EDGE-6 At 390x844 viewport, root/login flow remained usable and submit interaction completed with recoverable response. 9:59 EDGE-6_9-59.png
ADV-1 A 50-request burst during reload returned 204 throughout, including final follow-up, indicating no stuck degraded state. 5:09 ADV-1_5-09.png
ADV-2 Rapid repeated sign-in submissions converged to a stable authenticated projects state without persistent crash behavior. 11:46 ADV-2_11-46.png
ADV-3 Both unauthenticated and wrong-token dev-session requests returned 401 Unauthorized. 11:20 ADV-3_11-20.png
ADV-4 Login with returnUrl=https://evil.example stayed within localhost routes after authentication; no external redirect occurred. 2:02 ADV-4_2-02.png
ADV-6 Two deliberate non-duplicate OTEL misconfiguration attempts during reload did not produce a persistent hidden failure; service health and readiness stayed stable and restoration succeeded. 5:42 ADV-6_5-42.png
ROUTE-5 Projects view remained available while periodic /health probes continued returning stable 204 responses. 13:17 ROUTE-5_13-17.png
ROUTE-6 Dev-session accepted valid bypass secret and browser navigation resolved to authenticated projects route. 13:21 ROUTE-6_13-21.png
📋 View Recording

Screen Recording

Copy link
Copy Markdown
Collaborator

@dimaMachina dimaMachina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude add a changeset

@dimaMachina dimaMachina requested a review from amikofalvy March 17, 2026 19:11
@inkeep-internal-ci
Copy link
Copy Markdown
Contributor

inkeep-internal-ci Bot commented Mar 17, 2026

Claude finished @dimaMachina's task in 54s —— View job


Adding changeset

  • Determine changeset parameters
  • Create changeset file
  • Commit and push

Added @inkeep/agents-api patch changeset: "Fix OpenTelemetry SDK startup crash during Vite HMR by making initialization idempotent"
| Branch: fix/hot_reload_metricreader_error

Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch-and-suppress approach works as a quick fix, but it has two structural weaknesses: string-matching is fragile against upstream OTel SDK message changes, and module-level singletons (exporters, processors) are still re-instantiated on every HMR reload — leaking resources even when the error is caught. A globalThis sentinel guard would be more robust and avoids both issues. See inline comments.

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow runpullfrog.com𝕏

Comment on lines +93 to +110
export function startOpenTelemetrySDK(): void {
try {
defaultSDK.start();
} catch (error) {
const msg = error instanceof Error ? error.message : String(error);
const allowDuplicateStartErrors =
env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
if (
allowDuplicateStartErrors &&
((msg.includes('MetricReader') && msg.includes('can not be bound')) ||
msg.includes('Attempted duplicate registration of API'))
) {
logger.debug({}, 'OpenTelemetry SDK already started');
return;
}
throw error;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error-message string matching ('MetricReader', 'can not be bound', 'Attempted duplicate registration of API') is coupled to the exact wording chosen by the OpenTelemetry SDK. If an OTel SDK upgrade rephrases these messages, legitimate duplicate-start errors will slip through the guard and crash the dev server again.

Consider a more resilient approach — either:

  1. Guard with a globalThis sentinel so start() is never called twice:
const OTEL_STARTED = Symbol.for('inkeep.otel.started');
export function startOpenTelemetrySDK(): void {
  if ((globalThis as any)[OTEL_STARTED]) {
    logger.debug({}, 'OpenTelemetry SDK already started (HMR)');
    return;
  }
  defaultSDK.start();
  (globalThis as any)[OTEL_STARTED] = true;
}
  1. Or at minimum, broaden the catch to suppress any error in dev mode whose message mentions "already" or "duplicate" (still fragile, but less so).

Option 1 avoids string matching entirely and is the pattern recommended by OTel maintainers for singleton initialization.

Comment on lines +93 to +110
export function startOpenTelemetrySDK(): void {
try {
defaultSDK.start();
} catch (error) {
const msg = error instanceof Error ? error.message : String(error);
const allowDuplicateStartErrors =
env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
if (
allowDuplicateStartErrors &&
((msg.includes('MetricReader') && msg.includes('can not be bound')) ||
msg.includes('Attempted duplicate registration of API'))
) {
logger.debug({}, 'OpenTelemetry SDK already started');
return;
}
throw error;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when the error is caught, every Vite HMR reload re-executes the module top-level code, creating new OTLPTraceExporter, BatchSpanProcessor, AsyncLocalStorageContextManager, and NodeSDK instances. The old ones remain registered in the OTel global API — this silently leaks resources (open connections, timers, etc.).

If you adopt a globalThis guard, you can also stash the SDK and processors on globalThis so the module reuses the existing instances on HMR reload rather than creating duplicates.

@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 17, 2026

Ito Test Report ✅

19 test cases ran. 19 passed.

All planned API, auth, navigation, resilience, and security-path checks completed successfully, and verification did not surface any code-level defect hypothesis requiring escalation to a Failed classification. 🔍✅

✅ Passed (19)
Test Case Summary Timestamp Screenshot
ROUTE-1 Executed five /health checks; all responses were HTTP 204 with no transient 500. 0:00 ROUTE-1_0-00.png
ROUTE-2 Readiness check returned HTTP 200 with expected JSON contract fields and values. 0:00 ROUTE-2_0-00.png
ROUTE-3 Root navigation resolved to /default/projects and network requests showed no 500 errors. 0:48 ROUTE-3_0-48.png
ROUTE-4 Seeded admin credentials were accepted on /login and redirected successfully to /default/projects. 1:28 ROUTE-4_1-28.png
ROUTE-5 Direct entry to /default/projects in a cleared session resolved cleanly with no 500 responses. 1:46 ROUTE-5_1-46.png
ROUTE-6 Unauthenticated POST to protected run route returned 404 (4xx) with no 500. 3:15 ROUTE-6_3-15.png
LOGIC-1 After 30 rapid reloads, the app remained usable and API checks stayed healthy. 6:47 LOGIC-1_6-47.png
LOGIC-2 Valid Authorization bypass secret returned HTTP 200 and included set-cookie for session creation. 3:15 LOGIC-2_3-15.png
LOGIC-3 GET /openapi.json returned HTTP 200 and repeated requests remained stable. 8:14 LOGIC-3_8-14.png
EDGE-1 Five concurrent tabs bootstrapped successfully to valid authenticated routes without systemic 500 errors. 6:47 EDGE-1_6-47.png
EDGE-2 Back/forward/reload navigation stabilized and did not enter an infinite redirect or server-error loop. 6:47 EDGE-2_6-47.png
EDGE-3 Interrupting login with immediate refresh remained recoverable; retry reached the projects page. 6:47 EDGE-3_6-47.png
EDGE-4 Performed 60 one-second readiness polls; all returned HTTP 200 with no persistent startup failure. 0:00 EDGE-4_0-00.png
ADV-1 POST without Authorization header returned HTTP 401 and did not issue a session cookie. 3:15 ADV-1_3-15.png
ADV-2 Fuzzed Authorization variants produced 4xx responses, and post-fuzz /health remained 204. 3:15 ADV-2_3-15.png
ADV-3 External and javascript returnUrl payloads did not redirect off-site; navigation stayed on internal routes. 10:19 ADV-3_10-19.png
ADV-4 Cross-origin page-context POST without Authorization returned 401 and created no session cookie. 3:15 ADV-4_3-15.png
RAPID-1 Rapid repeated login submission converged to a coherent authenticated session without server errors. 6:47 RAPID-1_6-47.png
MOBILE-1 At 390x844 viewport, login resolved to projects and core UI remained accessible. 10:20 MOBILE-1_10-20.png
📋 View Recording

Screen Recording

@mike-inkeep mike-inkeep enabled auto-merge March 19, 2026 20:17
@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Mar 19, 2026

TL;DR — Wraps the OpenTelemetry SDK start() call in an idempotent helper that catches and swallows duplicate-registration errors during Vite HMR in development, preventing a crash that required restarting pnpm dev.

Key changes

  • startOpenTelemetrySDK() — New wrapper function in instrumentation.ts that catches MetricReader / duplicate API registration errors in development mode instead of letting them crash the process.
  • index.ts / template instrumentation.ts — Callers updated from defaultSDK.start() to startOpenTelemetrySDK().

Summary | 4 files | 9 commits | base: mainfix/hot_reload_metricreader_error


startOpenTelemetrySDK() wrapper

Before: defaultSDK.start() was called directly — a second invocation during Vite HMR threw an unhandled MetricReader can not be bound to a MeterProvider again error, taking down the dev server.
After: startOpenTelemetrySDK() wraps the call in a try/catch that, in development only, silently absorbs known duplicate-start errors and logs at debug level.

The guard is scoped to two specific error messages (MetricReader…can not be bound and Attempted duplicate registration of API) and only active when ENVIRONMENT or NODE_ENV is development — all other errors and all production calls re-throw unchanged.

agents-api/src/instrumentation.ts · agents-api/src/index.ts · create-agents-template/.../instrumentation.ts

Pullfrog  | View workflow run | Triggered by Pullfrogpullfrog.com𝕏

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(0) Total Issues | Risk: Low

✅ Delta Review: No New Issues

The delta since the last automated review (62d658804) consists of:

  1. Changeset addition (93f033c95) — Correctly scoped @inkeep/agents-api: patch with clear message
  2. Merge with main (9c711d1f5) — Integrates SSO configuration changes; no conflicts with PR scope

Implementation Summary (unchanged since last review)

The final implementation is clean and effective:

export function startOpenTelemetrySDK(): void {
  try {
    defaultSDK.start();
  } catch (error) {
    const msg = error instanceof Error ? error.message : String(error);
    const allowDuplicateStartErrors =
      env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
    if (
      allowDuplicateStartErrors &&
      ((msg.includes('MetricReader') && msg.includes('can not be bound')) ||
        msg.includes('Attempted duplicate registration of API'))
    ) {
      logger.debug({}, 'OpenTelemetry SDK already started');
      return;
    }
    throw error;
  }
}

Why this works:

  • Production safety preserved — Errors only suppressed in development mode
  • Narrow error matching — Only known HMR-related OTel errors are caught
  • Observability maintainedlogger.debug() call for visibility
  • Template consistency — Both agents-api/src/index.ts and create-agents-template use the idempotent wrapper
  • Changeset included — Proper patch version bump with descriptive message

Prior Review Feedback Status

Issue Status
Breaking change: template imports removed defaultSDK export ✅ Fixed — template updated to use startOpenTelemetrySDK()
Module-level singletons re-created on HMR ✅ Acknowledged — dev-only tradeoff explicitly accepted
Silent error suppression ✅ Fixed — logger.debug() added
Missing test coverage ⚠️ Open — noted but not blocking (dev-only code path)

🕐 Pending Recommendations (1)

  • 🟡 Previous review: Missing test coverage for startOpenTelemetrySDK() — The function has three code paths (success, error suppression, error rethrow) with no unit tests. Not blocking, but would help prevent regressions.

✅ APPROVE

Summary: The implementation is sound, the changeset is properly formatted, and all blocking issues from prior reviews have been addressed. The simplified try-catch approach is an appropriate solution for a dev-only HMR problem. Ship it! 🚀

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator 1 0 0 0 0 1 0
Total 1 0 0 0 0 1 0

Note: Delta review — no sub-agents dispatched as the only PR-specific change was a changeset addition.

@github-actions github-actions Bot deleted a comment from claude Bot Mar 19, 2026
@mike-inkeep mike-inkeep added this pull request to the merge queue Mar 19, 2026
Merged via the queue into main with commit fddbd38 Mar 19, 2026
17 checks passed
@mike-inkeep mike-inkeep deleted the fix/hot_reload_metricreader_error branch March 19, 2026 20:34
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 19, 2026

Ito Test Report ❌

15 test cases ran. 14 passed, 1 failed.

This run shows strong overall stability for routing, auth endpoint hardening, stress handling, and startup guard behavior ✅. One verified regression remains in the template OTEL startup path: the template now calls a newer instrumentation API while still depending on an older @inkeep/agents-api version, which can cause OTEL initialization to fail at runtime ❌.

✅ Passed (14)
Test Case Summary Timestamp Screenshot
ROUTE-1 Health endpoint returned 204 at baseline, across 10 stress loop checks, and post-stress with no degradation observed. 0:00 ROUTE-1_0-00.png
ROUTE-2 Readiness endpoint returned 200 with expected health payload at baseline and after stress activity. 0:00 ROUTE-2_0-00.png
ROUTE-4 Valid bearer secret returns 200 with session cookie and browser navigation reaches protected projects page. 3:56 ROUTE-4_3-56.png
LOGIC-1 After 10 reload stress cycles, service availability remained stable with repeated /health=204 and /ready=200 responses and no persistent degradation. 23:04 LOGIC-1_23-04.png
LOGIC-2 Under repeated protected-route churn and reload stress, API and UI remained healthy with no persistent outage cascade. 23:29 LOGIC-2_23-29.png
LOGIC-3 Validated via direct instrumentation fault injection: startOpenTelemetrySDK re-threw a synthetic non-duplicate startup failure. 30:24 LOGIC-3_30-24.png
LOGIC-4 Production-mode fault injection re-threw duplicate registration error and did not suppress it. 30:24 LOGIC-4_30-24.png
EDGE-2 Placeholder OTEL headers skipped full telemetry startup while template API remained healthy. 20:38 EDGE-2_20-38.png
EDGE-3 Invalid OTEL headers logged initialization failure while API continued serving /health=204. 20:38 EDGE-3_20-38.png
EDGE-4 On 390x844 viewport, repeated reloads kept the UI usable and /health remained healthy with HTTP 204. 1:27 EDGE-4_1-27.png
EDGE-5 Back/forward and refresh loop remained deterministic without redirect trap or blank screen. 1:44 EDGE-5_1-44.png
ADV-1 Missing Authorization was rejected with 401 and did not establish a session. 4:07 ADV-1_4-07.png
ADV-2 Empty, unicode, and oversized bearer token variants were rejected with 401 and did not authenticate. 4:26 ADV-2_4-26.png
ADV-4 Sustained high-churn activity did not cause persistent API degradation and UI remained usable. 10:20 ADV-4_10-20.png
❌ Failed (1)
Test Case Summary Timestamp Screenshot
ROUTE-5 Template API remained reachable, but OTEL initialization errored (startOpenTelemetrySDK is not a function) under non-placeholder configuration. 20:37 ROUTE-5_20-37.png
Template app still boots when OTEL is configured – Failed
  • Where: Template app OTEL bootstrap path (create-agents-template API startup).

  • Steps to reproduce: Start template API with real (non-placeholder) OTEL endpoint/headers so instrumentation path executes.

  • What failed: OTEL initialization hits startOpenTelemetrySDK is not a function; expected behavior is successful OTEL startup (or at minimum calling a valid exported function).

  • Code analysis: The template imports and calls startOpenTelemetrySDK from @inkeep/agents-api/instrumentation, but the template dependency is pinned to @inkeep/agents-api ^0.58.5. This creates a compatibility gap when template code expects a newer export surface than the pinned package guarantees.

  • Relevant code:

    create-agents-template/apps/agents-api/src/instrumentation.ts (lines 17–23)

    if (isRealEndpoint) {
      try {
        const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
        startOpenTelemetrySDK();
        console.log(`[OTEL] Instrumentation enabled (endpoint: ${otlpEndpoint})`);
      } catch (error) {

    create-agents-template/apps/agents-api/package.json (lines 12–15)

    "dependencies": {
      "@hono/node-server": "^1.19.10",
      "@hono/zod-openapi": "^1.1.5",
      "@inkeep/agents-api": "^0.58.5",

    agents-api/src/instrumentation.ts (lines 93–100)

    export function startOpenTelemetrySDK(): void {
      try {
        defaultSDK.start();
      } catch (error) {
        const msg = error instanceof Error ? error.message : String(error);
        const allowDuplicateStartErrors =
          env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
  • Why this is likely a bug: The template invokes a symbol that may not exist in its declared dependency version, producing a deterministic runtime type error in production code paths.

  • Introduced by this PR: Yes – this PR modified the relevant code.

  • Timestamp: 20:37

📋 View Recording

Screen Recording

dimaMachina added a commit that referenced this pull request Mar 20, 2026
* fix: Make OpenTelemetry startup idempotent

* fix: Re-export defaultSDK and cache NodeSDK instance on globalThis

Restores the export on defaultSDK to avoid breaking the
create-agents-template subpath import. Moves the new NodeSDK()
construction behind a globalThis guard (getOrCreateSDK) so
repeated Vite HMR module evaluations reuse the same instance instead
of leaking fresh SDK objects.

Co-authored-by: mike-inkeep <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(template): use idempotent startOpenTelemetrySDK() in instrumentation

* fix: guard all OTel singletons behind globalThis for HMR idempotency

- Cache otlpExporter, batchProcessor, resource, instrumentations,
  spanProcessors, contextManager, and propagator on globalThis via
  Symbol keys and getOrCreate* helpers so HMR re-evaluation reuses
  existing instances instead of leaking new ones
- Make OtelGlobal type strict with per-key types, eliminating the
  loose `boolean | NodeSDK` union and the `as NodeSDK` cast
- Add logger.debug in the MetricReader catch block to distinguish
  clean idempotency from error-recovery idempotency
- Remove defaultSDK export (now module-private) since all consumers
  use startOpenTelemetrySDK() instead

* Fix type errors

* Simplify to just suppress the error since it's not an issue in prod, only local

* Limit to dev mode

* Add changeset for OTel HMR idempotency fix

Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
Co-authored-by: Dimitri POSTOLOV <[email protected]>
github-merge-queue Bot pushed a commit that referenced this pull request Mar 30, 2026
* skill generator

* polish skill generator

* skills tests

* upd

* upd

* generation.test wip

* add generation.test

* tree node

* skill page

* skill loader

* skill loader refactor

* skill loader

* move skills sidebar to layout

* use pure monaco-editor component since we can have different file extension

* add shadcn context menu component

* format context menu

* skills files and edit pages

* dry

* update layout

* add docs

* add a changeset

* redirect to first skill

* skill files utils

* skill selector

* upd treenode

* skill files

* skill file editor

* delete skill confirmation

* add skill files actions

* skills data

* rm

* up skills route

* upd

* upd

* better project error message on dev

* types

* skill files

* skill loader

* format

* project test

* entities

* project full tests

* upd introspect

* upd cliiii

* nested skills tests

* remove edit page

* remove edit page

* update files page

* upd

* upd file editor

* add SkillFileInsertSchema

* superRefine

* add transform

* rm some cases in superRefine

* use pipe

* use pipe

* upd skill loader

* validation skills

* upd

* rm

* upd

* data access tests

* skills db changes

* add

* skill files

* upd

* upd

* upd skill update

* SkillUpdateSchema has required files

* upd skills manage

* upd

* upd layout and page

* style: auto-format with biome

* move empty state comp to page

* upd schemas

* update schemas

* move to with-sidebar

* polish

* upd

* upd skill generator

* Make webhooks docs user friendly (#2752)

* shaping a2a webhooks page

* moved triggers to visual builder

* vb webhooks wip

* numbered TOC steps

* added step circles

* indented toc steps more

* added newsletter signup to docs

* added share feedback button

* moved newsletter subscribe route to agent docs

* subscribe confirm polish

* improved spacing

* improved spacing

* added high quality images

* added verification step

* Sync lockfile after rebase

* Use tag reference for pullfrog action instead of pinned SHA (#2757)

The action is actively under development, so referencing the v0 tag
allows picking up updates automatically.

https://claude.ai/code/session_01QZyvEs97scVf1ahTG8C1rV

Co-authored-by: Claude <[email protected]>

* ci: provision PR preview environments in Railway (#2681)

* ci: add preview env diagnostics

* ci: probe preview env schema before deploy

* ci: probe preview env schema before deploy

* ci: harden preview api env defaults

* ci: attach git metadata to preview deploys

* ci: harden preview workflow operations

* ci: broaden preview log redaction

* ci: extract preview workflow scripts

* ci: harden preview script extraction

* fix(ci): correct Playwright cache restore-key prefix mismatch (#2760)

The restore-keys used `${{ runner.os }}-playwright-` but primary keys
used `playwright-${{ runner.os }}-`, so the prefix never matched on
cache miss, forcing a full browser download (~8.5 min) instead of a
cache restore (~13 sec).

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* fix(ci): replace full git clone with shallow checkout in CI job (#2761)

Remove fetch-depth: 0 from the ci job's checkout step, which cloned the
entire git history (1.5-5 min overhead). Only the OpenAPI change detection
step needs the base branch ref, so fetch it on-demand with --depth=1.

Also switches the diff from three-dot merge-base syntax to a two-dot
pathspec-filtered diff against the fetched base ref, which works correctly
with shallow clones.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* format

* rm migration

* add new migrations

* validation for skill is ok

* move empty state to page

* delete skill

* delete skill revalidate path

* move skills schemas to own file

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* more typecheck fixes

* more typecheck fixes

* fix

* fix isRequired

* f1x

* move skill sidebar

* refactor skill sidebar

* add collapse file tree button

* upd

* upd

* upd

* deleteSkillFile

* upd

* deleteSkillFile

* fileId

* fileId

* upd schemas

* DeleteSkillFileConfirmation

* updateSkillFile

* rm simplematter from sdk

* Get Skill File

* getSkillFileById

* add new skill file page

* update skill file editor

* format

* Create Skill File

* upd

* createSkillFileAction

* createSkillFileById

* fix: Make OpenTelemetry startup idempotent (#2684)

* fix: Make OpenTelemetry startup idempotent

* fix: Re-export defaultSDK and cache NodeSDK instance on globalThis

Restores the export on defaultSDK to avoid breaking the
create-agents-template subpath import. Moves the new NodeSDK()
construction behind a globalThis guard (getOrCreateSDK) so
repeated Vite HMR module evaluations reuse the same instance instead
of leaking fresh SDK objects.

Co-authored-by: mike-inkeep <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(template): use idempotent startOpenTelemetrySDK() in instrumentation

* fix: guard all OTel singletons behind globalThis for HMR idempotency

- Cache otlpExporter, batchProcessor, resource, instrumentations,
  spanProcessors, contextManager, and propagator on globalThis via
  Symbol keys and getOrCreate* helpers so HMR re-evaluation reuses
  existing instances instead of leaking new ones
- Make OtelGlobal type strict with per-key types, eliminating the
  loose `boolean | NodeSDK` union and the `as NodeSDK` cast
- Add logger.debug in the MetricReader catch block to distinguish
  clean idempotency from error-recovery idempotency
- Remove defaultSDK export (now module-private) since all consumers
  use startOpenTelemetrySDK() instead

* Fix type errors

* Simplify to just suppress the error since it's not an issue in prod, only local

* Limit to dev mode

* Add changeset for OTel HMR idempotency fix

Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
Co-authored-by: Dimitri POSTOLOV <[email protected]>

* Fix scheduled trigger invocations being skipped (#2777)

* Fix scheduled trigger invocations being skipped when trigger is edited without changing the next execution time

* claude comments

* adding app id (#2779)

* Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps (#2780)

* Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps

- Update pullfrog from v0.0.178 SHA to v0.0.181 SHA (30d68e5) to stay on
  commit-pinned references for security (action has write permissions + 9 API keys)
- Split dependabot github-actions config into a "high-frequency" group for
  pullfrog with daily schedule, so SHA pins get updated automatically
- This supersedes PR #2757's approach of moving to mutable tag references

https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap

* Split dependabot github-actions into daily (pullfrog) and monthly (rest)

Separate into two ecosystem entries so pullfrog gets daily SHA updates
while other GitHub Actions stay on a monthly cadence.

https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap

* Fix invalid dependabot config: merge duplicate github-actions entries

Dependabot disallows duplicate ecosystem+directory pairs. Use a single
entry with two groups instead: high-frequency (pullfrog) and github-actions
(everything else via exclude-patterns).

https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap

---------

Co-authored-by: Claude <[email protected]>

* ci: seed preview auth in PR previews (#2775)

* ci: bootstrap preview auth

* ci: require secure preview auth config

* ci: recover preview auth runtime vars

* ci: install railway in preview bootstrap

* ci: provision preview db tcp proxies

* ci: proxy preview spicedb bootstrap

* ci: harden preview retry and error logging

---------

Co-authored-by: Andrew Mikofalvy <[email protected]>

* Fix scopes placeholder to show correct Nango format (#2784)

* Fix misleading scopes placeholder in credential form

The Nango API validates scopes against a strict comma-separated pattern
with no spaces. Updated placeholder and help text to show the correct
format and prevent 400 errors when users enter multiple scopes.

Made-with: Cursor

* Add changeset for scopes placeholder fix

Made-with: Cursor

* fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup (#2776)

* fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup

Reorder Zod schema construction so custom validators (e.g. URL protocol
allowlist) are chained after required/optional base schema instead of
being overwritten. Move all React hooks above the canEdit early-return
guard to satisfy Rules of Hooks, with canEdit checks inside hook bodies.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(manage-ui): add server-side URL protocol validation in buildCredentialsPayload

Validate app_link against HTTP/HTTPS allowlist in the server action to
prevent bypassing client-side form validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Update agents-manage-ui/src/components/credentials/views/generic-auth-form.tsx

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* fix err

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* feat(pdf): Support PDF attachments (#2709)

* feat(pdf): Support PDF attachments

* Add tests and other review feedback

* Fix doc

* More renaming and cleanup

* refactor: extract Vercel content part schemas to types/chat.ts for reuse

Move inline Zod schemas from chatDataStream.ts and message-parts.ts into
types/chat.ts as shared, exported schemas. This eliminates duplicate
definitions and makes schema management easier.

Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

* feat: Composio connected account ID pinning (#2783)

* feat: Composio connected account ID pinning

Pin connected_account_id to Composio MCP URLs to prevent cross-project
credential leakage. Implements "both or none" policy — user_id and
connected_account_id are injected together or not at all.

- Add ComposioCredentialStore for credential lifecycle management
- Update AgentMcpManager and discoverToolsFromServer with pinning logic
- Mark Composio tools without connectedAccountId as needs_auth
- Add generic disconnect credential UI (works for all credential types)
- Store authScheme in credential retrievalParams for display
- Update OAuth login flow to create credential references post-connect
- Add unit tests for new credential store, composio client, and pinning

Made-with: Cursor

* feedback

* fix test

* Version Packages (#2778)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Add back link to projects sidebar, add org settings link to user drop… (#2787)

* Add back link to projects sidebar, add org settings link to user dropdown, adjust sidebar highlight color in dark mode

* Apply suggestion from @claude[bot]

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Fix bad claude formatting

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* fix: return FileUIPart-compliant file parts from /run conversations endpoint (#2782)

* fix: return Vercel AI SDK FileUIPart-compliant file parts from /run conversations endpoint

- Resolve blob:// URIs to proxy HTTP URLs via resolveMessagesListBlobUris()
- Reshape file parts from { data, metadata.mimeType } to { url, mediaType, filename? }
- Matches Vercel AI SDK FileUIPart spec for useChat() compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Skip malformed file parts

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Mike Rashkovsky <[email protected]>

* fix: treat load_skill as internal tool to suppress false-positive Sentry errors (#2756)

* fix: provide relationshipId for load_skill tool calls in graph events

* fix: treat load_skill as internal tool, suppress chat/graph streaming events

* fix for fetch trace (#2791)

* fix for fetch trace

* fix for fetch trace

* Fix empty breadcrumb on `/[tenantId]/profile` page and replace prop-drilled permission flags (`readOnly`, `canEdit`, `canUse`) with direct hook call `useProjectPermissionsQuery()` (#2792)

* upd

* upd

* format

* format

* format

* format

* format

* format

* format

* format

* format

* fix review

* fix breadcrumb on profile page

* Apply suggestions from code review

Co-authored-by: Dimitri POSTOLOV <[email protected]>

* Update agents-manage-ui/src/lib/api/projects.ts

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Update agents-manage-ui/src/app/[tenantId]/profile/layout.tsx

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* style: auto-format with biome

* fix review

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(manage-ui): fix user-scoped MCP credential card not refreshing after connect/disconnect (#2794)

Fetch user-scoped credential server-side in page.tsx (matching the
project-scoped pattern) instead of via a client-side React Query hook.
This ensures router.refresh() after OAuth connect or credential delete
re-fetches the credential data, so the "Your Connection" card updates
without a manual page refresh.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect` (#2793)

* reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect`

* format

* upd

* fix lint

* Create little-hounds-battle.md

* upd

* upd skill file editor

* polish skill editor like in github

* remove canEdit

* upd

* move skill metadata under collapsible advanced section

* reuse DeleteConfirmation

* upd skill file editor

* upd skill file editor2

* add useInitialCollapsedSidebar

* add useInitialCollapsedSidebar

* upd skill file editor

* rm

* // Avoid including metadata in the frontmatter when it's null

* fetchSkillFile and createSkillFile

* refactor skill breadcrumb

* format

* polish

* upd

* skills integration tests

* fix validation tests

* update skill form

* upd api skills in manage ui

* upd entities

* partial

* fix skill loader test

* chore: update OpenAPI snapshot

* polish skill file editor

* upd core skills tests

* upd core skills tests

* add SkillCreateDataSchema

* update skills data manage

* remove redundant

* lint

* lint

* typecheck

* typecheck

* typecheck

* knip

* lint

* rollback skill modals

* make modal opens in skill selector

* fix typecheck

* this should fix cypress

* fix sdk tests

* split permissions call

* add folder feature

* findNodeByPath

* SkillDirectoryBrowser

* upd

* polish

* fix

* fix edge case metadata validation

* fix

* fix cli test

* format

* upd

* upd

* upd

* chore: update OpenAPI snapshot

* fix skill generator

* add button group

* connect submit logic with extension select

* polish

* update skill generator tests

* update generation test

* polish skill generator

* format

* format

* fixes for tests

* typecheck

* fix review

* format

* new migration

* upd

* rm migration

* add migrations

* fix migration and add

* rm outdated

* Apply suggestions from code review

Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* polish

* Add detailed changeset for nested skill files feature

Co-authored-by: Dimitri POSTOLOV <[email protected]>

* fix typecheck

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Gaurav Varma <[email protected]>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-authored-by: shagun-singh-inkeep <[email protected]>
Co-authored-by: omar-inkeep <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com>
Co-authored-by: sarah <[email protected]>
Co-authored-by: Abraham <[email protected]>
Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
tim-inkeep pushed a commit that referenced this pull request Mar 31, 2026
* skill generator

* polish skill generator

* skills tests

* upd

* upd

* generation.test wip

* add generation.test

* tree node

* skill page

* skill loader

* skill loader refactor

* skill loader

* move skills sidebar to layout

* use pure monaco-editor component since we can have different file extension

* add shadcn context menu component

* format context menu

* skills files and edit pages

* dry

* update layout

* add docs

* add a changeset

* redirect to first skill

* skill files utils

* skill selector

* upd treenode

* skill files

* skill file editor

* delete skill confirmation

* add skill files actions

* skills data

* rm

* up skills route

* upd

* upd

* better project error message on dev

* types

* skill files

* skill loader

* format

* project test

* entities

* project full tests

* upd introspect

* upd cliiii

* nested skills tests

* remove edit page

* remove edit page

* update files page

* upd

* upd file editor

* add SkillFileInsertSchema

* superRefine

* add transform

* rm some cases in superRefine

* use pipe

* use pipe

* upd skill loader

* validation skills

* upd

* rm

* upd

* data access tests

* skills db changes

* add

* skill files

* upd

* upd

* upd skill update

* SkillUpdateSchema has required files

* upd skills manage

* upd

* upd layout and page

* style: auto-format with biome

* move empty state comp to page

* upd schemas

* update schemas

* move to with-sidebar

* polish

* upd

* upd skill generator

* Make webhooks docs user friendly (#2752)

* shaping a2a webhooks page

* moved triggers to visual builder

* vb webhooks wip

* numbered TOC steps

* added step circles

* indented toc steps more

* added newsletter signup to docs

* added share feedback button

* moved newsletter subscribe route to agent docs

* subscribe confirm polish

* improved spacing

* improved spacing

* added high quality images

* added verification step

* Sync lockfile after rebase

* Use tag reference for pullfrog action instead of pinned SHA (#2757)

The action is actively under development, so referencing the v0 tag
allows picking up updates automatically.

https://claude.ai/code/session_01QZyvEs97scVf1ahTG8C1rV

Co-authored-by: Claude <[email protected]>

* ci: provision PR preview environments in Railway (#2681)

* ci: add preview env diagnostics

* ci: probe preview env schema before deploy

* ci: probe preview env schema before deploy

* ci: harden preview api env defaults

* ci: attach git metadata to preview deploys

* ci: harden preview workflow operations

* ci: broaden preview log redaction

* ci: extract preview workflow scripts

* ci: harden preview script extraction

* fix(ci): correct Playwright cache restore-key prefix mismatch (#2760)

The restore-keys used `${{ runner.os }}-playwright-` but primary keys
used `playwright-${{ runner.os }}-`, so the prefix never matched on
cache miss, forcing a full browser download (~8.5 min) instead of a
cache restore (~13 sec).

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* fix(ci): replace full git clone with shallow checkout in CI job (#2761)

Remove fetch-depth: 0 from the ci job's checkout step, which cloned the
entire git history (1.5-5 min overhead). Only the OpenAPI change detection
step needs the base branch ref, so fetch it on-demand with --depth=1.

Also switches the diff from three-dot merge-base syntax to a two-dot
pathspec-filtered diff against the fetched base ref, which works correctly
with shallow clones.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* format

* rm migration

* add new migrations

* validation for skill is ok

* move empty state to page

* delete skill

* delete skill revalidate path

* move skills schemas to own file

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* upd

* more typecheck fixes

* more typecheck fixes

* fix

* fix isRequired

* f1x

* move skill sidebar

* refactor skill sidebar

* add collapse file tree button

* upd

* upd

* upd

* deleteSkillFile

* upd

* deleteSkillFile

* fileId

* fileId

* upd schemas

* DeleteSkillFileConfirmation

* updateSkillFile

* rm simplematter from sdk

* Get Skill File

* getSkillFileById

* add new skill file page

* update skill file editor

* format

* Create Skill File

* upd

* createSkillFileAction

* createSkillFileById

* fix: Make OpenTelemetry startup idempotent (#2684)

* fix: Make OpenTelemetry startup idempotent

* fix: Re-export defaultSDK and cache NodeSDK instance on globalThis

Restores the export on defaultSDK to avoid breaking the
create-agents-template subpath import. Moves the new NodeSDK()
construction behind a globalThis guard (getOrCreateSDK) so
repeated Vite HMR module evaluations reuse the same instance instead
of leaking fresh SDK objects.

Co-authored-by: mike-inkeep <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(template): use idempotent startOpenTelemetrySDK() in instrumentation

* fix: guard all OTel singletons behind globalThis for HMR idempotency

- Cache otlpExporter, batchProcessor, resource, instrumentations,
  spanProcessors, contextManager, and propagator on globalThis via
  Symbol keys and getOrCreate* helpers so HMR re-evaluation reuses
  existing instances instead of leaking new ones
- Make OtelGlobal type strict with per-key types, eliminating the
  loose `boolean | NodeSDK` union and the `as NodeSDK` cast
- Add logger.debug in the MetricReader catch block to distinguish
  clean idempotency from error-recovery idempotency
- Remove defaultSDK export (now module-private) since all consumers
  use startOpenTelemetrySDK() instead

* Fix type errors

* Simplify to just suppress the error since it's not an issue in prod, only local

* Limit to dev mode

* Add changeset for OTel HMR idempotency fix

Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
Co-authored-by: Dimitri POSTOLOV <[email protected]>

* Fix scheduled trigger invocations being skipped (#2777)

* Fix scheduled trigger invocations being skipped when trigger is edited without changing the next execution time

* claude comments

* adding app id (#2779)

* Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps (#2780)

* Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps

- Update pullfrog from v0.0.178 SHA to v0.0.181 SHA (30d68e5) to stay on
  commit-pinned references for security (action has write permissions + 9 API keys)
- Split dependabot github-actions config into a "high-frequency" group for
  pullfrog with daily schedule, so SHA pins get updated automatically
- This supersedes PR #2757's approach of moving to mutable tag references

https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap

* Split dependabot github-actions into daily (pullfrog) and monthly (rest)

Separate into two ecosystem entries so pullfrog gets daily SHA updates
while other GitHub Actions stay on a monthly cadence.

https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap

* Fix invalid dependabot config: merge duplicate github-actions entries

Dependabot disallows duplicate ecosystem+directory pairs. Use a single
entry with two groups instead: high-frequency (pullfrog) and github-actions
(everything else via exclude-patterns).

https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap

---------

Co-authored-by: Claude <[email protected]>

* ci: seed preview auth in PR previews (#2775)

* ci: bootstrap preview auth

* ci: require secure preview auth config

* ci: recover preview auth runtime vars

* ci: install railway in preview bootstrap

* ci: provision preview db tcp proxies

* ci: proxy preview spicedb bootstrap

* ci: harden preview retry and error logging

---------

Co-authored-by: Andrew Mikofalvy <[email protected]>

* Fix scopes placeholder to show correct Nango format (#2784)

* Fix misleading scopes placeholder in credential form

The Nango API validates scopes against a strict comma-separated pattern
with no spaces. Updated placeholder and help text to show the correct
format and prevent 400 errors when users enter multiple scopes.

Made-with: Cursor

* Add changeset for scopes placeholder fix

Made-with: Cursor

* fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup (#2776)

* fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup

Reorder Zod schema construction so custom validators (e.g. URL protocol
allowlist) are chained after required/optional base schema instead of
being overwritten. Move all React hooks above the canEdit early-return
guard to satisfy Rules of Hooks, with canEdit checks inside hook bodies.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(manage-ui): add server-side URL protocol validation in buildCredentialsPayload

Validate app_link against HTTP/HTTPS allowlist in the server action to
prevent bypassing client-side form validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Update agents-manage-ui/src/components/credentials/views/generic-auth-form.tsx

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* fix err

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* feat(pdf): Support PDF attachments (#2709)

* feat(pdf): Support PDF attachments

* Add tests and other review feedback

* Fix doc

* More renaming and cleanup

* refactor: extract Vercel content part schemas to types/chat.ts for reuse

Move inline Zod schemas from chatDataStream.ts and message-parts.ts into
types/chat.ts as shared, exported schemas. This eliminates duplicate
definitions and makes schema management easier.

Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

* feat: Composio connected account ID pinning (#2783)

* feat: Composio connected account ID pinning

Pin connected_account_id to Composio MCP URLs to prevent cross-project
credential leakage. Implements "both or none" policy — user_id and
connected_account_id are injected together or not at all.

- Add ComposioCredentialStore for credential lifecycle management
- Update AgentMcpManager and discoverToolsFromServer with pinning logic
- Mark Composio tools without connectedAccountId as needs_auth
- Add generic disconnect credential UI (works for all credential types)
- Store authScheme in credential retrievalParams for display
- Update OAuth login flow to create credential references post-connect
- Add unit tests for new credential store, composio client, and pinning

Made-with: Cursor

* feedback

* fix test

* Version Packages (#2778)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Add back link to projects sidebar, add org settings link to user drop… (#2787)

* Add back link to projects sidebar, add org settings link to user dropdown, adjust sidebar highlight color in dark mode

* Apply suggestion from @claude[bot]

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Fix bad claude formatting

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* fix: return FileUIPart-compliant file parts from /run conversations endpoint (#2782)

* fix: return Vercel AI SDK FileUIPart-compliant file parts from /run conversations endpoint

- Resolve blob:// URIs to proxy HTTP URLs via resolveMessagesListBlobUris()
- Reshape file parts from { data, metadata.mimeType } to { url, mediaType, filename? }
- Matches Vercel AI SDK FileUIPart spec for useChat() compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Skip malformed file parts

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Mike Rashkovsky <[email protected]>

* fix: treat load_skill as internal tool to suppress false-positive Sentry errors (#2756)

* fix: provide relationshipId for load_skill tool calls in graph events

* fix: treat load_skill as internal tool, suppress chat/graph streaming events

* fix for fetch trace (#2791)

* fix for fetch trace

* fix for fetch trace

* Fix empty breadcrumb on `/[tenantId]/profile` page and replace prop-drilled permission flags (`readOnly`, `canEdit`, `canUse`) with direct hook call `useProjectPermissionsQuery()` (#2792)

* upd

* upd

* format

* format

* format

* format

* format

* format

* format

* format

* format

* fix review

* fix breadcrumb on profile page

* Apply suggestions from code review

Co-authored-by: Dimitri POSTOLOV <[email protected]>

* Update agents-manage-ui/src/lib/api/projects.ts

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Update agents-manage-ui/src/app/[tenantId]/profile/layout.tsx

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* style: auto-format with biome

* fix review

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(manage-ui): fix user-scoped MCP credential card not refreshing after connect/disconnect (#2794)

Fetch user-scoped credential server-side in page.tsx (matching the
project-scoped pattern) instead of via a client-side React Query hook.
This ensures router.refresh() after OAuth connect or credential delete
re-fetches the credential data, so the "Your Connection" card updates
without a manual page refresh.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect` (#2793)

* reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect`

* format

* upd

* fix lint

* Create little-hounds-battle.md

* upd

* upd skill file editor

* polish skill editor like in github

* remove canEdit

* upd

* move skill metadata under collapsible advanced section

* reuse DeleteConfirmation

* upd skill file editor

* upd skill file editor2

* add useInitialCollapsedSidebar

* add useInitialCollapsedSidebar

* upd skill file editor

* rm

* // Avoid including metadata in the frontmatter when it's null

* fetchSkillFile and createSkillFile

* refactor skill breadcrumb

* format

* polish

* upd

* skills integration tests

* fix validation tests

* update skill form

* upd api skills in manage ui

* upd entities

* partial

* fix skill loader test

* chore: update OpenAPI snapshot

* polish skill file editor

* upd core skills tests

* upd core skills tests

* add SkillCreateDataSchema

* update skills data manage

* remove redundant

* lint

* lint

* typecheck

* typecheck

* typecheck

* knip

* lint

* rollback skill modals

* make modal opens in skill selector

* fix typecheck

* this should fix cypress

* fix sdk tests

* split permissions call

* add folder feature

* findNodeByPath

* SkillDirectoryBrowser

* upd

* polish

* fix

* fix edge case metadata validation

* fix

* fix cli test

* format

* upd

* upd

* upd

* chore: update OpenAPI snapshot

* fix skill generator

* add button group

* connect submit logic with extension select

* polish

* update skill generator tests

* update generation test

* polish skill generator

* format

* format

* fixes for tests

* typecheck

* fix review

* format

* new migration

* upd

* rm migration

* add migrations

* fix migration and add

* rm outdated

* Apply suggestions from code review

Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* polish

* Add detailed changeset for nested skill files feature

Co-authored-by: Dimitri POSTOLOV <[email protected]>

* fix typecheck

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Gaurav Varma <[email protected]>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: mike-inkeep <[email protected]>
Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-authored-by: shagun-singh-inkeep <[email protected]>
Co-authored-by: omar-inkeep <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Andrew Mikofalvy <[email protected]>
Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com>
Co-authored-by: sarah <[email protected]>
Co-authored-by: Abraham <[email protected]>
Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants