fix: Make OpenTelemetry startup idempotent#2684
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
🦋 Changeset detectedLatest commit: 9c711d1 The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
There was a problem hiding this comment.
The idempotency approach is sound for the Vite HMR use case, but there's a breaking change that needs addressing before merge.
| }); | ||
|
|
||
| export const defaultSDK = new NodeSDK({ | ||
| const defaultSDK = new NodeSDK({ |
There was a problem hiding this comment.
Removing the export from defaultSDK is a breaking change. create-agents-template/apps/agents-api/src/instrumentation.ts:19 does:
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();@inkeep/agents-api/instrumentation is a declared subpath export in agents-api/package.json. The template (and any downstream consumer using that subpath) will break at runtime.
Either:
- Update the template in this PR to use
startOpenTelemetrySDK(), or - Continue exporting
defaultSDK(possibly deprecated) alongside the new function.
| const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted'); | ||
|
|
||
| export function startOpenTelemetrySDK(): void { | ||
| if ((globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY]) { | ||
| return; | ||
| } | ||
| try { | ||
| defaultSDK.start(); | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| } catch (error) { | ||
| const msg = error instanceof Error ? error.message : String(error); | ||
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| return; | ||
| } | ||
| throw error; | ||
| } | ||
| } |
There was a problem hiding this comment.
The new NodeSDK(...) constructor at line 85 still runs at module scope on every HMR re-evaluation — allocating fresh span processors, context managers, and instrumentations each time. The guard only prevents start() from firing twice, but the stale SDK objects from prior evaluations are leaked.
Consider moving the new NodeSDK(...) construction inside startOpenTelemetrySDK() behind the same guard, or caching the instance on globalThis so repeated module evaluations reuse it.
There was a problem hiding this comment.
PR Review Summary
(4) Total Issues | Risk: Medium
🟠⚠️ Major (2) 🟠⚠️
🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Breaking change: defaultSDK export removed but template still imports it
Issue: This PR removes the export keyword from defaultSDK (making it private), but create-agents-template still imports this export directly:
// create-agents-template/apps/agents-api/src/instrumentation.ts:19-20
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();Why: When users upgrade @inkeep/agents-api or scaffold a new project using the template, the code will fail at runtime with an import error. This is a breaking change that affects the template's compatibility with the package.
Fix: Update the template to use the new idempotent function:
// create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();Refs:
🟠 2) agents-api/src/instrumentation.ts:95-110 Missing test coverage for idempotency logic
Issue: The new startOpenTelemetrySDK() function has zero test coverage. It has three distinct code paths: (1) early return when already started, (2) successful SDK start with flag set, and (3) error handling for MetricReader binding errors — none of which are tested.
Why: Without tests, future refactoring could break the HMR safety mechanism, causing the original 500 error to resurface. The codebase already has a test pattern for this exact scenario in socket-mode.test.ts:
// packages/agents-work-apps/src/__tests__/slack/socket-mode.test.ts:121
it('should skip if client already exists (HMR safety)', async () => {
(globalThis as Record<string, unknown>).__inkeep_slack_socket_mode_client__ = {};
// ...
expect(mockSocketModeStart).not.toHaveBeenCalled();
});Fix: Add a test file at agents-api/src/__tests__/instrumentation.test.ts with cases for:
should skip SDK start if already started (HMR safety)— verify early returnshould set global flag after successful start— verify flag is setshould catch MetricReader binding error and set flag— verify error suppressionshould rethrow non-MetricReader errors— verify other errors propagate
Refs:
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
instrumentation.ts:104-107Silent error suppression without logging
💭 Consider (2) 💭
💭 1) agents-api/src/instrumentation.ts:104 Brittle string matching for error detection
Issue: The error check msg.includes('MetricReader') && msg.includes('can not be bound') relies on specific error message text from @opentelemetry/sdk-node. If OpenTelemetry changes the error message wording in a future version, this check would fail.
Why: A dependency update could silently break the hot-reload fix. Since this is dev-only, it might not be caught until it disrupts developer workflow.
Fix: Add a version comment documenting the expected error format:
// Known error from @opentelemetry/sdk-node when SDK.start() is called multiple times.
// Verified against @opentelemetry/[email protected]. Update if error message changes.
if (msg.includes('MetricReader') && msg.includes('can not be bound')) {Refs:
💭 2) agents-api/src/instrumentation.ts:102-108 Catch block sets flag even when SDK may be in partial state
Issue: If the MetricReader error is thrown, it means defaultSDK.start() partially succeeded before failing. Setting the flag to true prevents future retries, but the SDK may be in an inconsistent state.
Why: In the HMR context this is fine because the previous SDK instance's tracers are still working. But it's worth noting that subsequent calls to shutdown or flush on this new instance would operate on a broken SDK.
Fix: This is acceptable as-is for the HMR use case. Consider adding a code comment documenting this edge case for future maintainers.
🚫 REQUEST CHANGES
Summary: The approach using Symbol.for() + globalThis for idempotency is correct and follows established patterns in this codebase (see Slack Socket Mode). However, two issues need addressing before merge: (1) the template breaking change — create-agents-template imports the now-private defaultSDK export, and (2) the fix lacks test coverage for a behavior-critical change. Both are straightforward to fix.
Discarded (1)
| Location | Issue | Reason Discarded |
|---|---|---|
agents-docs/content/guides/observability/langfuse-usage.mdx:50 |
Documentation shows old OTEL pattern | Pre-existing issue, docs show user-maintained code example, LOW confidence |
Reviewers (5)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-sre |
3 | 1 | 0 | 0 | 1 | 0 | 1 |
pr-review-tests |
2 | 1 | 1 | 0 | 0 | 0 | 0 |
pr-review-errors |
2 | 0 | 1 | 0 | 1 | 0 | 0 |
pr-review-precision |
2 | 1 | 1 | 0 | 0 | 0 | 0 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 9 | 3 | 3 | 0 | 2 | 0 | 1 |
Note: Findings were deduplicated across reviewers — template breaking change was flagged by both pr-review-sre and pr-review-precision; silent error suppression was flagged by both pr-review-sre and pr-review-errors.
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | ||
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | ||
| return; | ||
| } |
There was a problem hiding this comment.
🟡 Minor: Silent error suppression without logging
Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.
Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict.
Fix: Add a debug/info log to maintain observability:
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | |
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | |
| return; | |
| } | |
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | |
| logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization'); | |
| (globalThis as unknown as { [key: symbol]: boolean })[OTEL_STARTED_KEY] = true; | |
| return; | |
| } |
Refs:
- createSafeBatchProcessor logging pattern — existing fallback logging pattern
- Slack Socket Mode HMR logging — logs "HMR reload detected"
|
@claude add export back to defaultSDK move new NodeSDK() into startOpenTelemetrySDK behind the same guard, or cache the instance on globalThis so repeated module evaluations reuse it |
|
Claude finished @mike-inkeep's task in 1m 43s —— View job Changes implemented
What changed in
|
Ito Test Report ✅19 test cases ran. 19 passed. ✅ The local dev stability run stayed healthy across API route checks, auth bootstrap flows, multi-tab/mobile stress, and adversarial auth requests. Code-change logs only show non-product test-run configuration tweaks (dev server host/overlay indicators), and no test output provided a code-supported defect signal, so all executed cases are reported as passed. ✅ Passed (19)
📋 View Recording |
|
@claude re-review please |
d5f8ac1 to
2c8ea6c
Compare
There was a problem hiding this comment.
The idempotency guard for NodeSDK.start() is sound and correctly solves the Vite HMR crash. Two items worth addressing: (1) module-level singletons (defaultBatchProcessor, otlpExporter, etc.) are re-created on every reload but the cached SDK still references the originals — flushBatchProcessor() will silently flush the wrong instance; (2) the OtelGlobal type could be tighter to avoid the as NodeSDK cast.
| const OTEL_SDK_KEY = Symbol.for('@inkeep/agents-api.otelSDK'); | ||
| const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted'); | ||
|
|
||
| type OtelGlobal = { [key: symbol]: boolean | NodeSDK }; |
There was a problem hiding this comment.
The OtelGlobal type allows assigning boolean to OTEL_SDK_KEY or NodeSDK to OTEL_STARTED_KEY without error. Consider using a stricter interface with the specific symbol constants as keys:
type OtelGlobal = {
[OTEL_SDK_KEY]?: NodeSDK;
[OTEL_STARTED_KEY]?: boolean;
};This eliminates the as NodeSDK cast on line 101 too.
| const OTEL_SDK_KEY = Symbol.for('@inkeep/agents-api.otelSDK'); | ||
| const OTEL_STARTED_KEY = Symbol.for('@inkeep/agents-api.otelStarted'); | ||
|
|
||
| type OtelGlobal = { [key: symbol]: boolean | NodeSDK }; | ||
|
|
||
| function getOrCreateSDK(): NodeSDK { | ||
| const g = globalThis as unknown as OtelGlobal; | ||
| if (!g[OTEL_SDK_KEY]) { | ||
| g[OTEL_SDK_KEY] = new NodeSDK({ | ||
| resource: defaultResource, | ||
| contextManager: defaultContextManager, | ||
| textMapPropagator: defaultTextMapPropagator, | ||
| spanProcessors: defaultSpanProcessors, | ||
| instrumentations: defaultInstrumentations, | ||
| }); | ||
| } | ||
| return g[OTEL_SDK_KEY] as NodeSDK; | ||
| } | ||
|
|
||
| export const defaultSDK = getOrCreateSDK(); |
There was a problem hiding this comment.
The SDK instance is cached on globalThis, but the module-level objects it was constructed with (defaultBatchProcessor, defaultContextManager, otlpExporter, etc.) are re-created on every Vite HMR re-evaluation since they're not behind a similar guard. This means the exported defaultBatchProcessor (used in flushBatchProcessor) will be a new instance unrelated to the one inside the cached SDK — so flushBatchProcessor() will silently flush nothing. Consider guarding those module-level singletons the same way, or deriving them from the cached SDK.
|
Claude finished @mike-inkeep's task in 1m 12s —— View job Re-review of PR #2684
Overall: The follow-up commit addresses the two major issues from the first review (re-exporting Resolved from previous reviews
Remaining issues1. // create-agents-template/apps/agents-api/src/instrumentation.ts
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();2. Module-level singletons are still re-created on HMR —
This is a dev-only issue (HMR doesn't happen in prod), but it means spans produced after a hot reload may silently fail to flush. Consider either guarding these singletons behind 3. type OtelGlobal = {
[OTEL_SDK_KEY]?: NodeSDK;
[OTEL_STARTED_KEY]?: boolean;
};This also eliminates the 4. Silent error suppression in catch block (minor) — When the MetricReader error catch fires (line 116-118), there's no log output. Adding a Summary
Item 1 is the most impactful — template users won't benefit from this fix without that change. |
There was a problem hiding this comment.
PR Review Summary
(3) Total Issues | Risk: Low
🟠⚠️ Major (1) 🟠⚠️
🟠 1) create-agents-template/apps/agents-api/src/instrumentation.ts:19-20 Template still calls defaultSDK.start() directly — inconsistent idempotency
Issue: The template imports defaultSDK and calls .start() directly, bypassing the new idempotent startOpenTelemetrySDK() wrapper:
const { defaultSDK } = await import('@inkeep/agents-api/instrumentation');
defaultSDK.start();While commit 2c8ea6c0d correctly restored the defaultSDK export to avoid breaking this import, the template now has inconsistent behavior compared to agents-api/src/index.ts which uses the idempotent wrapper.
Why: The template represents a real usage pattern for users scaffolding new projects. If a template user runs Vite HMR, they will still experience the MetricReader binding error because defaultSDK.start() bypasses the idempotency guards in startOpenTelemetrySDK(). The fix was applied to the main entry point but not propagated to the template.
Fix: Update the template to use the idempotent wrapper:
const { startOpenTelemetrySDK } = await import('@inkeep/agents-api/instrumentation');
startOpenTelemetrySDK();Refs:
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
instrumentation.ts:116-118Silent error suppression without logging
🕐 Pending Recommendations (1)
- 🟠
instrumentation.ts:106-122Missing test coverage forstartOpenTelemetrySDK()— still not addressed from previous review
✅ What's Working Well
The second commit (2c8ea6c0d) successfully addressed the breaking change from the first review:
- ✅
defaultSDKexport restored — Template imports no longer break - ✅ SDK instance caching via
OTEL_SDK_KEYsymbol on globalThis — Fixes the HMR memory leak issue where new SDK objects were created on each module re-evaluation - ✅ Dual-guard pattern (flag check + error catch) — Appropriate belt-and-suspenders approach: flag guard handles the common case efficiently, error catch handles edge cases like race conditions or external code calling
defaultSDK.start()directly
💡 APPROVE WITH SUGGESTIONS
Summary: The core fix is sound — SDK instance caching and the idempotent startOpenTelemetrySDK() wrapper correctly address the HMR crash. The breaking change from the first review has been fixed. Two items remain from the previous review (test coverage, logging), and one new item surfaced (template not updated). None of these are blockers, but addressing the template inconsistency would ensure all consumers benefit from the fix.
Discarded (2)
| Location | Issue | Reason Discarded |
|---|---|---|
instrumentation.ts:25-79 |
Module-level OTEL objects still recreated on HMR | These objects don't cause the crash — they're a minor resource leak during dev. The fix correctly targets the MetricReader binding error which is the stated goal. |
instrumentation.ts:90-102 |
getOrCreateSDK() caching untested |
Lower priority — if caching breaks, the error catch in startOpenTelemetrySDK() still handles it. Defense-in-depth, not critical path. |
Reviewers (4)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-precision |
3 | 1 | 0 | 0 | 0 | 0 | 2 |
pr-review-tests |
2 | 0 | 0 | 0 | 0 | 1 | 1 |
pr-review-sre |
2 | 0 | 0 | 0 | 1 | 0 | 0 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 7 | 1 | 0 | 0 | 1 | 1 | 3 |
Note: Test coverage finding routed to Pending Recommendations as it was raised in the previous review and remains unaddressed.
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | ||
| g[OTEL_STARTED_KEY] = true; | ||
| return; |
There was a problem hiding this comment.
🟡 Minor: Silent error suppression without logging
Issue: When the MetricReader binding error is caught and suppressed, there's no logging to indicate this happened.
Why: This follows a different pattern than createSafeBatchProcessor() (line 37) which logs warnings for initialization failures. Without logging, debugging OTEL issues becomes harder — there's no trace that startup was skipped due to a binding conflict. The socket-mode HMR pattern logs 'Socket Mode client already running (HMR reload detected), skipping' for visibility.
Fix: Add debug-level logging for observability:
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | |
| g[OTEL_STARTED_KEY] = true; | |
| return; | |
| if (msg.includes('MetricReader') && msg.includes('can not be bound')) { | |
| logger.debug({}, 'OpenTelemetry SDK already started (MetricReader already bound), skipping re-initialization'); | |
| g[OTEL_STARTED_KEY] = true; | |
| return; | |
| } |
Refs:
- createSafeBatchProcessor logging pattern — existing fallback logging pattern
- Slack Socket Mode HMR logging — logs "HMR reload detected"
|
@claude please address the following. After addressing Remaining issues
// create-agents-template/apps/agents-api/src/instrumentation.ts
flushBatchProcessor() calls forceFlush() on a new defaultBatchProcessor that has no buffered spans — effectively a no-op after HMR
type OtelGlobal = {
|
|
Claude finished @mike-inkeep's task in 9m 31s —— View job Addressing remaining review items
To apply these changes, please re-trigger me on this PR or run manually: git fetch origin && git rebase origin/fix/hot_reload_metricreader_error && git pushChanges implemented (local commit
|
There was a problem hiding this comment.
The idempotency guard on startOpenTelemetrySDK() is correct and the error-catch is appropriately narrow. One concern: the SDK singleton on globalThis may hold stale references to module-level objects that get recreated during HMR — see inline comment.
| function getOrCreateSDK(): NodeSDK { | ||
| const g = globalThis as unknown as OtelGlobal; | ||
| if (!g[OTEL_SDK_KEY]) { | ||
| g[OTEL_SDK_KEY] = new NodeSDK({ | ||
| resource: defaultResource, | ||
| contextManager: defaultContextManager, | ||
| textMapPropagator: defaultTextMapPropagator, | ||
| spanProcessors: defaultSpanProcessors, | ||
| instrumentations: defaultInstrumentations, | ||
| }); | ||
| } | ||
| return g[OTEL_SDK_KEY] as NodeSDK; | ||
| } |
There was a problem hiding this comment.
The SDK instance is cached on globalThis, but defaultResource, defaultSpanProcessors, defaultContextManager, etc. are module-level constants that get recreated on each HMR re-evaluation. On the second load, getOrCreateSDK() returns the old cached SDK holding references to the first set of processors/context-manager, while the rest of the module exports the new ones.
This means flushBatchProcessor() (which calls defaultBatchProcessor.forceFlush() directly) would flush the new processor that the cached SDK doesn't actually use. Consider either (a) also caching the config objects on globalThis, or (b) deriving defaultBatchProcessor from the SDK instance rather than a separate module-level var.
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
✅ Delta Review: Simplified Approach is Sound
The last 2 commits (4b09d0f58, 62d658804) dramatically simplified the HMR idempotency fix:
What was removed (~149 lines):
- Complex
globalThiscaching with 9 Symbol keys - Type-safe
OtelGlobalinterface getOrCreate*()helpers for every singleton- Primary flag guard + fallback error catch
What replaced it (~17 lines):
export function startOpenTelemetrySDK(): void {
try {
defaultSDK.start();
} catch (error) {
const msg = error instanceof Error ? error.message : String(error);
const allowDuplicateStartErrors =
env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
if (
allowDuplicateStartErrors &&
((msg.includes('MetricReader') && msg.includes('can not be bound')) ||
msg.includes('Attempted duplicate registration of API'))
) {
logger.debug({}, 'OpenTelemetry SDK already started');
return;
}
throw error;
}
}Why This Works
-
Production safety preserved — The
allowDuplicateStartErrorscheck ensures errors are only suppressed in development mode. Production startup failures propagate correctly. -
Narrow error matching — Only suppresses two specific known HMR-related OpenTelemetry errors. Any other error (config issues, connection failures, etc.) is re-thrown.
-
Observability maintained —
logger.debug()call provides visibility when startup is skipped. -
Simplicity wins — The previous implementation was over-engineered for a dev-only problem. This approach is ~9x smaller and equally effective at fixing the HMR crash.
Known Tradeoff (Acknowledged)
The stale singleton concern (raised by pullfrog in prior reviews) — where flushBatchProcessor() may flush the wrong instance after HMR — was deliberately accepted by the PR author. The commit message explicitly states: "Simplify to just suppress the error since it's not an issue in prod, only local".
This is an acceptable tradeoff because:
- HMR only occurs in development
- Production telemetry is never affected
- The fix achieves its stated goal (prevent 500 errors requiring dev server restart)
✅ APPROVE
Summary: The simplified approach correctly fixes the Vite HMR crash with minimal code. The delta removes unnecessary complexity while preserving production safety. The known dev-only tradeoff (flushBatchProcessor() may no-op after HMR) is explicitly acknowledged and acceptable. Ship it! 🚀
Discarded (1)
| Location | Issue | Reason Discarded |
|---|---|---|
instrumentation.ts:42,112-118 |
Stale singleton regression — flushBatchProcessor() may flush wrong instance after HMR |
Deliberately addressed by author in commit 4b09d0f58: "not an issue in prod, only local". Dev-only tradeoff explicitly accepted. |
Reviewers (4)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-sre |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-precision |
1 | 0 | 0 | 0 | 0 | 0 | 1 |
pr-review-errors |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Note: The precision reviewer's finding about stale singletons was discarded because the PR author explicitly acknowledged and accepted this tradeoff in the commit message.
Ito Test Report ✅19 test cases ran. 19 passed. ✅ Verification found stable behavior across health/readiness checks, auth flows, navigation stress, and OTEL reload scenarios for the included cases. 🔍 Code review of the authentication/session and instrumentation paths did not surface a production-code defect for the non-passed stale-session case, so the included set contains only confirmed passes. ✅ Passed (19)
📋 View Recording |
dimaMachina
left a comment
There was a problem hiding this comment.
@claude add a changeset
|
Claude finished @dimaMachina's task in 54s —— View job Adding changeset
Added |
Co-authored-by: Dimitri POSTOLOV <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>
There was a problem hiding this comment.
The catch-and-suppress approach works as a quick fix, but it has two structural weaknesses: string-matching is fragile against upstream OTel SDK message changes, and module-level singletons (exporters, processors) are still re-instantiated on every HMR reload — leaking resources even when the error is caught. A globalThis sentinel guard would be more robust and avoids both issues. See inline comments.
| export function startOpenTelemetrySDK(): void { | ||
| try { | ||
| defaultSDK.start(); | ||
| } catch (error) { | ||
| const msg = error instanceof Error ? error.message : String(error); | ||
| const allowDuplicateStartErrors = | ||
| env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development'; | ||
| if ( | ||
| allowDuplicateStartErrors && | ||
| ((msg.includes('MetricReader') && msg.includes('can not be bound')) || | ||
| msg.includes('Attempted duplicate registration of API')) | ||
| ) { | ||
| logger.debug({}, 'OpenTelemetry SDK already started'); | ||
| return; | ||
| } | ||
| throw error; | ||
| } | ||
| } |
There was a problem hiding this comment.
The error-message string matching ('MetricReader', 'can not be bound', 'Attempted duplicate registration of API') is coupled to the exact wording chosen by the OpenTelemetry SDK. If an OTel SDK upgrade rephrases these messages, legitimate duplicate-start errors will slip through the guard and crash the dev server again.
Consider a more resilient approach — either:
- Guard with a
globalThissentinel sostart()is never called twice:
const OTEL_STARTED = Symbol.for('inkeep.otel.started');
export function startOpenTelemetrySDK(): void {
if ((globalThis as any)[OTEL_STARTED]) {
logger.debug({}, 'OpenTelemetry SDK already started (HMR)');
return;
}
defaultSDK.start();
(globalThis as any)[OTEL_STARTED] = true;
}- Or at minimum, broaden the catch to suppress any error in dev mode whose message mentions "already" or "duplicate" (still fragile, but less so).
Option 1 avoids string matching entirely and is the pattern recommended by OTel maintainers for singleton initialization.
| export function startOpenTelemetrySDK(): void { | ||
| try { | ||
| defaultSDK.start(); | ||
| } catch (error) { | ||
| const msg = error instanceof Error ? error.message : String(error); | ||
| const allowDuplicateStartErrors = | ||
| env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development'; | ||
| if ( | ||
| allowDuplicateStartErrors && | ||
| ((msg.includes('MetricReader') && msg.includes('can not be bound')) || | ||
| msg.includes('Attempted duplicate registration of API')) | ||
| ) { | ||
| logger.debug({}, 'OpenTelemetry SDK already started'); | ||
| return; | ||
| } | ||
| throw error; | ||
| } | ||
| } |
There was a problem hiding this comment.
Even when the error is caught, every Vite HMR reload re-executes the module top-level code, creating new OTLPTraceExporter, BatchSpanProcessor, AsyncLocalStorageContextManager, and NodeSDK instances. The old ones remain registered in the OTel global API — this silently leaks resources (open connections, timers, etc.).
If you adopt a globalThis guard, you can also stash the SDK and processors on globalThis so the module reuses the existing instances on HMR reload rather than creating duplicates.
Ito Test Report ✅19 test cases ran. 19 passed. All planned API, auth, navigation, resilience, and security-path checks completed successfully, and verification did not surface any code-level defect hypothesis requiring escalation to a Failed classification. 🔍✅ ✅ Passed (19)
📋 View Recording |
|
TL;DR — Wraps the OpenTelemetry SDK Key changes
Summary | 4 files | 9 commits | base:
|
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
✅ Delta Review: No New Issues
The delta since the last automated review (62d658804) consists of:
- Changeset addition (
93f033c95) — Correctly scoped@inkeep/agents-api: patchwith clear message - Merge with main (
9c711d1f5) — Integrates SSO configuration changes; no conflicts with PR scope
Implementation Summary (unchanged since last review)
The final implementation is clean and effective:
export function startOpenTelemetrySDK(): void {
try {
defaultSDK.start();
} catch (error) {
const msg = error instanceof Error ? error.message : String(error);
const allowDuplicateStartErrors =
env.ENVIRONMENT === 'development' || env.NODE_ENV === 'development';
if (
allowDuplicateStartErrors &&
((msg.includes('MetricReader') && msg.includes('can not be bound')) ||
msg.includes('Attempted duplicate registration of API'))
) {
logger.debug({}, 'OpenTelemetry SDK already started');
return;
}
throw error;
}
}Why this works:
- ✅ Production safety preserved — Errors only suppressed in development mode
- ✅ Narrow error matching — Only known HMR-related OTel errors are caught
- ✅ Observability maintained —
logger.debug()call for visibility - ✅ Template consistency — Both
agents-api/src/index.tsandcreate-agents-templateuse the idempotent wrapper - ✅ Changeset included — Proper patch version bump with descriptive message
Prior Review Feedback Status
| Issue | Status |
|---|---|
Breaking change: template imports removed defaultSDK export |
✅ Fixed — template updated to use startOpenTelemetrySDK() |
| Module-level singletons re-created on HMR | ✅ Acknowledged — dev-only tradeoff explicitly accepted |
| Silent error suppression | ✅ Fixed — logger.debug() added |
| Missing test coverage |
🕐 Pending Recommendations (1)
- 🟡 Previous review: Missing test coverage for
startOpenTelemetrySDK()— The function has three code paths (success, error suppression, error rethrow) with no unit tests. Not blocking, but would help prevent regressions.
✅ APPROVE
Summary: The implementation is sound, the changeset is properly formatted, and all blocking issues from prior reviews have been addressed. The simplified try-catch approach is an appropriate solution for a dev-only HMR problem. Ship it! 🚀
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator |
1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Total | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
Note: Delta review — no sub-agents dispatched as the only PR-specific change was a changeset addition.
Ito Test Report ❌15 test cases ran. 14 passed, 1 failed. This run shows strong overall stability for routing, auth endpoint hardening, stress handling, and startup guard behavior ✅. One verified regression remains in the template OTEL startup path: the template now calls a newer instrumentation API while still depending on an older ✅ Passed (14)
❌ Failed (1)
Template app still boots when OTEL is configured – Failed
📋 View Recording |
* fix: Make OpenTelemetry startup idempotent * fix: Re-export defaultSDK and cache NodeSDK instance on globalThis Restores the export on defaultSDK to avoid breaking the create-agents-template subpath import. Moves the new NodeSDK() construction behind a globalThis guard (getOrCreateSDK) so repeated Vite HMR module evaluations reuse the same instance instead of leaking fresh SDK objects. Co-authored-by: mike-inkeep <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(template): use idempotent startOpenTelemetrySDK() in instrumentation * fix: guard all OTel singletons behind globalThis for HMR idempotency - Cache otlpExporter, batchProcessor, resource, instrumentations, spanProcessors, contextManager, and propagator on globalThis via Symbol keys and getOrCreate* helpers so HMR re-evaluation reuses existing instances instead of leaking new ones - Make OtelGlobal type strict with per-key types, eliminating the loose `boolean | NodeSDK` union and the `as NodeSDK` cast - Add logger.debug in the MetricReader catch block to distinguish clean idempotency from error-recovery idempotency - Remove defaultSDK export (now module-private) since all consumers use startOpenTelemetrySDK() instead * Fix type errors * Simplify to just suppress the error since it's not an issue in prod, only local * Limit to dev mode * Add changeset for OTel HMR idempotency fix Co-authored-by: Dimitri POSTOLOV <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com> Co-authored-by: Dimitri POSTOLOV <[email protected]>
* skill generator * polish skill generator * skills tests * upd * upd * generation.test wip * add generation.test * tree node * skill page * skill loader * skill loader refactor * skill loader * move skills sidebar to layout * use pure monaco-editor component since we can have different file extension * add shadcn context menu component * format context menu * skills files and edit pages * dry * update layout * add docs * add a changeset * redirect to first skill * skill files utils * skill selector * upd treenode * skill files * skill file editor * delete skill confirmation * add skill files actions * skills data * rm * up skills route * upd * upd * better project error message on dev * types * skill files * skill loader * format * project test * entities * project full tests * upd introspect * upd cliiii * nested skills tests * remove edit page * remove edit page * update files page * upd * upd file editor * add SkillFileInsertSchema * superRefine * add transform * rm some cases in superRefine * use pipe * use pipe * upd skill loader * validation skills * upd * rm * upd * data access tests * skills db changes * add * skill files * upd * upd * upd skill update * SkillUpdateSchema has required files * upd skills manage * upd * upd layout and page * style: auto-format with biome * move empty state comp to page * upd schemas * update schemas * move to with-sidebar * polish * upd * upd skill generator * Make webhooks docs user friendly (#2752) * shaping a2a webhooks page * moved triggers to visual builder * vb webhooks wip * numbered TOC steps * added step circles * indented toc steps more * added newsletter signup to docs * added share feedback button * moved newsletter subscribe route to agent docs * subscribe confirm polish * improved spacing * improved spacing * added high quality images * added verification step * Sync lockfile after rebase * Use tag reference for pullfrog action instead of pinned SHA (#2757) The action is actively under development, so referencing the v0 tag allows picking up updates automatically. https://claude.ai/code/session_01QZyvEs97scVf1ahTG8C1rV Co-authored-by: Claude <[email protected]> * ci: provision PR preview environments in Railway (#2681) * ci: add preview env diagnostics * ci: probe preview env schema before deploy * ci: probe preview env schema before deploy * ci: harden preview api env defaults * ci: attach git metadata to preview deploys * ci: harden preview workflow operations * ci: broaden preview log redaction * ci: extract preview workflow scripts * ci: harden preview script extraction * fix(ci): correct Playwright cache restore-key prefix mismatch (#2760) The restore-keys used `${{ runner.os }}-playwright-` but primary keys used `playwright-${{ runner.os }}-`, so the prefix never matched on cache miss, forcing a full browser download (~8.5 min) instead of a cache restore (~13 sec). Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): replace full git clone with shallow checkout in CI job (#2761) Remove fetch-depth: 0 from the ci job's checkout step, which cloned the entire git history (1.5-5 min overhead). Only the OpenAPI change detection step needs the base branch ref, so fetch it on-demand with --depth=1. Also switches the diff from three-dot merge-base syntax to a two-dot pathspec-filtered diff against the fetched base ref, which works correctly with shallow clones. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * format * rm migration * add new migrations * validation for skill is ok * move empty state to page * delete skill * delete skill revalidate path * move skills schemas to own file * upd * upd * upd * upd * upd * upd * upd * upd * upd * upd * more typecheck fixes * more typecheck fixes * fix * fix isRequired * f1x * move skill sidebar * refactor skill sidebar * add collapse file tree button * upd * upd * upd * deleteSkillFile * upd * deleteSkillFile * fileId * fileId * upd schemas * DeleteSkillFileConfirmation * updateSkillFile * rm simplematter from sdk * Get Skill File * getSkillFileById * add new skill file page * update skill file editor * format * Create Skill File * upd * createSkillFileAction * createSkillFileById * fix: Make OpenTelemetry startup idempotent (#2684) * fix: Make OpenTelemetry startup idempotent * fix: Re-export defaultSDK and cache NodeSDK instance on globalThis Restores the export on defaultSDK to avoid breaking the create-agents-template subpath import. Moves the new NodeSDK() construction behind a globalThis guard (getOrCreateSDK) so repeated Vite HMR module evaluations reuse the same instance instead of leaking fresh SDK objects. Co-authored-by: mike-inkeep <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(template): use idempotent startOpenTelemetrySDK() in instrumentation * fix: guard all OTel singletons behind globalThis for HMR idempotency - Cache otlpExporter, batchProcessor, resource, instrumentations, spanProcessors, contextManager, and propagator on globalThis via Symbol keys and getOrCreate* helpers so HMR re-evaluation reuses existing instances instead of leaking new ones - Make OtelGlobal type strict with per-key types, eliminating the loose `boolean | NodeSDK` union and the `as NodeSDK` cast - Add logger.debug in the MetricReader catch block to distinguish clean idempotency from error-recovery idempotency - Remove defaultSDK export (now module-private) since all consumers use startOpenTelemetrySDK() instead * Fix type errors * Simplify to just suppress the error since it's not an issue in prod, only local * Limit to dev mode * Add changeset for OTel HMR idempotency fix Co-authored-by: Dimitri POSTOLOV <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com> Co-authored-by: Dimitri POSTOLOV <[email protected]> * Fix scheduled trigger invocations being skipped (#2777) * Fix scheduled trigger invocations being skipped when trigger is edited without changing the next execution time * claude comments * adding app id (#2779) * Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps (#2780) * Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps - Update pullfrog from v0.0.178 SHA to v0.0.181 SHA (30d68e5) to stay on commit-pinned references for security (action has write permissions + 9 API keys) - Split dependabot github-actions config into a "high-frequency" group for pullfrog with daily schedule, so SHA pins get updated automatically - This supersedes PR #2757's approach of moving to mutable tag references https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap * Split dependabot github-actions into daily (pullfrog) and monthly (rest) Separate into two ecosystem entries so pullfrog gets daily SHA updates while other GitHub Actions stay on a monthly cadence. https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap * Fix invalid dependabot config: merge duplicate github-actions entries Dependabot disallows duplicate ecosystem+directory pairs. Use a single entry with two groups instead: high-frequency (pullfrog) and github-actions (everything else via exclude-patterns). https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap --------- Co-authored-by: Claude <[email protected]> * ci: seed preview auth in PR previews (#2775) * ci: bootstrap preview auth * ci: require secure preview auth config * ci: recover preview auth runtime vars * ci: install railway in preview bootstrap * ci: provision preview db tcp proxies * ci: proxy preview spicedb bootstrap * ci: harden preview retry and error logging --------- Co-authored-by: Andrew Mikofalvy <[email protected]> * Fix scopes placeholder to show correct Nango format (#2784) * Fix misleading scopes placeholder in credential form The Nango API validates scopes against a strict comma-separated pattern with no spaces. Updated placeholder and help text to show the correct format and prevent 400 errors when users enter multiple scopes. Made-with: Cursor * Add changeset for scopes placeholder fix Made-with: Cursor * fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup (#2776) * fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup Reorder Zod schema construction so custom validators (e.g. URL protocol allowlist) are chained after required/optional base schema instead of being overwritten. Move all React hooks above the canEdit early-return guard to satisfy Rules of Hooks, with canEdit checks inside hook bodies. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(manage-ui): add server-side URL protocol validation in buildCredentialsPayload Validate app_link against HTTP/HTTPS allowlist in the server action to prevent bypassing client-side form validation. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * Update agents-manage-ui/src/components/credentials/views/generic-auth-form.tsx Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * fix err --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * feat(pdf): Support PDF attachments (#2709) * feat(pdf): Support PDF attachments * Add tests and other review feedback * Fix doc * More renaming and cleanup * refactor: extract Vercel content part schemas to types/chat.ts for reuse Move inline Zod schemas from chatDataStream.ts and message-parts.ts into types/chat.ts as shared, exported schemas. This eliminates duplicate definitions and makes schema management easier. Co-authored-by: Andrew Mikofalvy <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Andrew Mikofalvy <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * feat: Composio connected account ID pinning (#2783) * feat: Composio connected account ID pinning Pin connected_account_id to Composio MCP URLs to prevent cross-project credential leakage. Implements "both or none" policy — user_id and connected_account_id are injected together or not at all. - Add ComposioCredentialStore for credential lifecycle management - Update AgentMcpManager and discoverToolsFromServer with pinning logic - Mark Composio tools without connectedAccountId as needs_auth - Add generic disconnect credential UI (works for all credential types) - Store authScheme in credential retrievalParams for display - Update OAuth login flow to create credential references post-connect - Add unit tests for new credential store, composio client, and pinning Made-with: Cursor * feedback * fix test * Version Packages (#2778) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Add back link to projects sidebar, add org settings link to user drop… (#2787) * Add back link to projects sidebar, add org settings link to user dropdown, adjust sidebar highlight color in dark mode * Apply suggestion from @claude[bot] Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * Fix bad claude formatting --------- Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * fix: return FileUIPart-compliant file parts from /run conversations endpoint (#2782) * fix: return Vercel AI SDK FileUIPart-compliant file parts from /run conversations endpoint - Resolve blob:// URIs to proxy HTTP URLs via resolveMessagesListBlobUris() - Reshape file parts from { data, metadata.mimeType } to { url, mediaType, filename? } - Matches Vercel AI SDK FileUIPart spec for useChat() compatibility Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * Skip malformed file parts --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: Mike Rashkovsky <[email protected]> * fix: treat load_skill as internal tool to suppress false-positive Sentry errors (#2756) * fix: provide relationshipId for load_skill tool calls in graph events * fix: treat load_skill as internal tool, suppress chat/graph streaming events * fix for fetch trace (#2791) * fix for fetch trace * fix for fetch trace * Fix empty breadcrumb on `/[tenantId]/profile` page and replace prop-drilled permission flags (`readOnly`, `canEdit`, `canUse`) with direct hook call `useProjectPermissionsQuery()` (#2792) * upd * upd * format * format * format * format * format * format * format * format * format * fix review * fix breadcrumb on profile page * Apply suggestions from code review Co-authored-by: Dimitri POSTOLOV <[email protected]> * Update agents-manage-ui/src/lib/api/projects.ts Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * Update agents-manage-ui/src/app/[tenantId]/profile/layout.tsx Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * style: auto-format with biome * fix review --------- Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(manage-ui): fix user-scoped MCP credential card not refreshing after connect/disconnect (#2794) Fetch user-scoped credential server-side in page.tsx (matching the project-scoped pattern) instead of via a client-side React Query hook. This ensures router.refresh() after OAuth connect or credential delete re-fetches the credential data, so the "Your Connection" card updates without a manual page refresh. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect` (#2793) * reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect` * format * upd * fix lint * Create little-hounds-battle.md * upd * upd skill file editor * polish skill editor like in github * remove canEdit * upd * move skill metadata under collapsible advanced section * reuse DeleteConfirmation * upd skill file editor * upd skill file editor2 * add useInitialCollapsedSidebar * add useInitialCollapsedSidebar * upd skill file editor * rm * // Avoid including metadata in the frontmatter when it's null * fetchSkillFile and createSkillFile * refactor skill breadcrumb * format * polish * upd * skills integration tests * fix validation tests * update skill form * upd api skills in manage ui * upd entities * partial * fix skill loader test * chore: update OpenAPI snapshot * polish skill file editor * upd core skills tests * upd core skills tests * add SkillCreateDataSchema * update skills data manage * remove redundant * lint * lint * typecheck * typecheck * typecheck * knip * lint * rollback skill modals * make modal opens in skill selector * fix typecheck * this should fix cypress * fix sdk tests * split permissions call * add folder feature * findNodeByPath * SkillDirectoryBrowser * upd * polish * fix * fix edge case metadata validation * fix * fix cli test * format * upd * upd * upd * chore: update OpenAPI snapshot * fix skill generator * add button group * connect submit logic with extension select * polish * update skill generator tests * update generation test * polish skill generator * format * format * fixes for tests * typecheck * fix review * format * new migration * upd * rm migration * add migrations * fix migration and add * rm outdated * Apply suggestions from code review Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * polish * Add detailed changeset for nested skill files feature Co-authored-by: Dimitri POSTOLOV <[email protected]> * fix typecheck --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Gaurav Varma <[email protected]> Co-authored-by: Andrew Mikofalvy <[email protected]> Co-authored-by: Claude <[email protected]> Co-authored-by: Varun Varahabhotla <[email protected]> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com> Co-authored-by: Dimitri POSTOLOV <[email protected]> Co-authored-by: shagun-singh-inkeep <[email protected]> Co-authored-by: omar-inkeep <[email protected]> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Co-authored-by: Andrew Mikofalvy <[email protected]> Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com> Co-authored-by: sarah <[email protected]> Co-authored-by: Abraham <[email protected]> Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
* skill generator * polish skill generator * skills tests * upd * upd * generation.test wip * add generation.test * tree node * skill page * skill loader * skill loader refactor * skill loader * move skills sidebar to layout * use pure monaco-editor component since we can have different file extension * add shadcn context menu component * format context menu * skills files and edit pages * dry * update layout * add docs * add a changeset * redirect to first skill * skill files utils * skill selector * upd treenode * skill files * skill file editor * delete skill confirmation * add skill files actions * skills data * rm * up skills route * upd * upd * better project error message on dev * types * skill files * skill loader * format * project test * entities * project full tests * upd introspect * upd cliiii * nested skills tests * remove edit page * remove edit page * update files page * upd * upd file editor * add SkillFileInsertSchema * superRefine * add transform * rm some cases in superRefine * use pipe * use pipe * upd skill loader * validation skills * upd * rm * upd * data access tests * skills db changes * add * skill files * upd * upd * upd skill update * SkillUpdateSchema has required files * upd skills manage * upd * upd layout and page * style: auto-format with biome * move empty state comp to page * upd schemas * update schemas * move to with-sidebar * polish * upd * upd skill generator * Make webhooks docs user friendly (#2752) * shaping a2a webhooks page * moved triggers to visual builder * vb webhooks wip * numbered TOC steps * added step circles * indented toc steps more * added newsletter signup to docs * added share feedback button * moved newsletter subscribe route to agent docs * subscribe confirm polish * improved spacing * improved spacing * added high quality images * added verification step * Sync lockfile after rebase * Use tag reference for pullfrog action instead of pinned SHA (#2757) The action is actively under development, so referencing the v0 tag allows picking up updates automatically. https://claude.ai/code/session_01QZyvEs97scVf1ahTG8C1rV Co-authored-by: Claude <[email protected]> * ci: provision PR preview environments in Railway (#2681) * ci: add preview env diagnostics * ci: probe preview env schema before deploy * ci: probe preview env schema before deploy * ci: harden preview api env defaults * ci: attach git metadata to preview deploys * ci: harden preview workflow operations * ci: broaden preview log redaction * ci: extract preview workflow scripts * ci: harden preview script extraction * fix(ci): correct Playwright cache restore-key prefix mismatch (#2760) The restore-keys used `${{ runner.os }}-playwright-` but primary keys used `playwright-${{ runner.os }}-`, so the prefix never matched on cache miss, forcing a full browser download (~8.5 min) instead of a cache restore (~13 sec). Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * fix(ci): replace full git clone with shallow checkout in CI job (#2761) Remove fetch-depth: 0 from the ci job's checkout step, which cloned the entire git history (1.5-5 min overhead). Only the OpenAPI change detection step needs the base branch ref, so fetch it on-demand with --depth=1. Also switches the diff from three-dot merge-base syntax to a two-dot pathspec-filtered diff against the fetched base ref, which works correctly with shallow clones. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * format * rm migration * add new migrations * validation for skill is ok * move empty state to page * delete skill * delete skill revalidate path * move skills schemas to own file * upd * upd * upd * upd * upd * upd * upd * upd * upd * upd * more typecheck fixes * more typecheck fixes * fix * fix isRequired * f1x * move skill sidebar * refactor skill sidebar * add collapse file tree button * upd * upd * upd * deleteSkillFile * upd * deleteSkillFile * fileId * fileId * upd schemas * DeleteSkillFileConfirmation * updateSkillFile * rm simplematter from sdk * Get Skill File * getSkillFileById * add new skill file page * update skill file editor * format * Create Skill File * upd * createSkillFileAction * createSkillFileById * fix: Make OpenTelemetry startup idempotent (#2684) * fix: Make OpenTelemetry startup idempotent * fix: Re-export defaultSDK and cache NodeSDK instance on globalThis Restores the export on defaultSDK to avoid breaking the create-agents-template subpath import. Moves the new NodeSDK() construction behind a globalThis guard (getOrCreateSDK) so repeated Vite HMR module evaluations reuse the same instance instead of leaking fresh SDK objects. Co-authored-by: mike-inkeep <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(template): use idempotent startOpenTelemetrySDK() in instrumentation * fix: guard all OTel singletons behind globalThis for HMR idempotency - Cache otlpExporter, batchProcessor, resource, instrumentations, spanProcessors, contextManager, and propagator on globalThis via Symbol keys and getOrCreate* helpers so HMR re-evaluation reuses existing instances instead of leaking new ones - Make OtelGlobal type strict with per-key types, eliminating the loose `boolean | NodeSDK` union and the `as NodeSDK` cast - Add logger.debug in the MetricReader catch block to distinguish clean idempotency from error-recovery idempotency - Remove defaultSDK export (now module-private) since all consumers use startOpenTelemetrySDK() instead * Fix type errors * Simplify to just suppress the error since it's not an issue in prod, only local * Limit to dev mode * Add changeset for OTel HMR idempotency fix Co-authored-by: Dimitri POSTOLOV <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com> Co-authored-by: Dimitri POSTOLOV <[email protected]> * Fix scheduled trigger invocations being skipped (#2777) * Fix scheduled trigger invocations being skipped when trigger is edited without changing the next execution time * claude comments * adding app id (#2779) * Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps (#2780) * Update pullfrog to latest SHA and add daily dependabot group for high-frequency deps - Update pullfrog from v0.0.178 SHA to v0.0.181 SHA (30d68e5) to stay on commit-pinned references for security (action has write permissions + 9 API keys) - Split dependabot github-actions config into a "high-frequency" group for pullfrog with daily schedule, so SHA pins get updated automatically - This supersedes PR #2757's approach of moving to mutable tag references https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap * Split dependabot github-actions into daily (pullfrog) and monthly (rest) Separate into two ecosystem entries so pullfrog gets daily SHA updates while other GitHub Actions stay on a monthly cadence. https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap * Fix invalid dependabot config: merge duplicate github-actions entries Dependabot disallows duplicate ecosystem+directory pairs. Use a single entry with two groups instead: high-frequency (pullfrog) and github-actions (everything else via exclude-patterns). https://claude.ai/code/session_01D3ZGYHG8VhsZwqjjXqy2Ap --------- Co-authored-by: Claude <[email protected]> * ci: seed preview auth in PR previews (#2775) * ci: bootstrap preview auth * ci: require secure preview auth config * ci: recover preview auth runtime vars * ci: install railway in preview bootstrap * ci: provision preview db tcp proxies * ci: proxy preview spicedb bootstrap * ci: harden preview retry and error logging --------- Co-authored-by: Andrew Mikofalvy <[email protected]> * Fix scopes placeholder to show correct Nango format (#2784) * Fix misleading scopes placeholder in credential form The Nango API validates scopes against a strict comma-separated pattern with no spaces. Updated placeholder and help text to show the correct format and prevent 400 errors when users enter multiple scopes. Made-with: Cursor * Add changeset for scopes placeholder fix Made-with: Cursor * fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup (#2776) * fix(manage-ui): fix URL validation bypass and permission guard in credential provider setup Reorder Zod schema construction so custom validators (e.g. URL protocol allowlist) are chained after required/optional base schema instead of being overwritten. Move all React hooks above the canEdit early-return guard to satisfy Rules of Hooks, with canEdit checks inside hook bodies. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(manage-ui): add server-side URL protocol validation in buildCredentialsPayload Validate app_link against HTTP/HTTPS allowlist in the server action to prevent bypassing client-side form validation. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * Update agents-manage-ui/src/components/credentials/views/generic-auth-form.tsx Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * fix err --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * feat(pdf): Support PDF attachments (#2709) * feat(pdf): Support PDF attachments * Add tests and other review feedback * Fix doc * More renaming and cleanup * refactor: extract Vercel content part schemas to types/chat.ts for reuse Move inline Zod schemas from chatDataStream.ts and message-parts.ts into types/chat.ts as shared, exported schemas. This eliminates duplicate definitions and makes schema management easier. Co-authored-by: Andrew Mikofalvy <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Andrew Mikofalvy <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * feat: Composio connected account ID pinning (#2783) * feat: Composio connected account ID pinning Pin connected_account_id to Composio MCP URLs to prevent cross-project credential leakage. Implements "both or none" policy — user_id and connected_account_id are injected together or not at all. - Add ComposioCredentialStore for credential lifecycle management - Update AgentMcpManager and discoverToolsFromServer with pinning logic - Mark Composio tools without connectedAccountId as needs_auth - Add generic disconnect credential UI (works for all credential types) - Store authScheme in credential retrievalParams for display - Update OAuth login flow to create credential references post-connect - Add unit tests for new credential store, composio client, and pinning Made-with: Cursor * feedback * fix test * Version Packages (#2778) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Add back link to projects sidebar, add org settings link to user drop… (#2787) * Add back link to projects sidebar, add org settings link to user dropdown, adjust sidebar highlight color in dark mode * Apply suggestion from @claude[bot] Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * Fix bad claude formatting --------- Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * fix: return FileUIPart-compliant file parts from /run conversations endpoint (#2782) * fix: return Vercel AI SDK FileUIPart-compliant file parts from /run conversations endpoint - Resolve blob:// URIs to proxy HTTP URLs via resolveMessagesListBlobUris() - Reshape file parts from { data, metadata.mimeType } to { url, mediaType, filename? } - Matches Vercel AI SDK FileUIPart spec for useChat() compatibility Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * Skip malformed file parts --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: Mike Rashkovsky <[email protected]> * fix: treat load_skill as internal tool to suppress false-positive Sentry errors (#2756) * fix: provide relationshipId for load_skill tool calls in graph events * fix: treat load_skill as internal tool, suppress chat/graph streaming events * fix for fetch trace (#2791) * fix for fetch trace * fix for fetch trace * Fix empty breadcrumb on `/[tenantId]/profile` page and replace prop-drilled permission flags (`readOnly`, `canEdit`, `canUse`) with direct hook call `useProjectPermissionsQuery()` (#2792) * upd * upd * format * format * format * format * format * format * format * format * format * fix review * fix breadcrumb on profile page * Apply suggestions from code review Co-authored-by: Dimitri POSTOLOV <[email protected]> * Update agents-manage-ui/src/lib/api/projects.ts Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * Update agents-manage-ui/src/app/[tenantId]/profile/layout.tsx Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * style: auto-format with biome * fix review --------- Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(manage-ui): fix user-scoped MCP credential card not refreshing after connect/disconnect (#2794) Fetch user-scoped credential server-side in page.tsx (matching the project-scoped pattern) instead of via a client-side React Query hook. This ensures router.refresh() after OAuth connect or credential delete re-fetches the credential data, so the "Your Connection" card updates without a manual page refresh. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect` (#2793) * reuse `useProjectsQuery` instead of `fetchProjectsAction` in `useEffect` * format * upd * fix lint * Create little-hounds-battle.md * upd * upd skill file editor * polish skill editor like in github * remove canEdit * upd * move skill metadata under collapsible advanced section * reuse DeleteConfirmation * upd skill file editor * upd skill file editor2 * add useInitialCollapsedSidebar * add useInitialCollapsedSidebar * upd skill file editor * rm * // Avoid including metadata in the frontmatter when it's null * fetchSkillFile and createSkillFile * refactor skill breadcrumb * format * polish * upd * skills integration tests * fix validation tests * update skill form * upd api skills in manage ui * upd entities * partial * fix skill loader test * chore: update OpenAPI snapshot * polish skill file editor * upd core skills tests * upd core skills tests * add SkillCreateDataSchema * update skills data manage * remove redundant * lint * lint * typecheck * typecheck * typecheck * knip * lint * rollback skill modals * make modal opens in skill selector * fix typecheck * this should fix cypress * fix sdk tests * split permissions call * add folder feature * findNodeByPath * SkillDirectoryBrowser * upd * polish * fix * fix edge case metadata validation * fix * fix cli test * format * upd * upd * upd * chore: update OpenAPI snapshot * fix skill generator * add button group * connect submit logic with extension select * polish * update skill generator tests * update generation test * polish skill generator * format * format * fixes for tests * typecheck * fix review * format * new migration * upd * rm migration * add migrations * fix migration and add * rm outdated * Apply suggestions from code review Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> * polish * Add detailed changeset for nested skill files feature Co-authored-by: Dimitri POSTOLOV <[email protected]> * fix typecheck --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Gaurav Varma <[email protected]> Co-authored-by: Andrew Mikofalvy <[email protected]> Co-authored-by: Claude <[email protected]> Co-authored-by: Varun Varahabhotla <[email protected]> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: mike-inkeep <[email protected]> Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com> Co-authored-by: Dimitri POSTOLOV <[email protected]> Co-authored-by: shagun-singh-inkeep <[email protected]> Co-authored-by: omar-inkeep <[email protected]> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Co-authored-by: Andrew Mikofalvy <[email protected]> Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com> Co-authored-by: sarah <[email protected]> Co-authored-by: Abraham <[email protected]> Co-authored-by: pullfrog[bot] <226033991+pullfrog[bot]@users.noreply.github.com>
Vite hot reload frequently tries to re-run the otel SDK startup, which is not idempotent and leads to a 500 error on all subsequent requests. This requires restarting
pnpm dev.The error shows up as:
10:04:53 AM [vite] Internal server error: MetricReader can not be bound to a MeterProvider again.This PR makes that operation idempotent by using a global var to keep track of whether otel was started up already.