changes to instrumentation by shagun-singh-inkeep · Pull Request #150 · inkeep/agents

shagun-singh-inkeep · 2025-09-12T22:12:55Z

No description provided.

vercel · 2025-09-12T22:12:59Z

Project	Deployment	Preview	Comments	Updated (UTC)
agents-docs	Ready	Preview	Comment	Sep 12, 2025 10:20pm
agents-manage-api	Ready	Preview	Comment	Sep 12, 2025 10:20pm
agents-manage-ui	Error			Sep 12, 2025 10:20pm
agents-run-api	Error			Sep 12, 2025 10:20pm

claude · 2025-09-12T22:13:59Z

Claude finished @shagun-singh-inkeep's task

PR Review: Instrumentation Changes

Todo List:

Analyze changed files and understand the scope of instrumentation changes
Review package.json dependency additions
Review instrumentation.ts implementation changes
Check for consistency with existing codebase patterns
Provide comprehensive engineering and architecture feedback

Summary

This PR refactors the OpenTelemetry instrumentation setup in agents-run-api with several significant architectural changes. The changes appear to simplify configuration and align with standard OpenTelemetry patterns, but there are several important considerations.

Key Changes Analyzed

✅ Positive Changes

Export Format Switch: Changed from @opentelemetry/exporter-trace-otlp-proto to @opentelemetry/exporter-trace-otlp-http
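- Both packages send OTLP over HTTP; the otlp-http exporter encodes payloads as JSON, which is easier to inspect and debug than the protobuf encoding used by otlp-proto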
- Good alignment with standard practices
Resource Configuration: Properly using resourceFromAttributes with semantic conventions
- Correctly implements ATTR_SERVICE_NAME from semantic conventions
- More robust than the previous direct serviceName property
Import Consolidation: Improved import structure with better organization

⚠️ Critical Architecture Concerns

1. Configuration Removal Without Context (HIGH PRIORITY)

// REMOVED: Manual URL configuration
const otlpUrl = env.OTEL_EXPORTER_OTLP_ENDPOINT;
const otlpExporter = new OTLPTraceExporter({ url: otlpUrl });

// NEW: Default configuration only
const otlpExporter = new OTLPTraceExporter();

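Issue: The exporter no longer receives an explicit url. With no arguments, OTLPTraceExporter falls back to the standard OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and OTEL_EXPORTER_OTLP_ENDPOINT environment variables, and for the latter it appends the /v1/traces signal path, whereas an explicit url is used as-is. Existing deployments that rely on custom endpoints (e.g., different Jaeger instances, SigNoz, external observability platforms) could break if their OTEL_EXPORTER_OTLP_ENDPOINT value already contains the full traces path.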

Recommendation: Either:

Restore the conditional URL configuration: new OTLPTraceExporter(env.OTEL_EXPORTER_OTLP_ENDPOINT ? { url: env.OTEL_EXPORTER_OTLP_ENDPOINT } : {})
Document that this is an intentional breaking change with migration guidance
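
To make the endpoint concern concrete, the sketch below models how an OTLP/HTTP trace URL is resolved under the OpenTelemetry exporter specification. It is an illustration only, not the exporter's source code; resolveTracesUrl and Env are hypothetical names introduced here.

```typescript
// Illustrative model of the OTLP/HTTP endpoint rules from the OpenTelemetry
// exporter specification. Hypothetical helper, not the library's implementation.
type Env = Record<string, string | undefined>;

const DEFAULT_TRACES_URL = 'http://localhost:4318/v1/traces';

function resolveTracesUrl(env: Env, explicitUrl?: string): string {
  // An explicit `url` option is used exactly as given.
  if (explicitUrl) return explicitUrl;

  // The signal-specific variable is also used as-is.
  const tracesEndpoint = env['OTEL_EXPORTER_OTLP_TRACES_ENDPOINT'];
  if (tracesEndpoint) return tracesEndpoint;

  // The generic variable is a base URL: the signal path is appended to it.
  const baseEndpoint = env['OTEL_EXPORTER_OTLP_ENDPOINT'];
  if (baseEndpoint) return `${baseEndpoint.replace(/\/+$/, '')}/v1/traces`;

  // Nothing configured: fall back to the specification's default.
  return DEFAULT_TRACES_URL;
}
```

Under this model, a deployment that previously stored the full traces URL in OTEL_EXPORTER_OTLP_ENDPOINT and passed it as url would end up with the /v1/traces path appended a second time once the explicit option is removed.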

2. Export Scope Reduction (MEDIUM PRIORITY)

// REMOVED exports
export const sdk = ...;
export { spanProcessor };

// NOW: No exports, just immediate execution

Issue: The previous exports allowed other parts of the system to:

Access the SDK for custom configuration
Force flush spans on shutdown via spanProcessor.forceFlush()
Implement graceful shutdown patterns

Impact: This could affect application lifecycle management, especially in containerized deployments where graceful shutdown is critical for trace data integrity.

3. Missing Environment Variable Schema (LOW-MEDIUM PRIORITY)

The env.ts file doesn't include OTEL_EXPORTER_OTLP_ENDPOINT in the schema, but the previous implementation expected it. This suggests either:

The environment variable was never properly validated
It should be added to the schema for completeness

🔧 Implementation Quality Issues

4. BatchSpanProcessor Import Change

// Changed from:
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
// To:
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

This is correct - @opentelemetry/sdk-trace-base is the proper location for BatchSpanProcessor.

5. FanOutSpanProcessor Type Safety

class FanOutSpanProcessor {
  constructor(private inner: any[]) {} // Using 'any' type
  // ... remaining methods elided
}

Issue: The any type reduces type safety. Consider using proper OpenTelemetry interfaces:

import type { SpanProcessor } from '@opentelemetry/sdk-trace-base';

class FanOutSpanProcessor implements SpanProcessor {
  constructor(private inner: SpanProcessor[]) {}
  // ... onStart, onEnd, forceFlush, shutdown elided
}
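
A fuller version of the typed fan-out pattern might look like the sketch below. To stay dependency-free it declares a local SpanProcessorLike interface as a stand-in for SpanProcessor from @opentelemetry/sdk-trace-base, reduced to the four methods used here; the real interface has stricter span and context types, and the PR's actual class body is not shown in this review.

```typescript
// Local stand-in for the OpenTelemetry SpanProcessor interface (assumption:
// only these four methods matter for the fan-out pattern).
interface SpanProcessorLike {
  onStart(span: unknown, parentContext: unknown): void;
  onEnd(span: unknown): void;
  forceFlush(): Promise<void>;
  shutdown(): Promise<void>;
}

class FanOutSpanProcessor implements SpanProcessorLike {
  private readonly inner: SpanProcessorLike[];

  constructor(inner: SpanProcessorLike[]) {
    this.inner = inner;
  }

  // Forward every lifecycle event to each wrapped processor, in order.
  onStart(span: unknown, parentContext: unknown): void {
    for (const processor of this.inner) processor.onStart(span, parentContext);
  }

  onEnd(span: unknown): void {
    for (const processor of this.inner) processor.onEnd(span);
  }

  // Flush and shut down all wrapped processors concurrently.
  async forceFlush(): Promise<void> {
    await Promise.all(this.inner.map((processor) => processor.forceFlush()));
  }

  async shutdown(): Promise<void> {
    await Promise.all(this.inner.map((processor) => processor.shutdown()));
  }
}
```

With the array typed this way, passing an object that lacks forceFlush or shutdown fails at compile time instead of at process exit.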

🏗️ System-Level Considerations

6. Observability Strategy Alignment

The changes suggest a move toward more standardized OpenTelemetry configuration
However, without a PR description, it's unclear whether this aligns with the broader observability strategy
Consider documenting the rationale for these changes

7. Testing Implications

The removal of exports makes unit testing of instrumentation components more difficult
Consider whether test utilities need to be updated to accommodate these changes

Recommendations

Immediate Actions (Before Merge)

Restore Configuration Flexibility:

const otlpExporter = new OTLPTraceExporter(
  env.OTEL_EXPORTER_OTLP_ENDPOINT ? { url: env.OTEL_EXPORTER_OTLP_ENDPOINT } : {}
);

Add Environment Variable to Schema:

// In env.ts
OTEL_EXPORTER_OTLP_ENDPOINT: z.string().optional(),

Consider Graceful Shutdown Support:

// Export for lifecycle management
export const shutdownInstrumentation = () => sdk.shutdown();
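
One possible shape for the shutdown export is an idempotent wrapper, sketched below. Shutdownable and createShutdownHandler are hypothetical names; the only assumption about the SDK object is that it exposes shutdown() returning a promise, as the snippet above implies.

```typescript
// Minimal contract assumed of the SDK object: a promise-returning shutdown().
interface Shutdownable {
  shutdown(): Promise<void>;
}

function createShutdownHandler(sdk: Shutdownable): () => Promise<void> {
  let pending: Promise<void> | undefined;
  return () => {
    // Reuse the first call's promise so repeated signals (for example SIGTERM
    // followed by SIGINT) do not trigger a second shutdown.
    if (!pending) {
      pending = sdk.shutdown();
    }
    return pending;
  };
}
```

The returned function could be exported as shutdownInstrumentation and awaited from the process's signal handlers before exit, so buffered spans are flushed exactly once.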

Medium-Term Improvements

Improve Type Safety: Replace any types with proper OpenTelemetry interfaces
Add Integration Tests: Ensure instrumentation works end-to-end with different configurations
Documentation: Add inline comments explaining the FanOutSpanProcessor pattern

Customer Impact Assessment

Breaking Change Risk: HIGH - Removal of custom endpoint configuration
Migration Path: Document how to configure OTEL endpoints via environment variables
Observability: Changes should improve trace naming consistency

Final Assessment

While the refactoring improves code organization and follows OpenTelemetry best practices, the removal of configuration flexibility introduces significant deployment risk. The changes appear technically sound but need careful consideration of existing production usage patterns.

Recommendation: Address the configuration concerns before merging, particularly the OTLP endpoint configuration.

claude · 2025-09-12T22:19:12Z


* Fix Public Mirror Sync: auto-approve sync PRs + handle stuck PRs

Problem observed today: every push to main for the last ~18 hours failed the Public Mirror Sync workflow with

    Failed to push to inkeep/agents [HEAD:refs/heads/copybara/sync], because local/origin history is behind destination

Root cause is two-part:

1. inkeep/agents `main` branch protection was tightened to require one approval on every PR (require_last_push_approval=true). Sync PRs authored by the inkeep-oss-sync app are never approved, so they sit BLOCKED with auto-merge enabled but no approver. Bypass actor status on the ruleset does NOT propagate through GitHub's auto-merge path.
2. Once one sync PR is stuck open, copybara on every subsequent agents-private push tries to push new commits to `copybara/sync` but the branch has diverging history from our local generation, so the push is rejected. The existing cleanup step explicitly leaves the branch alone while a sync PR is open, so the workflow wedges indefinitely.

Fix:

- Add `public/agents/.github/workflows/auto-approve-sync.yml`. Copybara mirrors files under `public/agents/**` into inkeep/agents, so this workflow ends up at `.github/workflows/auto-approve-sync.yml` on the public repo. It listens on pull_request_target for PRs authored by inkeep-oss-sync[bot] against the copybara/sync head ref and posts an approving review from github-actions[bot]. Handles the synchronize event too so the approval is refreshed after every copybara push (require_last_push_approval dismisses prior approvals on new commits).
- Rework the "Clean up stale copybara/sync branch" step into a proper reconciliation step. Three cases now handled:
  (a) previous sync PR merged (squashed): delete the stale branch
  (b) open sync PR older than STALE_PR_HOURS (default 2h): close the PR and delete the branch so this run opens a fresh one with all accumulated changes
  (c) open sync PR recent: leave as-is; copybara appends commits

Staff-engineer framing: (1) fixes the approval gap at the source (inkeep/agents) so no human is in the loop for routine syncs, while (2) ensures the agents-private sync workflow self-heals if the approval path ever breaks again (bot outage, policy tightening, etc.) instead of silently wedging.

Bootstrap: PR #3150 (the currently-stuck sync PR) was manually approved and enqueued to unblock today. Once this PR merges to agents-private and the next sync lands the auto-approve workflow on inkeep/agents, subsequent syncs will no longer need manual approval.

* Address review: harden reconcile error handling, elevate warnings, add timeout

Pullfrog + Claude bot review on #150:

- Separate "gh pr list failed" from "no open PR found" in the reconcile step. A silent API failure previously fell through to the branch-existence block and could delete the branch of a live PR. Now we exit early with ::warning:: and skip reconciliation for this run (Copybara may still succeed if the branch state is clean).
- Remove the `exit 0` after closing a stale PR so execution falls through to the branch-existence fallback. If the DELETE of copybara/sync fails on the primary path, the fallback block gets a second attempt.
- Elevate all cleanup/failure `|| echo` fallbacks to ::warning:: annotations so operators see them in the workflow summary instead of buried in stdout.
- Replace stale comment on the `gh pr merge` call. The oss-sync app's bypass-actor status does NOT propagate through auto-merge (that's the whole reason auto-approve-sync.yml now exists). Document the real mechanism.
- Add `timeout-minutes: 5` to the approve job in auto-approve-sync.yml.

GitOrigin-RevId: 9dcc4f2f123f531a6c303cb8f43a4d096384c736
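
The reconciliation cases described in this commit message can be summarized as a small decision function. The sketch below is a TypeScript model of that logic only; the real step is a shell script in the workflow, and SyncState, ReconcileAction, and decideReconcile are hypothetical names. Whether a PR exactly at the threshold counts as stale is an assumption here (strictly older).

```typescript
type ReconcileAction =
  | 'skip-reconcile' // PR lookup failed: touch nothing this run
  | 'close-pr-and-delete-branch' // case (b): open sync PR is stale
  | 'leave-open' // case (c): open sync PR is recent
  | 'delete-branch' // case (a): no open PR but the branch remains
  | 'nothing';

interface SyncState {
  prLookupFailed: boolean;
  openPrCreatedAt?: Date; // undefined when no open sync PR exists
  branchExists: boolean;
}

function decideReconcile(state: SyncState, now: Date, staleHours: number = 2): ReconcileAction {
  // An API failure must not be confused with "no open PR found".
  if (state.prLookupFailed) return 'skip-reconcile';

  if (state.openPrCreatedAt) {
    const ageHours = (now.getTime() - state.openPrCreatedAt.getTime()) / 3_600_000;
    return ageHours > staleHours ? 'close-pr-and-delete-branch' : 'leave-open';
  }

  // No open PR: a leftover branch is stale (previous PR merged or closed).
  return state.branchExists ? 'delete-branch' : 'nothing';
}
```

Keeping the lookup-failure branch first mirrors the review fix above: a failed query never reaches the branch-deletion path.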

…#171)

* fix: approve sync PRs from source side, drop target-repo auto-approve

Moves copybara sync PR approval into public-mirror-sync.yml on agents-private using the INTERNAL_CI_APP token. Deletes the target-repo auto-approve-sync.yml workflow it replaces.

Background

PR #150 introduced auto-approve-sync.yml on the public tree to satisfy the required-approval rule that blocks copybara sync PRs on inkeep/agents. That design has a bootstrap problem: pull_request_target evaluates workflows from the target branch, so the very PR that first mirrors auto-approve-sync.yml to inkeep/agents/main cannot auto-approve itself. inkeep/agents PR #3157 is stuck in exactly that state, and because copybara/sync is blocked by that stuck PR, subsequent mirror runs on agents-private are failing at "Run agents mirror".

Fix

Approve from the source side instead. After public-mirror-sync.yml creates or updates the sync PR and enables auto-merge, a new step mints a second app token from INTERNAL_CI_APP (a different identity from inkeep-oss-sync, since GitHub blocks self-approval) and posts an approving review. The approval is scoped to the current head SHA and re-runs on every sync, which satisfies require_last_push_approval=true.

- .github/workflows/public-mirror-sync.yml
  Adds Generate approver token step and Approve sync PR step. Skips re-approval when an approval already stands on the current head SHA.
- public/agents/.github/workflows/auto-approve-sync.yml
  Deleted; replaced by the source-side step above.

Prereq to verify once: INTERNAL_CI_APP is installed on inkeep/agents with Pull requests: Read and write. The app already powers Version Packages and release-handler paths that write to inkeep/agents, so the install is almost certainly already in place.

* docs: update CI.md and runbook to reflect source-side sync approval

Review feedback on #171: CI.md and CI_RUNBOOK.md still pointed at the deleted public/agents/.github/workflows/auto-approve-sync.yml.

- CI.md: drop the row for auto-approve-sync.yml in the public workflows table; in the public/private parity table, replace the old row with one explaining approval now comes from public-mirror-sync.yml on the source side; update the OSS_SYNC_APP secrets row to drop "auto-approve sync" and add Copybara sync PR approval to INTERNAL_CI_APP usage.
- CI_RUNBOOK.md: rewrite the "Sync PR on public repo is stuck open" entry to point operators at the Approve sync PR step and INTERNAL_CI_APP app installation instead of the deleted workflow.

GitOrigin-RevId: 51edfa6a3a1e0863ba99d8498efad88c5dec1f06
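
The "skip re-approval when an approval already stands on the current head SHA" check could be expressed roughly as below. This is a sketch under assumptions: the Review shape is a trimmed, camel-cased view of what the GitHub pull request reviews API returns (state and commit_id), and needsApproval is a hypothetical helper, not the workflow's actual step.

```typescript
// Trimmed view of a pull request review (assumed mapping: commitId <- commit_id).
interface Review {
  state: string; // e.g. 'APPROVED', 'CHANGES_REQUESTED', 'DISMISSED'
  commitId: string; // commit the review was submitted against
}

function needsApproval(reviews: Review[], headSha: string): boolean {
  // An approval only counts if it targets the current head commit; approvals
  // on earlier commits are dismissed under require_last_push_approval.
  const approvedOnHead = reviews.some(
    (review) => review.state === 'APPROVED' && review.commitId === headSha,
  );
  return !approvedOnHead;
}
```

Scoping the check to the head SHA is what lets the step re-run safely on every sync without stacking duplicate approvals.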

* Delete duplicate Vercel workflows from public/agents (inkeep#152)

These three files duplicated agents-private's root-level workflows:

    public/agents/.github/workflows/vercel-production.yml
    public/agents/.github/workflows/preview-environments.yml
    public/agents/.github/workflows/preview-janitor.yml

Vercel projects (agents-api, agents-manage-ui, agents-docs) are owned and deployed from agents-private. Running duplicate workflows on the copybara-mirrored inkeep/agents repo means every sync fires BOTH repos' workflows against the same Vercel projects: wasted CI minutes at best, race-condition deploys at worst (both repos carry VERCEL_PROJECT_ID secrets targeting the same projects).

The agents-private root-level replacements are:

    .github/workflows/public-agents-vercel-production.yml
    .github/workflows/public-agents-preview-environments.yml
    .github/workflows/public-agents-preview-janitor.yml

Under the new release-split architecture (PR inkeep#144), the public mirror's only CI responsibility is the npm publish step in release.yml. Deleting these three keeps that scope clean.

Copybara will propagate the deletions to inkeep/agents on next sync: the origin_files glob covers public/agents/** with no exclude for these filenames, so their disappearance from the source will delete them from the destination.

GitOrigin-RevId: 2fa1f228c728dadf17bedbae88d776a11a8e127b

* Fix Public Mirror Sync: auto-approve sync PRs + handle stuck PRs (inkeep#150)

* Fix Public Mirror Sync: auto-approve sync PRs + handle stuck PRs

* Address review: harden reconcile error handling, elevate warnings, add timeout

GitOrigin-RevId: 9dcc4f2f123f531a6c303cb8f43a4d096384c736

* fix(create-agents-template): narrow postinstall VERCEL guard to opt-in env var (inkeep#140)

Same class of bug as inkeep#137 (chat-to-edit postinstall), but shipped in the customer-facing template. Every downstream user who scaffolds this template and deploys to Vercel hits the same ambient postinstall landmine.

Background: create-agents-template/scripts/postinstall.ts ran

    if (process.env.VERCEL === '1' && !skip) {
      execSync('inkeep dev --export --output-dir ./apps/manage-ui', ...)
      execSync('pnpm -C apps/manage-ui install ...', ...)
    }

The gate `VERCEL === '1'` is true on every Vercel build anywhere. Today the template only has one Vercel target (apps/agents-api), so the gate happened to correlate with the intended context. But:

1. Any customer who adds a second Vercel project targeting a different app (e.g. apps/mcp) silently fires the export, which assumes ./apps/manage-ui exists and `inkeep` is on PATH - neither is guaranteed for non-agents-api deploys.
2. Any Vercel deploy that installs with --production skips devDeps. `@inkeep/agents-cli` (which provides the `inkeep` CLI) is a devDep, so the postinstall would crash.
3. Replicates the exact failure mode that broke inkeep-cloud-mcp production in the main monorepo (inkeep#137).

Fix: narrow to an explicit opt-in env var INKEEP_QUICKSTART_EXPORT=1 set only by the apps/agents-api vercel.json installCommand. The postinstall requires BOTH VERCEL=1 AND INKEEP_QUICKSTART_EXPORT=1 before running.

Files:

- scripts/postinstall.ts: require INKEEP_QUICKSTART_EXPORT=1 gate
- apps/agents-api/vercel.json: prepend INKEEP_QUICKSTART_EXPORT=1 to existing installCommand

Backward-compat impact: existing Vercel projects customers created from prior template versions will skip the export on their next deploy until they update their vercel.json (or re-scaffold from the template). They have two paths:

- Pull in this template change (agents-api/vercel.json now carries the var)
- Manually add INKEEP_QUICKSTART_EXPORT=1 to their installCommand

Both recover cleanly. Existing apps/manage-ui exports from prior deploys remain on disk; the only loss is the refresh on the next deploy, which is repaired by either of the above actions.

GitOrigin-RevId: a5fc5646e7fb2ae2c6f81cdc34453ed89d8191f9

* chore(ci): clean up stale monorepo-migration artifacts (inkeep#157)

* chore(ci): clean up stale monorepo-migration artifacts

- Delete public/agents/.github/workflows/coverage.yml.disabled. The workflow references package paths (execution-api, management-api, agent-builder) that no longer exist post-monorepo-migration, so it couldn't be re-enabled as-is. Also delete the companion doc public/agents/agents-docs/CI_SETUP.md, which was framed entirely around this workflow. Drop the stale row in public/agents/.agents/skills/internal-surface-areas/SKILL.md.
- Add *.disabled to the copybara manifest exclude list so any future dead workflow files don't silently leak into the public mirror. Regenerated public-agents.bara.sky from the manifest.
- Remove the dead chat-to-edit branch trigger from both public/agents/.github/workflows/release.yml and .github/workflows/public-agents-version-packages.yml. The branch returns 404 on both inkeep/agents-private and inkeep/agents.

Out of scope (investigated, confirmed load-bearing): the agents-cli/** and agents-api/** path filters on public release.yml stay - both exist as top-level dirs on the public inkeep/agents repo (hoisted layout) and removing them would skip npm publishes for CLI/API-only changes.

* Revert chat-to-edit branch removal (load-bearing)

chat-to-edit branch on public inkeep/agents drives a dedicated snapshot publish step at public/agents/.github/workflows/release.yml lines 109-117 (pnpm changeset version --snapshot chat-to-edit). The branch returning 404 on both repos just means nobody has pushed to it right now; it's created on-demand for dev snapshot publishes. Keep the trigger in both release.yml and public-agents-version-packages.yml.

GitOrigin-RevId: 5de85ee7359c4033fa2d3b7cd69910c57e1070ac

* fix(preview): wire SigNoz to PR preview environments (inkeep#158)

* fix(preview): wire SigNoz to PR preview environments

Preview PR deployments had broken tracing: the manage-ui /api/traces proxy dropped the bypass secret so SigNoz health checks 401'd, and neither the Vercel projects nor agents-api had SIGNOZ_URL / otel exporter config. Adds a shared-observability Railway stack (signoz + otel-collector + clickhouse) and injects the endpoints + JWT-format API key at preview env provisioning time.

- upsert-vercel-preview-env.sh: accept PREVIEW_SIGNOZ_URL / PREVIEW_SIGNOZ_API_KEY / PREVIEW_SIGNOZ_INGESTION_KEY / PREVIEW_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and upsert into both Vercel projects when set.
- public-agents-preview-environments.yml: forward the new secrets into the upsert step.
- agents-api signoz.ts: support JWT-format SIGNOZ_API_KEY via Authorization: Bearer (the shared preview instance exposes a rotating refresh token, not a PAT). Production PATs still go through the SIGNOZ-API-KEY header.
- agents-manage-ui /api/traces: forward the manage-api bypass secret so SigNoz proxy calls authenticate through to agents-api.

* fix(preview): export OTLP traces without requiring ingestion key

Self-hosted SigNoz collector accepts unauthenticated OTLP, so gating OTEL_EXPORTER_OTLP_TRACES_ENDPOINT on PREVIEW_SIGNOZ_INGESTION_KEY was skipping trace export entirely. Split the guards, also wire OTLP env vars on the manage-ui project (which emits traces too) and add OTEL_RESOURCE_ATTRIBUTES so preview spans are filterable by PR branch.

* refactor: address PR review polish items

- signoz.ts: document why the 'eyJ' prefix safely discriminates JWT refresh tokens from SigNoz PATs (preview-stack workaround for broken v0.119 enterprise PAT endpoint).
- traces/route.ts GET handler: reuse extractRequestContext instead of duplicating cookie + bypass-secret header construction.

GitOrigin-RevId: 942b6298a47afca1c49c3ae06fb10bce04f4f27f

* Move entitlement lock query to DAL layer (inkeep#113)

* Move entitlement lock query to DAL, remove /auth/ exclusion from boundary check

* style: auto-format with biome

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

GitOrigin-RevId: 6afc2828e2b2bd136280daece631469d44fafe39

* feat(apps): add quick actions editor to support_copilot config (inkeep#164)

* feat(apps): add quick actions editor to support_copilot config

Adds quickActions schema to agents-core, persists the field through the apps PATCH route, and introduces a drag-and-drop quick actions section in the support_copilot app form.

* style: auto-format with biome

* fix(apps): address review feedback on support_copilot quick actions

- Add max length constraints (label: 100, prompt: 4000, group: 100) in both the agents-core schema and the manage-ui form validation.
- Disable the "Add group" button when the last group has no name, so users can't stack empty groups.
- Add screen-reader announcements to both the group-level and action-level DndContexts.
- Fix pre-existing a11y lint errors (noLabelWithoutControl, noStaticElementInteractions) by switching the action popover to FormLabel and moving keyboard shortcuts onto the inputs themselves.

* chore: sync public/agents lockfile for @dnd-kit deps

The initial commit only regenerated the root pnpm-lock.yaml. Vercel builds from public/agents/ which uses its own lockfile, so the @dnd-kit/* deps added to agents-manage-ui/package.json must be present there too or Vercel fails with ERR_PNPM_OUTDATED_LOCKFILE.

* chore(apps): drop unused SupportCopilotQuickActionFormInput type

Only the group-level type is consumed (app-update-form.tsx). The single-action form type was exported but never imported, flagged by knip in CI.

* test(api): regenerate openapi snapshot for SupportCopilotQuickAction schemas Adds the new SupportCopilotQuickAction and SupportCopilotQuickActionGroup schema entries (with min/max length constraints) and the quickActions field on SupportCopilotConfig that were introduced for the quick-actions editor in agents-manage-ui. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> GitOrigin-RevId: 32bce4fd36e8dd4ae12a8ab2ac3607ea29fdd2f1 * Agent conversation self-reference ({{$conversation.id}}) (inkeep#141) * spec: agent conversation self-reference Add SPEC.md with β-pure design (template-variable-only) for exposing {{$conversation.id}} as a prompt template variable. Includes evidence files (render-site inventory, A2A propagation trace, prior-art analysis, downstream-surfaces analysis, template-engine details), audit + design-challenge findings, and full changelog capturing the β→α→β-pure pivot history with rationale. Implementation follows in subsequent commits per tmp/ship/spec.json (US-001 through US-009). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * [US-001] Add TemplateEngine.renderPrompt() with PromptRenderOptions New static method resolves $-prefixed paths from runtime-provided builtins via dotted-path walk, falling through to the existing built-in dispatch on miss. render() signature and behavior are preserved byte-exact — only renderPrompt() accepts runtimeBuiltins, structurally enforcing the scope invariant that non-prompt render sites cannot opt into runtime builtins. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * [US-002] Wire {{$conversation.id}} resolution in buildSystemPrompt Switch both agent-prompt render sites in buildSystemPrompt from TemplateEngine.render() to TemplateEngine.renderPrompt() and narrow the contextConfigId guard so agents without contextConfig and without {{$conversation.*}} in their prompt see zero behavior change (G4). 
    - Extend runtimeContext type literal to include metadata.taskId
    - Compute runtimeBuiltins from runtimeContext.metadata.conversationId, treating the 'default' sentinel and undefined as empty
    - Narrow !resolvedContext path: only render when prompt contains {{$conversation., with preserveUnresolved=true so non-$conversation literals are preserved
    - Add 6 integration tests covering contextConfig present/absent, G4 zero-drift regression, and 'default'/undefined sentinel cases
    Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
  * [US-003] A2A delegation propagation test: child resolves {{$conversation.id}} to parent's conversationId
    Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
  * [US-004] Negative-path tests: {{$conversation.id}} does not resolve at non-prompt render sites (D6)
    Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
  * [US-005] Recognize {{\$conversation.id}} in Monaco prompt editor lint + autocomplete
    Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
  * [US-006] Preserve \$-prefixed template variables through pull-v4 round-trip
    Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
  * [US-007] Customer-support cookbook demonstrates {{\$conversation.id}} in Zendesk write-back
    Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
  * [US-008] Docs: central template-variable reference page
    Adds public/agents/agents-docs/content/typescript-sdk/prompt-template-variables.mdx documenting all four prompt template variables: contextVariable.*, headers.*, $env.*, and $conversation.id. Registers the new page in typescript-sdk/meta.json navigation.
    Per SPEC §6 (Must) and §13 Deployment considerations: first centralization of this vocabulary. Promotes $env.* (currently undocumented) alongside the new $conversation.id variable so D3's "mirrors $env.* convention" rationale is grounded in published docs rather than invisible convention.
    Spec: public/agents/specs/2026-04-16-agent-conversation-self-reference/SPEC.md
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
  * [US-009] Changesets: minor bumps for {{\$conversation.id}}
    Adds four changesets per SPEC §13 / §16 SCOPE:
    - agents-core: new TemplateEngine.renderPrompt() + PromptRenderOptions
    - agents-api: resolves {{\$conversation.id}} in agent prompts; A2A propagation
    - agents-cli: preserves \$-prefixed template variables through pull-v4 round-trip
    - agents-manage-ui: Monaco prompt editor lint + autocomplete recognition
    All four bump minor. Changeset config version-locks @inkeep/* packages together via "fixed" rule. agents-docs and cookbook-templates are in the ignore list; no changeset needed for them.
    Spec: public/agents/specs/2026-04-16-agent-conversation-self-reference/SPEC.md
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
  * docs: trim prompt-template-variables.mdx to customer-facing variables only
    Drops the {{\$env.*}} section from the new reference page and shortens the overall page from 150 → ~65 lines. Page now covers three variables: contextVariable.*, headers.*, and \$conversation.*, linking out to the existing context-fetchers.mdx and headers.mdx pages for the first two and fully documenting {{\$conversation.id}} inline.
    Rationale: \$env.* has no traceable spec, PRD, or introducing PR: it shipped in the root commit of inkeep/agents (c39fdd0, 2025-09-05, initial squash) as pre-public scaffolding and was preserved without explanation by inkeep#818 (2025-10-24) when the other builtins were removed. Zero cookbook templates use it; zero customer-facing docs ever mentioned it. The draft docs page would have been the first act of promoting it to a customer feature, without the security/privacy review that warrants. Rendered env values are visible to the LLM and can flow through to tool arguments, assistant output, and traces.
    Keep the undocumented-but-functional status quo; address \$env.* product surface in a separate spec if warranted.
    Spec updates:
    - §6 Functional: "Must: docs page" rewritten: three variables, justification restated as single discovery surface + primary home for {{\$conversation.id}}
    - §9 Proposed solution: Docs bullet rewritten to match
    - §10 Decision log: adds D8 (LOCKED): intentional omission of \$env.* with full git-archaeology rationale
    - §16 SCOPE: agents-docs entry names exact file path and meta.json nav step
    - meta/_changelog.md: appends post-approval pivot entry explaining the trigger (PR inkeep#141 review), evidence, decision, and surgical edits
    D3 remains LOCKED on its own merits: dollar-prefix still prevents syntactic collision with user-defined contextVariable names regardless of whether \$env.* is a documented convention.
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
  ---------
  Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
  GitOrigin-RevId: 52d0831b4f6e0c7790597c979f1711a5b1a4cd9c

* Version Packages (agents) (inkeep#138)
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  GitOrigin-RevId: 967594b0c065f561e566bc905512e45385cf7080

* fix: approve sync PRs from source side, drop target-repo auto-approve (inkeep#171)
  * fix: approve sync PRs from source side, drop target-repo auto-approve
    Moves copybara sync PR approval into public-mirror-sync.yml on agents-private using the INTERNAL_CI_APP token. Deletes the target-repo auto-approve-sync.yml workflow it replaces.
    Background
    PR inkeep#150 introduced auto-approve-sync.yml on the public tree to satisfy the required-approval rule that blocks copybara sync PRs on inkeep/agents. That design has a bootstrap problem: pull_request_target evaluates workflows from the target branch, so the very PR that first mirrors auto-approve-sync.yml to inkeep/agents/main cannot auto-approve itself.
    inkeep/agents PR inkeep#3157 is stuck in exactly that state, and because copybara/sync is blocked by that stuck PR, subsequent mirror runs on agents-private are failing at "Run agents mirror".
    Fix
    Approve from the source side instead. After public-mirror-sync.yml creates or updates the sync PR and enables auto-merge, a new step mints a second app token from INTERNAL_CI_APP (a different identity from inkeep-oss-sync, since GitHub blocks self-approval) and posts an approving review. The approval is scoped to the current head SHA and re-runs on every sync, which satisfies require_last_push_approval=true.
    - .github/workflows/public-mirror-sync.yml
      Adds Generate approver token step and Approve sync PR step. Skips re-approval when an approval already stands on the current head SHA.
    - public/agents/.github/workflows/auto-approve-sync.yml
      Deleted; replaced by the source-side step above.
    Prereq to verify once: INTERNAL_CI_APP is installed on inkeep/agents with Pull requests: Read and write. The app already powers Version Packages and release-handler paths that write to inkeep/agents, so the install is almost certainly already in place.
  * docs: update CI.md and runbook to reflect source-side sync approval
    Review feedback on inkeep#171: CI.md and CI_RUNBOOK.md still pointed at the deleted public/agents/.github/workflows/auto-approve-sync.yml.
    - CI.md: drop the row for auto-approve-sync.yml in the public workflows table; in the public/private parity table, replace the old row with one explaining approval now comes from public-mirror-sync.yml on the source side; update the OSS_SYNC_APP secrets row to drop "auto-approve sync" and add Copybara sync PR approval to INTERNAL_CI_APP usage.
    - CI_RUNBOOK.md: rewrite the "Sync PR on public repo is stuck open" entry to point operators at the Approve sync PR step and INTERNAL_CI_APP app installation instead of the deleted workflow.
  GitOrigin-RevId: 51edfa6a3a1e0863ba99d8498efad88c5dec1f06

---------
Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: omar-inkeep <[email protected]>
Co-authored-by: tim-inkeep <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com>
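
For readers skimming the [US-001] and [US-002] entries in the commit message above, here is a minimal TypeScript sketch of the described behavior: a dotted-path walk over runtime-provided builtins, with the 'default' sentinel and undefined treated as empty. It is not the agents-core implementation. The names `walkPath`, `builtinsFor`, and `renderPromptSketch` are hypothetical, and the sketch simply leaves unresolved placeholders untouched instead of falling through to the existing built-in dispatch.

```typescript
// Hypothetical illustration only; the real TemplateEngine.renderPrompt() may differ.
type RuntimeBuiltins = Record<string, unknown>;

// Walk a dotted path such as "$conversation.id" through nested objects.
function walkPath(root: RuntimeBuiltins, path: string): unknown {
  let current: unknown = root;
  for (const key of path.split('.')) {
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Build builtins from a conversation id; 'default' and undefined yield no builtins.
function builtinsFor(conversationId: string | undefined): RuntimeBuiltins {
  if (conversationId === undefined || conversationId === 'default') return {};
  return { $conversation: { id: conversationId } };
}

// Replace {{$dotted.path}} placeholders; anything that does not resolve is preserved as written.
function renderPromptSketch(template: string, builtins: RuntimeBuiltins): string {
  return template.replace(/\{\{\s*(\$[\w.]+)\s*\}\}/g, (match: string, path: string) => {
    const value = walkPath(builtins, path);
    return typeof value === 'string' ? value : match;
  });
}
```

With these assumptions, `renderPromptSketch('Ticket {{$conversation.id}}', builtinsFor('conv_123'))` substitutes the id, while a 'default' or missing conversation id leaves the placeholder in place, and non-`$` placeholders such as `{{headers.x}}` are never touched.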

changes

3ac891d

vercel Bot had a problem deploying to Preview – agents-run-api September 12, 2025 22:13 Failure

vercel Bot had a problem deploying to Preview – agents-manage-ui September 12, 2025 22:13 Failure

vercel Bot deployed to Preview – agents-manage-api September 12, 2025 22:13 View deployment

vercel Bot had a problem deploying to Preview – agents-run-api September 12, 2025 22:14 Failure

vercel Bot had a problem deploying to Preview – agents-manage-ui September 12, 2025 22:14 Failure

vercel Bot deployed to Preview – agents-manage-api September 12, 2025 22:14 View deployment

vercel Bot deployed to Preview – agents-docs September 12, 2025 22:15 View deployment

fix

17cb307

shagun-singh-inkeep force-pushed the instrumentfix branch from bab6262 to 17cb307 Compare September 12, 2025 22:17

vercel Bot temporarily deployed to Preview – agents-docs September 12, 2025 22:17 Inactive

robert-inkeep approved these changes Sep 12, 2025

View reviewed changes

vercel Bot had a problem deploying to Preview – agents-run-api September 12, 2025 22:18 Failure

vercel Bot had a problem deploying to Preview – agents-manage-ui September 12, 2025 22:18 Failure

vercel Bot deployed to Preview – agents-manage-api September 12, 2025 22:18 View deployment

Merge branch 'main' into instrumentfix

d71289b

vercel Bot had a problem deploying to Preview – agents-run-api September 12, 2025 22:19 Failure

vercel Bot had a problem deploying to Preview – agents-manage-ui September 12, 2025 22:19 Failure

vercel Bot deployed to Preview – agents-manage-api September 12, 2025 22:19 View deployment

vercel Bot deployed to Preview – agents-docs September 12, 2025 22:20 View deployment

robert-inkeep merged commit 85df6d7 into main Sep 12, 2025
4 of 7 checks passed

robert-inkeep deleted the instrumentfix branch September 12, 2025 22:21

claude Bot mentioned this pull request Sep 12, 2025

upgrade package.json #152

Merged

robert-inkeep merged 3 commits into main from instrumentfix


2 participants


claude Bot commented Sep 12, 2025 • edited

PR Review: Instrumentation Changes

Summary

Key Changes Analyzed

✅ Positive Changes

⚠️ Critical Architecture Concerns

1. Configuration Removal Without Context (HIGH PRIORITY)

2. Export Scope Reduction (MEDIUM PRIORITY)

3. Missing Environment Variable Schema (LOW-MEDIUM PRIORITY)

🔧 Implementation Quality Issues

4. BatchSpanProcessor Import Change

5. FanOutSpanProcessor Type Safety

🏗️ System-Level Considerations

6. Observability Strategy Alignment

7. Testing Implications

Recommendations

Immediate Actions (Before Merge)

Medium-Term Improvements

Customer Impact Assessment

Final Assessment
