Skip to content

Prd 6370 copy#3171

Closed
robert-inkeep wants to merge 14 commits intomainfrom
prd-6370-copy
Closed

Prd 6370 copy#3171
robert-inkeep wants to merge 14 commits intomainfrom
prd-6370-copy

Conversation

@robert-inkeep
Copy link
Copy Markdown
Collaborator

No description provided.

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Apr 22, 2026

⚠️ No Changeset found

Latest commit: b33cc46

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

The matching internal PR #203 has been merged in agents-private, and the latest non-dry-run mirror sync completed successfully.

Closing this public PR so the public repo stays aligned with the monorepo merge surface.

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Apr 22, 2026

PR #3171 — Prd 6370 copy

TL;DR

Introduces a centralized authentication and authorization layer for Next.js API routes in the manage UI. Previously, these routes blindly forwarded raw cookie/authorization headers to agents-api without verifying the caller's identity. Now, every route validates the session (or bearer token) before proceeding, and project-scoped routes additionally check fine-grained permissions.

Key Changes

  • New file: api-route-auth.ts — centralized auth utilities for Next.js API route handlers (session validation, bearer token support, project permission checks, cookie filtering)
  • New file: api-route-auth.test.ts — 199 lines of unit tests covering cookie filtering, session auth, bearer auth, fallback behavior, and project permission rejection
  • 5 route handlers updated to use the new auth guards instead of manually forwarding raw cookie/authorization headers
  • Cookie header sanitization — only better-auth.* cookies are forwarded to agents-api, stripping all other cookies

New centralized auth module (api-route-auth.ts)

A 256-line module providing three main exported functions:

Function Purpose
requireApiRouteSession Validates the caller has a valid Better Auth session via cookie. Filters cookies, calls agents-api /api/auth/get-session, returns a discriminated union (ok: true with session data or ok: false with a NextResponse).
requireApiRouteSessionOrBearer Accepts either a Bearer token (preferred, skips session fetch) or falls back to cookie-based session auth. Returns a headers record ready to spread into downstream fetch calls.
requireApiRouteProjectPermission Validates session first, then checks project-level permissions (canView/canUse/canEdit) by calling the agents-api permissions endpoint. Returns 403 when the user lacks the requested permission level.

Helper: filterAuthCookieHeader strips non-auth cookies from the raw Cookie header using the shared isAuthCookie predicate from @inkeep/agents-core/auth/cookie-names.

Before: Routes had no server-side auth validation — they forwarded whatever headers the browser sent.
After: Every route validates authentication before doing any work. Unauthenticated requests get a 401; unauthorized requests get a 403.


Artifact & data component generate-render routes now require project edit permission

Files: artifact-components/[artifactComponentId]/generate-render/route.ts, data-components/[dataComponentId]/generate-render/route.ts

Before: No auth check — any request with a tenantId/projectId in the body could trigger generation.
After: Calls requireApiRouteProjectPermission(request, { tenantId, projectId, level: 'edit' }) and short-circuits with the error response if the check fails.


Trace routes now require session or bearer auth with cleaner header forwarding

Files: traces/route.ts, traces/conversations/[conversationId]/route.ts, traces/spans/[spanId]/route.ts

Before: Each route manually extracted req.headers.get('cookie') and req.headers.get('authorization'), then conditionally set them on downstream fetch headers — no actual auth validation occurred.

After:

  1. Each route calls requireApiRouteSessionOrBearer(req) at the top and returns early on failure.
  2. The validated authResult.headers record (containing either Cookie or Authorization) is spread into downstream fetch headers, replacing the ad-hoc cookie/auth header forwarding.
  3. The signozQuery helper in the conversation route is simplified from (payload, tenantId, cookieHeader, authHeader) to (payload, tenantId, authHeaders).

Comprehensive test coverage

File: api-route-auth.test.ts (199 lines)

Tests cover:

  • Cookie filtering: non-auth cookies stripped; auth cookies (better-auth.session_token, __Secure-better-auth.session_data) preserved
  • Missing auth: returns 401 without calling agents-api
  • Valid session: calls /api/auth/get-session with filtered cookies, returns session data
  • Bearer-only: returns immediately with authType: 'bearer' without fetching session
  • Bearer + cookie: bearer takes priority; no session fetch
  • Session fallback: when no bearer token, falls back to cookie-based session auth
  • Project permission rejection: user with canEdit: false gets 403 when level: 'edit' is required

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

Copy link
Copy Markdown
Contributor

@pullfrog pullfrog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: has a runtime-breaking bug. The auth hardening itself is well-structured — clean discriminated union types, cookie filtering via the shared isAuthCookie utility, proper session validation, and good test coverage.

Critical: conversations/[conversationId]/route.ts line 716 (outside the diff) still calls signozQuery(usageEventsPayload, tenantId, cookieHeader, authHeader) with 4 arguments using the now-removed cookieHeader and authHeader variables. signozQuery was updated to accept 3 parameters (payload, tenantId, authHeaders: Record<string, string>), and the old variable declarations were removed on lines 677–678. This will throw a ReferenceError at runtime whenever the usage-events query path executes. Fix: signozQuery(usageEventsPayload, tenantId, authResult.headers) (same pattern as the corrected call on line 701).

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow run | Using Claude Opus𝕏

payloads.map(async (p, i) => {
const batchStart = Date.now();
const result = await signozQuery(p, tenantId, cookieHeader, authHeader);
const result = await signozQuery(p, tenantId, authResult.headers);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call was correctly updated, but there is a second signozQuery call ~15 lines below (line 716, outside this diff hunk) that was not updated:

const result = await signozQuery(usageEventsPayload, tenantId, cookieHeader, authHeader);

cookieHeader and authHeader no longer exist in scope (removed in this PR on old lines 677–678). This will throw a ReferenceError at runtime. It should be:

const result = await signozQuery(usageEventsPayload, tenantId, authResult.headers);

Comment on lines +156 to +164
if (authorizationHeader) {
return {
ok: true,
authType: 'bearer',
authorizationHeader,
headers: {
Authorization: authorizationHeader,
},
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Informational (not blocking): The bearer path trusts the token without server-side validation here — it passes it through to agents-api. This is a valid pass-through pattern as long as every downstream agents-api endpoint (e.g. /manage/tenants/{id}/signoz/query, /manage/tenants/{id}/signoz/health) enforces bearer token validation and doesn't silently fall through to an unauthenticated code path. Worth a quick sanity check if not already confirmed.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: High

🔴❗ Critical (1) ❗🔴

Inline Comments:

  • 🔴 Critical: conversations/[conversationId]/route.ts:716 Undefined variables cookieHeader and authHeader cause runtime error

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: api-route-auth.ts:147 Silent error swallowing hinders debugging

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: api-route-auth.ts:164 Bearer tokens pass through without validation — confirm this is intentional

📋 Overall Assessment

What this PR does well:

  • ✅ Clean, well-structured auth helper module with good separation of concerns
  • ✅ Proper discriminated union types (ok: true/false) for result handling
  • ✅ Good test coverage for the core authentication flows
  • ✅ Cookie filtering using shared isAuthCookie from @inkeep/agents-core ensures consistency
  • ✅ Consistent integration pattern across all 5 modified route handlers

Areas for attention:

  1. Critical bug must be fixed — Line 716 in conversations route references undefined variables
  2. Missing changeset — This PR modifies runtime behavior for agents-manage-ui and should have a changeset (per AGENTS.md guidelines)
  3. Error logging — Consider adding structured logging for auth failures to aid debugging

📦 Missing Changeset

Per the AGENTS.md guidelines, this PR should include a changeset since it modifies runtime behavior:

pnpm bump patch --pkg agents-manage-ui "Add authentication helpers for API routes"

🚫 REQUEST CHANGES

Summary: The auth module design is solid and the integration is consistent, but there's a critical runtime bug that must be fixed before merging. Line 716 in the conversations route still references the old cookieHeader and authHeader variables that were removed during the refactoring — this will cause a ReferenceError when the usage events query runs. The fix is straightforward (use authResult.headers like the other call sites).


Discarded (0)

No findings were discarded.

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator (manual) 3 0 0 0 3 0 0
Total 3 0 0 0 3 0 0

Note: Subagent dispatch failed; conducted manual review.

return {
ok: false,
response: internalErrorResponse(),
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Silent error swallowing hinders debugging

Issue: Errors are caught and discarded without logging. When auth validation fails for non-401 reasons (network errors, JSON parse errors, unexpected responses), the error details are lost.

Why: Without logging, debugging production auth issues requires reproducing the exact scenario. Adding structured logging here would significantly help diagnose intermittent auth failures.

Fix: Consider adding error logging before returning the error response:

} catch (error) {
  const logger = getLogger('api-route-auth');
  logger.error({ error: error instanceof Error ? error.message : String(error) }, 'Session validation failed');
  return {
    ok: false,
    response: internalErrorResponse(),
  };
}

Refs:

headers: {
Authorization: authorizationHeader,
},
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Consider: Bearer tokens pass through without validation

Issue: Bearer tokens are accepted and forwarded without any validation. This differs from session auth which validates against agents-api.

Why: This is likely intentional (agents-api validates the token downstream), but worth confirming:

  1. Is there any risk from accepting malformed bearer tokens?
  2. Should there be basic format validation (e.g., minimum length)?
  3. Are there specific token patterns expected?

If this is intentional and agents-api validates all bearer tokens, this is fine as-is. Just flagging for explicit confirmation of the design decision.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 22, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: High

🔴❗ Critical (1) ❗🔴

Inline Comments:

  • 🔴 Critical: route.ts:716 Undefined variables cookieHeader and authHeader still cause runtime ReferenceError

🕐 Pending Recommendations (2)


📦 Missing Changeset

This PR modifies runtime behavior for agents-manage-ui and should have a changeset per AGENTS.md guidelines:

pnpm bump patch --pkg agents-manage-ui "Add authentication helpers for API routes"

🚫 REQUEST CHANGES

Summary: The critical bug at line 716 flagged in the prior review is still not fixed. The code still references cookieHeader and authHeader which were removed — this will throw a ReferenceError at runtime whenever the usage-events query executes. The fix is a simple one-liner (use authResult.headers like line 701).


Discarded (0)

No findings were discarded.

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator (delta) 1 0 0 0 1 2 0
Total 1 0 0 0 1 2 0

Note: Delta review scoped to 3 commits since last automated review (commit 2599e18c4582).

@github-actions github-actions Bot deleted a comment from claude Bot Apr 22, 2026
inkeep-oss-sync Bot pushed a commit to inkeep/agents-optional-local-dev that referenced this pull request Apr 22, 2026
…e + preview paths (#198)

* fix(ci): close remaining silent strandings in release cascade + bridge + preview paths

Bundle of 10 code + 1 doc changes, each closing a distinct silent-failure
mode where CI would pass green (or sit red invisibly) while production
work stranded. Complement to #194's release-cascade hardening — each fix
here addresses a gap that pass missed.

Release cascade (5)
- release-handler.yml: pin GitHub Release to client_payload.commit_sha,
  not origin/main. Auto-format landing between sync and handler used to
  create a SHA drift where Vercel deployed a newer commit than npm got.
- release-handler.yml: 3-attempt bash retry around gh release view/edit/
  create. Mirrors the dispatch-side retry from #174 — one transient 5xx
  used to leave npm published with no Release (so no Vercel deploy) and
  no alert.
- release-handler.yml: notify-on-failure job. If the success job itself
  errors (app-token, checkout, or handler logic), the dispatch was
  already delivered and public-side retries don't re-fire. Without this
  notifier, npm publishes but the cascade stops silently.
- public-mirror-sync.yml: 3-attempt retry on sync PR approval. One
  dropped GH API call used to leave the sync PR open with no approval;
  now exhaustion exits non-zero so the workflow turns red and the
  janitor's sync-PR sweep (below) catches long-open PRs as a safety net.
- scripts/check-monorepo-traps.mjs: flip check:override-masks-bump from
  warn to hard-fail. Current tree is clean of masks, so flipping is safe
  and prevents future PR #170-class silent divergence.

Preview envs (2)
- deploy-vercel-preview.sh: validate captured deployment URL matches
  *.vercel.app pattern before alias-set. The log-grep fallback could
  silently capture a docs/inspection URL and alias api.preview.inkeep.com
  to the wrong place.
- public-agents-preview-environments.yml: add teardown-failure-notify
  job. Railway and Vercel teardowns run in parallel by design; this
  closes the observability gap when either silently fails — Railway
  continues billing, Vercel env vars pollute, and the next 6h janitor
  sweep re-attempts cleanup.

Dependency sync + bridge (2)
- dependabot-sync-root-lockfile.yml: on push failure, full-reset onto
  origin/$HEAD_REF and re-regenerate both lockfiles (root + public/agents)
  from the new baseline. Previous simple rebase would replay our stale
  lockfile on top of a different package.json after a Dependabot
  force-push, yielding inconsistent committed state that either broke
  --frozen-lockfile or silently shipped drift into main.
- bridge-public-pr-to-monorepo.mjs (both copies): truncate bridged PR
  body to fit GitHub's 65,536-char limit, with a link to the original PR.
  Was failing 422 "body is too long" on Dependabot mega-bumps, stranding
  those PRs outside the agents-private review surface.

Janitor (1)
- public-agents-preview-janitor.yml: new sweep-stale-sync-prs job. Every
  6h, identify open copybara/sync PRs on inkeep/agents >4h old, open
  (idempotent) tracking issue on agents-private, re-dispatch
  public-mirror-sync.yml. Does NOT auto-close the PR — destructive
  actions stay human-driven.

Docs (1)
- CI_RUNBOOK.md: partial-npm-publish recovery path. Documents what to
  do when pnpm changeset publish fails on package N of 10 (some live,
  some not): confirm via npm view, re-run the failed job (publish is
  idempotent), verify all 10 present before unblocking cascade.

Each change verified in isolation with pnpm check:structural; YAML
validated with safe_load; bridge truncation logic tested at boundary.

* fix(ci): add checkout step and remove heredoc indentation in release-handler notify

Addresses review feedback on PR #198:
- notify-success-failure now does a Checkout before gh calls, matching
  the failure job pattern (without it, gh CLI lacks git context and
  issue-create would silently fail with 'fatal: not a git repository').
- Body is now built with echo into a temp file instead of a heredoc.
  The original 10-space indent inside run: | would have been preserved
  literally by bash, making the tracking issue body unreadable.

* fix(ci): address review feedback on release-handler retry + docs drift

Review feedback from #198 (Claude + Pullfrog):

release-handler.yml:
- Capture stderr in gh_with_retry so on-call sees the actual API error
  (rate-limit, 5xx, auth) instead of a generic 'command failed' line.
  Was making incident triage 15-30m slower. (Claude Major)
- Stop wrapping 'gh release view' in gh_with_retry. It returns non-zero
  for the legitimate 'release doesn't exist' case (the normal new-release
  happy path); retrying burned 15s of backoff before falling through to
  create. Only mutating calls (edit/create) need retry. (Pullfrog + Claude)
- notify-success-failure now fails loudly (exit 1) if gh issue create
  fails, matching the failure job's pattern. Silent swallow would mean
  the success handler failed AND the notifier failed with no visible
  signal beyond a red check — exactly the class this PR closes elsewhere.
  (Claude Major)

Docs drift (Pullfrog):
- AGENTS.md: remove '(soft today)' / 'non-blocking today' annotations
  from override-masks-bump in both the cheatsheet and the trap-list.
- AGENTS.md trap #4: flip 'Currently a warning, not a hard fail' to
  'Hard-fail (flipped...)' to match the code.
- CI_RUNBOOK.md: rename the override-masks-bump section header from
  'warns (non-blocking today)' to 'fails — root override masks a
  workspace bump'.

Preview janitor (Claude Consider):
- Document that sweep-stale-sync-prs runs unconditionally regardless of
  PREVIEW_ENVIRONMENTS_ENABLED. Pr body said so; workflow didn't.

Bridge script (Claude Consider):
- console.log when PR body is truncated, with original + new sizes.
  Helps CI log debugging when a bridged PR looks shorter than expected.
  Applied to both copies (public/agents + public/agents-optional-local-dev).

Intentionally not addressed:
- dependabot-sync $MSG staleness after reset: Pullfrog flagged as
  intentional (message describes the operation, not the specific
  baseline), Claude flagged as cosmetic minor. Siding with Pullfrog.
- Preview teardown notify swallowed error: lower severity given the
  janitor re-attempts cleanup every 6h; accepting current pattern.

* fix(ci): pre-fetch PR base into agents-private before git apply --3way

Closes the bridge-failure class seen on inkeep/agents#3171 (and
eventually on any bridged PR with a conflicting hunk).

Root cause:
`git apply --index --3way` resolves the patch's `index <old>..<new>`
blob SHAs against the local repo's object store. The SHAs come from
inkeep/agents' object graph; agents-private has never fetched from
inkeep/agents, so those SHAs aren't resolvable. Clean patches work
(no --3way fallback needed); any conflict triggers --3way, which
fails with 'repository lacks the necessary blob to perform 3-way
merge' — bridge stops with no useful diagnostic, PR strands outside
the canonical review surface.

Fix:
Before calling git apply, shallow-fetch the PR's base commit from
the public repo into agents-private's object store. GitHub permits
fetching by SHA when the SHA is reachable from a ref (PR base is on
main, always reachable), and shallow=1 keeps it cheap. After fetch,
--3way can find the `<old>` blob and resolve conflicts into merge
markers instead of a hard error.

Why not drop --3way:
Dropping --3way makes every conflicting PR fail hard with no
recovery path. Pre-fetching preserves the --3way fallback so
conflicting hunks produce visible conflict markers in the bridged
PR, which is reviewable rather than opaque.

Failure handling:
If the fetch itself fails (rate-limit, network flake), we log and
proceed. The subsequent git apply will hit the same blob error as
before the fix — no regression, just no improvement for that run.

Applied to both script copies: public/agents and
public/agents-optional-local-dev. Bundled into PR #198 since it's
the same release/sync hardening scope.

* Revert "fix(ci): pre-fetch PR base into agents-private before git apply --3way"

This reverts commit e7e0640e2345bbd7cdea1185eb3abc666f2c60fa.

* fix(ci): switch bridge patch fetch from .patch to .diff media type

Root cause of bridge failures on multi-commit PRs (eg inkeep/agents#3171,
9 commits):

`gh api .../pulls/N Accept: application/vnd.github.patch` returns a
mailbox-format patch series — one patch per PR commit, each with
`index <old>..<new>` lines referencing INTERMEDIATE blob SHAs created
during the PR's history on inkeep/agents. Those intermediates exist
only in inkeep/agents' object store; agents-private has never seen
them. The first patch applies against agents-private's file (blob
content matches the PR base even if SHAs differ), git's index advances
to the intermediate state (only referenced by SHA, never fetched as
an object). Any subsequent conflicting hunk triggers `git apply
--3way` fallback, which needs the intermediate blob → 'repository
lacks the necessary blob' → stranded PR.

Single-commit PRs don't hit this (no intermediates). That's why the
bridge used to work for most Dependabot bumps but started failing on
longer branches like #3171.

Fix: switch `Accept: application/vnd.github.patch` →
`Accept: application/vnd.github.diff` in both bridge-public-pr-to-
monorepo.mjs copies. .diff returns a single consolidated base-vs-head
unified diff with only the PR's base blob SHAs referenced. Those are
content-identical to agents-private's copies (Copybara mirrors blobs
1:1 without content rewrites for .ts files), so --3way can resolve
them locally from agents-private's own object store.

Also updated githubRequest's text-response check to include .diff
alongside .patch (both are text, not JSON).

Previous pre-fetch commit (`e7e0640e2`) was reverted in a prior commit
— it was addressing the wrong layer (only PR-base blob, not
intermediates). This commit supersedes that approach entirely.

* chore(ci): biome format fix on bridge script console.log

GitOrigin-RevId: a26d5dac74ed9938fc512703af21a25d82616504
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: High

🔴❗ Critical (1) ❗🔴

Inline Comments:

  • 🔴 Critical: route.ts:716 Undefined variables cookieHeader and authHeader cause runtime ReferenceError

🕐 Pending Recommendations (2)

  • 🟡 api-route-auth.ts:147 — Silent error swallowing in catch block hinders debugging; consider adding structured logging
  • 💭 api-route-auth.ts:164 — Bearer tokens pass through without validation; confirm this is intentional (agents-api validates downstream)

📦 Missing Changeset

This PR modifies runtime behavior for agents-manage-ui and should have a changeset per AGENTS.md guidelines:

pnpm bump patch --pkg agents-manage-ui "Add authentication helpers for API routes"

📋 Delta Review Note

The 3 commits since the last review (fe9c4d9d4, 37f8a3e4a, b33cc4604) only touched .github/scripts/bridge-public-pr-to-monorepo.mjs — a CI bridge script unrelated to the auth code under review. The auth module itself has not changed since the prior reviews.


🚫 REQUEST CHANGES

Summary: The critical bug at line 716 flagged in prior reviews remains unfixed. The code still references cookieHeader and authHeader which were removed — this will throw a ReferenceError at runtime whenever the usage-events query executes. The fix is a simple one-liner: use authResult.headers like the corrected call on line 701. A 1-click suggestion is provided in the inline comment.


Discarded (0)

No findings were discarded.

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator (delta) 1 0 0 0 1 2 0
Total 1 0 0 0 1 2 0

Note: Delta review scoped to 3 commits since last automated review. Sub-reviewers not dispatched as delta only touched CI scripts unrelated to auth code.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 22, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Apr 22, 2026

Ito Test Report ❌

16 test cases ran. 1 failed, 1 additional finding, 14 passed.

Across 16 QA scenarios, 14 passed and 2 failed, indicating strong stability in render regeneration and mockData preservation, permission and session enforcement, bearer/session auth routing behavior, concurrent trace request isolation, retry envelope coherence, and usage-events fault-tolerance.
The most important issues are two medium-severity defects: the unauthenticated conversation details page incorrectly renders a project-not-found (404-style) state instead of an auth-required experience even though the API correctly returns 401, and trace endpoints handle wrong-tenant requests inconsistently while the conversation trace handler contains a confirmed stale signozQuery call-site bug that can drop usage-events data.

❌ Failed (1)
Category Summary Screenshot
Traces 🟠 Confirmed real bug: tenant-mismatch responses are mapped inconsistently across trace endpoints (health route returns 200 connection_failed while conversation/span routes return error HTTP statuses), violating consistent tenant-safe rejection behavior; conversation route also contains an undeclared-variables call-site defect in usage-events query path. TRACES-3
🟠 Inconsistent tenant rejection in trace endpoints
  • What failed: Wrong-tenant requests are handled inconsistently across traces endpoints, and the conversation route has a stale call site that invokes signozQuery with undeclared variables, causing the usage-events branch to fail and return empty fallback data.
  • Impact: Users investigating traces can receive inconsistent error semantics across endpoints and miss usage-events data in conversation details. This weakens debuggability for tenant-scoped trace analysis even when core timeline data still loads.
  • Steps to reproduce:
    1. Call GET /api/traces?tenantId=wrong-tenant with a valid bearer bypass token.
    2. Call GET /api/traces/conversations/conversation-1?tenantId=wrong-tenant with the same token.
    3. Call GET /api/traces/spans/span-1?tenantId=wrong-tenant&conversationId=conversation-1 and compare status semantics across the three endpoints.
  • Stub / mock context: A deterministic local auth and traces-fixture bypass configuration was enabled to keep tenant-routing checks repeatable; this setup does not explain the stale variable call-site defect found in production route code.
  • Code analysis: I reviewed the traces health, conversation, and span handlers plus shared auth helper logic. The conversation handler now defines signozQuery with authHeaders but still calls it once with removed variables (cookieHeader, authHeader), which is a concrete production-code defect; error handling across trace endpoints also maps upstream tenant/auth failures to different HTTP semantics.
  • Why this is likely a bug: The conversation route contains a mismatched function call that references undefined identifiers after a signature refactor, which is a direct code defect independent of test setup.

Relevant code:

agents-manage-ui/src/app/api/traces/conversations/[conversationId]/route.ts (lines 102-106)

async function signozQuery(
  payload: any,
  tenantId: string,
  authHeaders: Record<string, string>
): Promise<SigNozResp> {

agents-manage-ui/src/app/api/traces/conversations/[conversationId]/route.ts (lines 714-722)

(async () => {
      const batchStart = Date.now();
      try {
        const result = await signozQuery(usageEventsPayload, tenantId, cookieHeader, authHeader);
        logger.info(
          { batch: 'usage-events', ms: Date.now() - batchStart },
          'signoz batch complete'
        );

agents-manage-ui/src/app/api/traces/route.ts (lines 233-253)

if (!response.ok) {
    logger.error(
      {
        error: data,
        status: response.status,
      },
      'SigNoz health check failed'
    );

    let errorMessage = 'Failed to check SigNoz configuration';
    if (response.status === 401 || response.status === 403) {
      errorMessage = 'Authentication required';
    } else if (data?.error) {
      errorMessage = data.error;
    }

    return NextResponse.json({
      status: 'connection_failed',
✅ Passed (14)
Category Summary Screenshot
Conversation Usage-events fault-mode handling preserved conversation timeline integrity and returned stable responses. N/A
Conversation Repeated stale/fault usage-events calls stayed non-crashing and kept core conversation data intact. N/A
Conversation Alternating usage-events failure conditions did not break Promise.all assembly ordering or completeness. N/A
Render Regeneration completed and preserved existing mockData during instruction-based updates after deterministic local interception removed external dependency noise. N/A
Render Permission-denial path returned 403 Forbidden and blocked generation when edit access was not granted. N/A
Render Cross-tenant and mismatched project scope request was denied with no generation stream output. N/A
Render Permission-preflight failure returned a controlled 500 validation error and did not emit partial stream chunks. N/A
Session Valid Better Auth session passed route authentication and reached downstream project validation, which then returned project-not-found for missing fixture data. N/A
Session Requests using only non-auth cookies and unauthenticated conversation-trace access were correctly rejected with 401 responses. N/A
Traces Request with both bearer token and Better Auth cookie returned 200 without session-auth rejection; response reached upstream path (connection_failed from SigNoz config). TRACES-1
Traces Bearer-only request to /api/traces returned 200 (non-401), demonstrating bypass bearer auth works without Better Auth session cookies. TRACES-2
Traces Concurrent batch and non-batch trace POST requests stayed isolated during auth-header rotation. TRACES-4
Traces Malformed bearer headers correctly fell back to session auth when a valid session path was available. TRACES-5
Traces Repeated batch upstream failures returned coherent retry/error envelopes without mixed payload segments. TRACES-6
ℹ️ Additional Findings (1)

These findings are unrelated to the current changes but were observed during testing.

Category Summary Screenshot
Conversation 🟠 Unauthenticated conversation details page shows a project-not-found state while the underlying traces API correctly returns authentication required. CONVERSATION-2
🟠 Unauthenticated conversation page shows 404
  • What failed: The UI resolves to a project 404 state instead of an authentication-required experience, while the API route correctly enforces auth and returns 401.
  • Impact: Unauthenticated users are shown a misleading "project not found" failure instead of a clear login-required state. This obscures root cause and can send users down incorrect debugging or support paths.
  • Steps to reproduce:
    1. Open /default/projects/project-1/traces/conversations/conversation-1 in a fresh browser session without logging in.
    2. Observe the page-level error state rendered for the project route.
    3. Call /api/traces/conversations/conversation-1?tenantId=default without auth and compare the HTTP 401 response.
  • Stub / mock context: The run used local auth/session helpers and seeded conversation fixtures; this case explicitly cleared login state and compared page behavior against direct unauthenticated API responses without adding endpoint response stubs.
  • Code analysis: The traces conversation API route enforces authentication up front, but the project layout always prefetches project permissions and converts fetch failures into a project-scoped FullPageError. In the same flow, project permission middleware maps access failures to not_found, so unauthenticated page entry can surface as a 404-style project error rather than an auth-first UX.
  • Why this is likely a bug: Production code paths map the unauthenticated page load into a project-not-found UI state while the paired API path explicitly reports authentication required, creating a clear and reproducible auth-state mismatch.

Relevant code:

agents-manage-ui/src/app/api/traces/conversations/[conversationId]/route.ts (lines 652-659)

export async function GET(
  req: NextRequest,
  context: RouteContext<'/api/traces/conversations/[conversationId]'>
) {
  const authResult = await requireApiRouteSessionOrBearer(req);
  if (!authResult.ok) {
    return authResult.response;
  }

agents-manage-ui/src/app/[tenantId]/projects/[projectId]/layout.tsx (lines 15-31)

try {
  const queryClient = new QueryClient();
  const permissions = await fetchProjectPermissions(tenantId, projectId);
  queryClient.setQueryData(projectQueryKeys.permissions(tenantId, projectId), permissions);

  return <HydrationBoundary state={dehydrate(queryClient)}>{children}</HydrationBoundary>;
} catch (error) {
  return (
    <FullPageError
      errorCode={getErrorCode(error)}
      context="project"
    />
  );
}

agents-api/src/middleware/projectAccess.ts (lines 92-98)

if (!hasAccess) {
  throw createApiError({
    code: 'not_found',
    message: 'Project not found',
    instance: c.req.path,
  });
}

Commit: b33cc46

View Full Run


Tell us how we did: Give Ito Feedback

Zeeeepa pushed a commit to Zeeeepa/inkeep_agents that referenced this pull request Apr 23, 2026
…e + preview paths (inkeep#198) (inkeep#3172)

* fix(ci): close remaining silent strandings in release cascade + bridge + preview paths

Bundle of 10 code + 1 doc changes, each closing a distinct silent-failure
mode where CI would pass green (or sit red invisibly) while production
work stranded. Complement to inkeep#194's release-cascade hardening — each fix
here addresses a gap that pass missed.

Release cascade (5)
- release-handler.yml: pin GitHub Release to client_payload.commit_sha,
  not origin/main. Auto-format landing between sync and handler used to
  create a SHA drift where Vercel deployed a newer commit than npm got.
- release-handler.yml: 3-attempt bash retry around gh release view/edit/
  create. Mirrors the dispatch-side retry from inkeep#174 — one transient 5xx
  used to leave npm published with no Release (so no Vercel deploy) and
  no alert.
- release-handler.yml: notify-on-failure job. If the success job itself
  errors (app-token, checkout, or handler logic), the dispatch was
  already delivered and public-side retries don't re-fire. Without this
  notifier, npm publishes but the cascade stops silently.
- public-mirror-sync.yml: 3-attempt retry on sync PR approval. One
  dropped GH API call used to leave the sync PR open with no approval;
  now exhaustion exits non-zero so the workflow turns red and the
  janitor's sync-PR sweep (below) catches long-open PRs as a safety net.
- scripts/check-monorepo-traps.mjs: flip check:override-masks-bump from
  warn to hard-fail. Current tree is clean of masks, so flipping is safe
  and prevents future PR inkeep#170-class silent divergence.

Preview envs (2)
- deploy-vercel-preview.sh: validate captured deployment URL matches
  *.vercel.app pattern before alias-set. The log-grep fallback could
  silently capture a docs/inspection URL and alias api.preview.inkeep.com
  to the wrong place.
- public-agents-preview-environments.yml: add teardown-failure-notify
  job. Railway and Vercel teardowns run in parallel by design; this
  closes the observability gap when either silently fails — Railway
  continues billing, Vercel env vars pollute, and the next 6h janitor
  sweep re-attempts cleanup.

Dependency sync + bridge (2)
- dependabot-sync-root-lockfile.yml: on push failure, full-reset onto
  origin/$HEAD_REF and re-regenerate both lockfiles (root + public/agents)
  from the new baseline. Previous simple rebase would replay our stale
  lockfile on top of a different package.json after a Dependabot
  force-push, yielding inconsistent committed state that either broke
  --frozen-lockfile or silently shipped drift into main.
- bridge-public-pr-to-monorepo.mjs (both copies): truncate bridged PR
  body to fit GitHub's 65,536-char limit, with a link to the original PR.
  Was failing 422 "body is too long" on Dependabot mega-bumps, stranding
  those PRs outside the agents-private review surface.

Janitor (1)
- public-agents-preview-janitor.yml: new sweep-stale-sync-prs job. Every
  6h, identify open copybara/sync PRs on inkeep/agents >4h old, open
  (idempotent) tracking issue on agents-private, re-dispatch
  public-mirror-sync.yml. Does NOT auto-close the PR — destructive
  actions stay human-driven.

Docs (1)
- CI_RUNBOOK.md: partial-npm-publish recovery path. Documents what to
  do when pnpm changeset publish fails on package N of 10 (some live,
  some not): confirm via npm view, re-run the failed job (publish is
  idempotent), verify all 10 present before unblocking cascade.

Each change verified in isolation with pnpm check:structural; YAML
validated with safe_load; bridge truncation logic tested at boundary.

* fix(ci): add checkout step and remove heredoc indentation in release-handler notify

Addresses review feedback on PR inkeep#198:
- notify-success-failure now does a Checkout before gh calls, matching
  the failure job pattern (without it, gh CLI lacks git context and
  issue-create would silently fail with 'fatal: not a git repository').
- Body is now built with echo into a temp file instead of a heredoc.
  The original 10-space indent inside run: | would have been preserved
  literally by bash, making the tracking issue body unreadable.

* fix(ci): address review feedback on release-handler retry + docs drift

Review feedback from inkeep#198 (Claude + Pullfrog):

release-handler.yml:
- Capture stderr in gh_with_retry so on-call sees the actual API error
  (rate-limit, 5xx, auth) instead of a generic 'command failed' line.
  Was making incident triage 15-30m slower. (Claude Major)
- Stop wrapping 'gh release view' in gh_with_retry. It returns non-zero
  for the legitimate 'release doesn't exist' case (the normal new-release
  happy path); retrying burned 15s of backoff before falling through to
  create. Only mutating calls (edit/create) need retry. (Pullfrog + Claude)
- notify-success-failure now fails loudly (exit 1) if gh issue create
  fails, matching the failure job's pattern. Silent swallow would mean
  the success handler failed AND the notifier failed with no visible
  signal beyond a red check — exactly the class this PR closes elsewhere.
  (Claude Major)

Docs drift (Pullfrog):
- AGENTS.md: remove '(soft today)' / 'non-blocking today' annotations
  from override-masks-bump in both the cheatsheet and the trap-list.
- AGENTS.md trap inkeep#4: flip 'Currently a warning, not a hard fail' to
  'Hard-fail (flipped...)' to match the code.
- CI_RUNBOOK.md: rename the override-masks-bump section header from
  'warns (non-blocking today)' to 'fails — root override masks a
  workspace bump'.

Preview janitor (Claude Consider):
- Document that sweep-stale-sync-prs runs unconditionally regardless of
  PREVIEW_ENVIRONMENTS_ENABLED. Pr body said so; workflow didn't.

Bridge script (Claude Consider):
- console.log when PR body is truncated, with original + new sizes.
  Helps CI log debugging when a bridged PR looks shorter than expected.
  Applied to both copies (public/agents + public/agents-optional-local-dev).

Intentionally not addressed:
- dependabot-sync $MSG staleness after reset: Pullfrog flagged as
  intentional (message describes the operation, not the specific
  baseline), Claude flagged as cosmetic minor. Siding with Pullfrog.
- Preview teardown notify swallowed error: lower severity given the
  janitor re-attempts cleanup every 6h; accepting current pattern.

* fix(ci): pre-fetch PR base into agents-private before git apply --3way

Closes the bridge-failure class seen on inkeep#3171 (and
eventually on any bridged PR with a conflicting hunk).

Root cause:
`git apply --index --3way` resolves the patch's `index <old>..<new>`
blob SHAs against the local repo's object store. The SHAs come from
inkeep/agents' object graph; agents-private has never fetched from
inkeep/agents, so those SHAs aren't resolvable. Clean patches work
(no --3way fallback needed); any conflict triggers --3way, which
fails with 'repository lacks the necessary blob to perform 3-way
merge' — bridge stops with no useful diagnostic, PR strands outside
the canonical review surface.

Fix:
Before calling git apply, shallow-fetch the PR's base commit from
the public repo into agents-private's object store. GitHub permits
fetching by SHA when the SHA is reachable from a ref (PR base is on
main, always reachable), and shallow=1 keeps it cheap. After fetch,
--3way can find the `<old>` blob and resolve conflicts into merge
markers instead of a hard error.

Why not drop --3way:
Dropping --3way makes every conflicting PR fail hard with no
recovery path. Pre-fetching preserves the --3way fallback so
conflicting hunks produce visible conflict markers in the bridged
PR, which is reviewable rather than opaque.

Failure handling:
If the fetch itself fails (rate-limit, network flake), we log and
proceed. The subsequent git apply will hit the same blob error as
before the fix — no regression, just no improvement for that run.

Applied to both script copies: public/agents and
public/agents-optional-local-dev. Bundled into PR inkeep#198 since it's
the same release/sync hardening scope.

* Revert "fix(ci): pre-fetch PR base into agents-private before git apply --3way"

This reverts commit e7e0640e2345bbd7cdea1185eb3abc666f2c60fa.

* fix(ci): switch bridge patch fetch from .patch to .diff media type

Root cause of bridge failures on multi-commit PRs (eg inkeep#3171,
9 commits):

`gh api .../pulls/N Accept: application/vnd.github.patch` returns a
mailbox-format patch series — one patch per PR commit, each with
`index <old>..<new>` lines referencing INTERMEDIATE blob SHAs created
during the PR's history on inkeep/agents. Those intermediates exist
only in inkeep/agents' object store; agents-private has never seen
them. The first patch applies against agents-private's file (blob
content matches the PR base even if SHAs differ), git's index advances
to the intermediate state (only referenced by SHA, never fetched as
an object). Any subsequent conflicting hunk triggers `git apply
--3way` fallback, which needs the intermediate blob → 'repository
lacks the necessary blob' → stranded PR.

Single-commit PRs don't hit this (no intermediates). That's why the
bridge used to work for most Dependabot bumps but started failing on
longer branches like inkeep#3171.

Fix: switch `Accept: application/vnd.github.patch` →
`Accept: application/vnd.github.diff` in both bridge-public-pr-to-
monorepo.mjs copies. .diff returns a single consolidated base-vs-head
unified diff with only the PR's base blob SHAs referenced. Those are
content-identical to agents-private's copies (Copybara mirrors blobs
1:1 without content rewrites for .ts files), so --3way can resolve
them locally from agents-private's own object store.

Also updated githubRequest's text-response check to include .diff
alongside .patch (both are text, not JSON).

Previous pre-fetch commit (`e7e0640e2`) was reverted in a prior commit
— it was addressing the wrong layer (only PR-base blob, not
intermediates). This commit supersedes that approach entirely.

* chore(ci): biome format fix on bridge script console.log

GitOrigin-RevId: a26d5dac74ed9938fc512703af21a25d82616504

Co-authored-by: Varun Varahabhotla <[email protected]>
@inkeep-oss-sync inkeep-oss-sync Bot closed this Apr 24, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Apr 24, 2026
* chore(sync): mirror #3171

* apply review

* style: auto-format with biome

* Update public/agents/agents-manage-ui/src/app/api/traces/conversations/[conversationId]/route.ts



* chore: add changeset for manage UI auth guards



---------









GitOrigin-RevId: 5a2dd89d0f85597bd83b9938fb778a08b50e71fd

Co-authored-by: inkeep-internal-ci[bot] <259778081+inkeep-internal-ci[bot]@users.noreply.github.com>
Co-authored-by: robert-inkeep <[email protected]>
Co-authored-by: Varun Varahabhotla <[email protected]>
Co-authored-by: Dima <[email protected]>
Co-authored-by: Dimitri POSTOLOV <[email protected]>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Dimitri POSTOLOV <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants