trace api to clickhouse by shagun-singh-inkeep · Pull Request #2465 · inkeep/agents

shagun-singh-inkeep · 2026-03-01T22:14:31Z

removes 500/502 errors
improve performance speed by a lot
*Better performance for large datasets
Single query for all data

changeset-bot · 2026-03-01T22:14:35Z

⚠️ No Changeset found

Latest commit: ee2c3b0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

vercel · 2026-03-01T22:14:35Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agents-api	Ready	Preview, Comment	Mar 2, 2026 4:29pm
agents-docs	Ready	Preview, Comment	Mar 2, 2026 4:29pm
agents-manage-ui	Ready	Preview, Comment	Mar 2, 2026 4:29pm

pullfrog · 2026-03-01T22:15:40Z

Replaces the SigNoz Builder Query API with direct ClickHouse SQL to fetch all conversation trace spans in a single query, then filters client-side — eliminating ~1,000 lines of verbose per-span-type builder query definitions.

agents-manage-ui/src/app/api/signoz/conversations/[conversationId]/route.ts — replaces 15+ individual builderQueries with one clickhouse_sql query against signoz_traces.distributed_signoz_index_v3; parses the flat series[].labels response (with attrs_str/attrs_num/attrs_bool JSON columns) into SpanRow objects; filters span types client-side via .filter() instead of server-side query separation
agents-api/src/utils/signozHelpers.ts — enforceSecurityFilters (builder filter injection) replaced by enforceQuerySecurity which sets variables.tenant_id / variables.project_id for SigNoz's {{.var}} substitution
agents-api/src/domains/manage/routes/signoz.ts — updated callsites and injectConversationIdFilter to operate on chQueries with __CONVERSATION_IDS__ placeholder replacement instead of builder filter manipulation
agents-manage-ui/src/app/api/signoz/route.ts — validation schema updated from builderQueries to chQueries
agents-manage-ui/src/components/traces/timeline/timeline-item.tsx — minor: drops .toISOString() from datetime formatting
agents-manage-ui/src/lib/api/signoz-stats.ts — no functional changes

^{｜ View workflow run ｜ Using Claude Code ｜ Triggered by Pullfrog ｜ pullfrog.com ｜ 𝕏}

itoqa · 2026-03-01T23:44:05Z

Ito Test Report ❌

25 test cases ran. 22 passed, 3 failed.

This verification run tested the SigNoz query migration from builder format to raw ClickHouse SQL (PR #2465). Core functionality works well: traces overview pages load correctly, conversation detail pages render, batch queries return paginated data, and tenant/project isolation via variable substitution is enforced. However, three issues were identified: a timestamp format regression in the timeline component, and a missing schema validation that allows the deprecated builder-format queries to bypass the new ClickHouse SQL security model.

✅ Passed (22)

Test Case	Summary	Timestamp
ROUTE-1	Traces overview page loads with conversations per day chart, stats cards, filters, and empty conversation list	2:51
ROUTE-2	Conversation detail page loads with heading, metrics cards, and activity timeline	15:28
ROUTE-3	Batch query endpoint returns paginationResponse with empty series and detailResponse null	12:48
ROUTE-4	AI calls, tool calls, and org-level statistics pages render correctly with zero-data states	19:31
ROUTE-5	SigNoz health check returns HTTP 200 with status ok, confirming proxy chain works	22:13
EDGE-1	Tenant isolation verified: SQL queries use {{.tenant_id}} variable substitution via baseWhere()	23:27
EDGE-2	Empty/nonexistent conversation handled gracefully with empty activities array and zero token counts	17:10
EDGE-4	Batch query with zero matching conversations returns empty series and null detailResponse without errors	12:49
EDGE-5	Large conversation query capped at LIMIT 10000 spans, page renders without performance issues	17:29
EDGE-6	Malformed timestamps handled gracefully with parseInt fallback to epoch and Number.isNaN() guards	17:47
EDGE-7	Conversation ID injection properly escapes single quotes via doubling for ClickHouse SQL	28:12
EDGE-8	Search and filter functionality works with ClickHouse SQL queries, filters apply correctly	51:05
ADV-1	Tenant spoofing prevented: enforceQuerySecurity overrides client-provided tenant_id	24:11
ADV-2	Project spoofing prevented: enforceQuerySecurity overrides client-provided project_id	24:29
ADV-4	SQL injection via conversation ID mitigated by server-derived IDs and quote escaping	30:40
ADV-5	Variable overwrite for tenant_id prevented by enforceQuerySecurity (time boundary variables pass through as expected)	57:00
ADV-6	Batch template manipulation: enforceQuerySecurity applied, detailResponse null when no matching IDs	57:20
ADV-7	Unbounded time range accepted (no max range enforcement), 30-second timeout provides backstop	57:40
LOGIC-1	getBool helper correctly parses boolean span attributes (true, 'true', 1)	18:04
LOGIC-2	Token usage stats page displays correct zero aggregates with breakdown sections	20:49
LOGIC-3	Parent-child activity hierarchy resolves correctly with tree building and indented rendering	18:34
LOGIC-4	Conversations per day chart renders with correct date labels and updates on time range filter changes	21:30

❌ Failed (3)

Test Case	Summary	Timestamp
ROUTE-6	Timeline dateTime attribute regression: uses Date object instead of ISO string	16:35
EDGE-3	Schema validation missing: old builderQueries format accepted instead of rejected	25:07
ADV-3	Builder-format bypass: no Zod validation enforces chQueries-only format, security risk	25:27

ROUTE-6 – Failed

Where: Timeline item component in conversation detail view
Steps to reproduce:
1. Navigate to a conversation detail page at /{tenantId}/projects/{projectId}/traces/conversations/{conversationId}
2. Inspect any timeline item's <time> element
3. Check the dateTime attribute value
What failed: The <time dateTime={isoDateTime}> attribute contains a non-ISO-8601 string (e.g., 'Sun Mar 01 2026 22:30:00 GMT+0000') instead of ISO format (e.g., '2026-03-01T22:30:00.000Z'). This breaks accessibility and SEO machine-readability requirements for the dateTime attribute.
Code analysis: Examined timeline-item.tsx where the timestamp is formatted. Line 228 creates a Date object that is passed directly to the dateTime attribute, which React converts to a non-ISO string via toString().

Relevant code:

agents-manage-ui/src/components/traces/timeline/timeline-item.tsx (lines 227–228)

const formattedDateTime = formatDateTime(activity.timestamp, { local: true });
const isoDateTime = new Date(activity.timestamp);

agents-manage-ui/src/components/traces/timeline/timeline-item.tsx (lines 600–606)

<time
  className="text-xs mb-2 inline-block text-gray-500 dark:text-white/50"
  dateTime={isoDateTime}
  title={formattedDateTime}
>
  {formattedDateTime}
</time>

Why this is likely a bug: The PR changed line 228 from .toISOString() to just new Date(). The HTML <time> element's dateTime attribute must contain a machine-readable ISO 8601 string for accessibility tools and search engines to parse correctly. Passing a Date object causes React to call .toString() which produces a locale-dependent, non-ISO format.
Introduced by this PR: Yes – this PR modified the relevant code (changed new Date(activity.timestamp).toISOString() to new Date(activity.timestamp))
Timestamp: 16:35

EDGE-3 – Failed

Where: SigNoz proxy route in agents-manage-ui
Steps to reproduce:
1. Send a POST request to /api/signoz?tenantId=default with a payload containing compositeQuery.queryType: 'builder' and compositeQuery.builderQueries instead of chQueries
2. Observe the response status code
What failed: The old builderQueries format was accepted with HTTP 200 instead of being rejected with HTTP 400. The Zod schema in route.ts requires chQueries, but this validation only applies at the Next.js layer. The agents-api signoz.ts route has no Zod validation and passes any payload directly to SigNoz.
Code analysis: Examined agents-manage-ui/src/app/api/signoz/route.ts (Zod schema at lines 14-18) and agents-api/src/domains/manage/routes/signoz.ts (no schema validation). The Next.js route validates payloads, but requests can potentially reach agents-api through other paths or the schema can be bypassed.

Relevant code:

agents-manage-ui/src/app/api/signoz/route.ts (lines 14–18)

const compositeQuerySchema = z.object({
  queryType: z.string(),
  panelType: z.string(),
  chQueries: z.record(z.string(), z.any()),
});

agents-api/src/domains/manage/routes/signoz.ts (lines 14–16)

app.post('/query', async (c) => {
  let payload = await c.req.json();
  const requestedProjectId = payload.projectId;

Why this is likely a bug: The PR migrated from builderQueries to chQueries format but did not add validation at the agents-api layer to reject the old format. This creates inconsistent behavior where the old format may still work depending on the request path.
Introduced by this PR: Yes – this PR introduced the chQueries migration but did not add schema enforcement at agents-api
Timestamp: 25:07

ADV-3 – Failed

Where: SigNoz proxy route in agents-api
Steps to reproduce:
1. Send a POST request to /api/signoz?tenantId=default with compositeQuery.queryType: 'builder' and compositeQuery.builderQueries
2. Observe that the request is accepted (200 OK)
What failed: Builder-format payloads are accepted without validation. While enforceQuerySecurity injects variables.tenant_id, builder-format queries in SigNoz do not use {{.variable}} substitution syntax. This means the injected tenant_id variable may have no effect on query scoping for builder-format queries, potentially allowing tenant isolation bypass.
Code analysis: Examined agents-api/src/utils/signozHelpers.ts which only injects variables. It does not validate the query format or reject builder queries. Builder queries use a different filtering mechanism (filters.items array) that the old enforceSecurityFilters function handled, but the new enforceQuerySecurity function does not.

Relevant code:

agents-api/src/utils/signozHelpers.ts (lines 1–18)

/**
 * Enforce tenant/project security on ClickHouse SQL queries by injecting
 * server-side variables. SQL queries use {{.tenant_id}} / {{.project_id}}
 * via SigNoz variable substitution, preventing SQL injection.
 *
 * Always overrides client-provided values to prevent spoofing.
 */
export function enforceQuerySecurity(payload: any, tenantId: string, projectId?: string): any {
  const modifiedPayload = JSON.parse(JSON.stringify(payload));
  if (!modifiedPayload.variables) {
    modifiedPayload.variables = {};
  }
  modifiedPayload.variables.tenant_id = tenantId;
  if (projectId) {
    modifiedPayload.variables.project_id = projectId;
  }
  return modifiedPayload;
}

Why this is likely a bug: The new enforceQuerySecurity function assumes all queries use ClickHouse SQL with {{.tenant_id}} variable substitution. Builder-format queries do not use this mechanism and are not filtered. An attacker could craft builder-format queries that bypass tenant isolation. Defense-in-depth via SigNoz API key scoping provides partial mitigation, but this is a security regression from the previous implementation which handled both formats.
Introduced by this PR: Yes – this PR removed the builder-format filter logic from enforceSecurityFilters and replaced it with enforceQuerySecurity which only handles ClickHouse SQL queries
Timestamp: 25:27

📋 View Recording

Screen Recording

claude

PR Review Summary

(1) Total Issues | Risk: Low

Re-review scoped to delta since 5e5c134c9d80 — Prior critical issues have been addressed.

✅ Addressed from Prior Review

The author has successfully addressed all critical and major issues from the prior review:

Issue	Fix
🔴 Tenant isolation bypass (missing `{{.tenant_id}}` validation)	✅ `enforceQuerySecurity` now rejects queries missing `{{.tenant_id}}` placeholder
🔴 Conversation detail SQL missing tenant filter	✅ Added `WHERE attributes_string['tenant.id'] = {{.tenant_id}}`
🔴 Span detail SQL missing tenant filter	✅ Added tenant filter to span query
🔴 `toISOString()` missing on timeline dateTime	✅ Fixed — `new Date().toISOString()`
🟠 No tests for `enforceQuerySecurity`	✅ 12 comprehensive tests added covering anti-spoofing, validation, edge cases
🟡 `replace` vs `replaceAll` for placeholders	✅ Changed to `replaceAll`

💭 Consider (2) 💭

💭 1) signozHelpers.ts String inclusion check for tenant filter is syntactically bypassable

Issue: The validation checks if the query string contains {{.tenant_id}} using a simple query.includes(). A technically malicious query could include the literal string in a comment or string literal (e.g., SELECT * FROM t WHERE 1=1 -- {{.tenant_id}}) and pass validation without actually filtering.

Why: Currently low risk since queries are server-constructed, not client-provided. But as a defense-in-depth measure, the validation could be strengthened.

Fix: Consider a regex that validates {{.tenant_id}} appears in a WHERE clause context, e.g., /WHERE[^;]*\{\{\.tenant_id\}\}/i. Or document that queries must always be server-constructed.

Refs: signozHelpers.ts:26

💭 2) signozHelpers.test.ts Missing test for non-string query values

Issue: The test suite doesn't cover the behavior when a chQuery entry has a non-string query value (e.g., null, object, number). The current implementation silently accepts these (the typeof query === 'string' check skips non-strings).

Why: Documents expected behavior and prevents regressions if the validation logic changes.

Fix: Add a test documenting that non-string query values are silently accepted:

it('skips validation for non-string query values', () => {
  const payload = {
    compositeQuery: {
      chQueries: {
        A: { query: null },
        B: { query: { nested: 'object' } },
      },
    },
  };
  const result = enforceQuerySecurity(payload, 'tenant-1');
  expect(result.error).toBeUndefined();
});

Refs: signozHelpers.test.ts

🧹 While You're Here (1) 🧹

🧹 1) conversations/[conversationId]/route.ts Direct SigNoz call path accepts untrusted tenant_id from query params

Issue: When shouldCallSigNozDirectly() returns true (no cookies but SIGNOZ_URL and SIGNOZ_API_KEY env vars set), the tenantId is taken directly from the query string (url.searchParams.get('tenantId') || 'default'). This code path bypasses the authenticated API proxy and doesn't validate that the caller is authorized for the requested tenant.

Why: An attacker could call /api/signoz/conversations/{id}?tenantId=victim-tenant without cookies to access another tenant's trace data, assuming the server has direct SigNoz access configured.

Fix: Either:

Remove the direct SigNoz call path entirely (always go through authenticated API)
Require a service-to-service auth token when cookies are missing
Validate tenant access before the direct call

Refs: route.ts:59-61, route.ts:241

🕐 Pending Recommendations (4)

These items were raised in the prior review and remain relevant but are outside the delta scope:

🟠 signoz-stats.ts Silent error swallowing — catch blocks return empty data without surfacing errors to UI
🟠 signoz-stats.ts:1203 Unbounded 100,000 row pagination limit risks memory exhaustion
🟠 signoz-stats.ts Missing 429 rate limit handling in axios retry config
🟡 signoz-stats.ts:105 Duplicated esc() SQL escape function

✅ APPROVE

Summary: The critical security issues from the prior review have been addressed effectively. The enforceQuerySecurity function now validates that all ClickHouse queries include the {{.tenant_id}} placeholder and returns clear error messages when they don't. The new test suite provides good coverage of the security-critical validation logic. The remaining items are either minor considerations or pre-existing issues outside the delta scope. Ship it! 🚀

Discarded (3)

Location	Issue	Reason Discarded
`signozHelpers.test.ts`	Missing test for absent compositeQuery	Code handles this correctly via optional chaining; low risk edge case
`signozHelpers.test.ts`	Missing test for empty chQueries object	Code handles this correctly; empty object means no queries to validate
`signoz-stats.ts`	Dead constants after migration	Outside delta scope; pre-existing

Reviewers (3)

Reviewer	Returned	Consider	While You're Here	Discarded
`pr-review-security-iam`	2	1	1	0
`pr-review-tests`	3	1	0	2
`pr-review-standards`	0	0	0	0
Total	5	2	1	2

Note: Delta review — reviewers scoped to changes since 5e5c134c9d80. Standards reviewer found no issues in the delta code quality.

claude

PR Review Summary

(3) Total Issues | Risk: Low

This is a full scope review after the pull_request:synchronize event. Prior critical issues (tenant isolation, toISOString() fix, missing tests for security function) have been successfully addressed.

✅ Addressed from Prior Reviews

The author has successfully addressed all critical and major issues from prior reviews:

Issue	Status
🔴 Tenant isolation bypass (missing `{{.tenant_id}}` validation)	✅ `enforceQuerySecurity` now rejects queries missing tenant filter
🔴 Conversation detail SQL missing tenant filter	✅ Added `WHERE attributes_string['tenant.id'] = {{.tenant_id}}`
🔴 `toISOString()` missing on timeline dateTime	✅ Fixed — `new Date().toISOString()` at line 228
🟠 No tests for `enforceQuerySecurity`	✅ 12 comprehensive tests added
🟡 `replace` vs `replaceAll` for placeholders	✅ Changed to `replaceAll`

🟡 Minor (1) 🟡

Inline Comments:

🟡 Minor: conversations/[conversationId]/route.ts:218 Hardcoded LIMIT 10000 without truncation warning

💭 Consider (2) 💭

💭 1) signoz.ts Backend lacks retry logic for SigNoz calls

Issue: The backend signoz.ts proxy makes direct axios calls to SigNoz without retry configuration (lines 90, 218, 243). While the frontend has axios-retry configured, transient failures from the backend will immediately fail.

Why: Given that the PR description mentions fixing 500/502 errors, adding retry logic here would improve resilience against transient SigNoz failures.

Fix: Configure axios-retry on the agents-api axios instance for SigNoz calls:

import axiosRetry from 'axios-retry';
axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

Refs: signoz.ts:90

💭 2) signozHelpers.test.ts Missing test for injectConversationIdFilter

Issue: The injectConversationIdFilter() function (signoz.ts lines 279-292) performs SQL placeholder replacement with conversation IDs but lacks test coverage.

Why: While IDs are server-generated and the escaping logic appears correct, tests would document expected behavior and catch regressions if the replacement logic changes.

Fix: Add unit tests verifying:

Normal case: ["conv-1", "conv-2"] → 'conv-1','conv-2'
Edge case: IDs with single quotes are properly escaped
Edge case: Empty array behavior

Refs: signoz.ts:279-292

🧹 While You're Here (1) 🧹

🧹 1) conversations/[conversationId]/route.ts Direct SigNoz call path accepts untrusted tenant_id

Issue: When shouldCallSigNozDirectly() returns true (no cookies but SIGNOZ_URL and SIGNOZ_API_KEY env vars set), the tenantId is sourced from the URL query parameter (line 243). This code path bypasses the authenticated API proxy.

Why: This is a pre-existing pattern that was not introduced by this PR, but it's worth noting: an attacker could potentially call /api/signoz/conversations/{id}?tenantId=victim-tenant without cookies to query another tenant's traces, assuming the server has direct SigNoz access configured.

Fix: Either:

Remove the direct SigNoz call path entirely (always go through authenticated API)
Require a service-to-service auth token when cookies are missing
Document this as an intentional internal-only code path

Refs: route.ts:59-61, route.ts:243

🕐 Pending Recommendations (4)

These items were raised in prior reviews and remain relevant:

🟠 signoz-stats.ts Silent error swallowing — catch blocks return empty data without surfacing errors to UI
🟠 signoz-stats.ts Missing 429 rate limit handling in axios retry config
🟠 signoz-stats.ts:1203 Unbounded 100,000 row pagination limit risks memory exhaustion
🟡 signoz-stats.ts:105 Duplicated esc() SQL escape function

💡 APPROVE WITH SUGGESTIONS

Summary: This PR successfully migrates the SigNoz trace API from builder-query format to direct ClickHouse SQL, achieving a net reduction of ~2,500 lines while maintaining robust security controls. The critical tenant isolation issues from prior reviews have been properly addressed with enforceQuerySecurity() now validating tenant filters and the comprehensive test suite. The remaining items are either minor improvements, pre-existing patterns worth addressing separately, or pending recommendations from prior reviews. Ship it! 🚀

Discarded (8)

Location	Issue	Reason Discarded
`signoz-stats.ts`	No test coverage for 1857-line file	Pre-existing code; tests would be valuable but out of PR scope
`conversations/route.ts`	No test coverage for 1180-line route	Significant but out of scope for this migration PR
`signoz-stats.ts`	Silent error swallowing	Already in Pending Recommendations from prior review
`signoz-stats.ts`	429 rate limit handling	Already in Pending Recommendations from prior review
`signoz-stats.ts:1203`	100,000 row limit	Already in Pending Recommendations from prior review
`signoz-stats.ts:105`	Duplicated esc() function	Already in Pending Recommendations from prior review
`conversations/route.ts`	Cross-tenant data exposure via direct SigNoz	Pre-existing pattern, moved to While You're Here
`spans/route.ts`	Cross-tenant data exposure via direct SigNoz	Same pre-existing pattern as above

Reviewers (6)

Reviewer	Returned	Consider	While You're Here	Inline Comments	Pending Recs	Discarded
`pr-review-security-iam`	2	0	1	0	0	1
`pr-review-sre`	5	1	0	1	3	0
`pr-review-tests`	5	1	0	0	0	4
`pr-review-errors`	1	0	0	0	1	0
`pr-review-consistency`	6	0	0	0	0	6
`pr-review-standards`	0	0	0	0	0	0
Total	19	2	1	1	4	11

Note: Consistency reviewer returned 6 INFO-level findings confirming patterns are correct (no issues). Standards reviewer found no issues.

claude · 2026-03-02T16:34:02Z

+              AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
+              AND ts_bucket_start BETWEEN {{.start_timestamp}} - 1800 AND {{.end_timestamp}}
+            ORDER BY timestamp ASC
+            LIMIT 10000


🟡 Minor: Hardcoded LIMIT without truncation warning

Issue: The query uses LIMIT 10000 but there's no indication to callers when results are truncated.

Why: Conversations with heavy tool usage could exceed 10,000 spans. When truncated, the timeline silently shows incomplete data with no warning.

Fix: Consider logging a warning when the limit is reached:

// After line 259 if (allSpans.length === 10000) { logger.warn({ conversationId }, 'Conversation spans truncated at 10000 limit'); }

Refs:

ClickHouse query performance docs

itoqa · 2026-03-02T19:09:24Z

Ito Test Report ❌

28 test cases ran. 26 passed, 2 failed.

This test run verified the SigNoz Trace API migration from builder-based queries to ClickHouse SQL queries (PR #2465). The majority of tests passed successfully, confirming that the new ClickHouse SQL query format works correctly for stats dashboards, traces overview, conversation detail timeline, batch query mode, search/filter functionality, and security enforcement (tenant isolation, anti-spoofing). Two edge case tests revealed error handling issues in the span detail endpoint and SigNoz proxy that return incorrect HTTP status codes.

✅ Passed (26)

Test Case	Summary	Timestamp
ROUTE-1	Stats dashboard loads correctly with ClickHouse SQL queries, all API requests use chQueries format	85:47
ROUTE-2	Traces overview page loads with filter controls, search, stat cards, and time range filter works correctly	89:53
ROUTE-3	Conversation detail timeline renders with diverse activity types including tool calls, AI generations, user messages	95:16
ROUTE-4	Span detail endpoint includes tenant_id filter in SQL query (verified in source code at route.ts:73)	96:07
ROUTE-5	AI Calls Breakdown page loads with stat cards and breakdown tables, time range filter works	84:48
ROUTE-6	Tool Calls Breakdown page loads with stat cards and MCP Calls by Project table, time range filter works	85:22
ROUTE-7	SigNoz health API returns proper structured response with status and configured fields	83:03
ROUTE-8	Batch query endpoint uses paginationPayload and detailPayloadTemplate with chQueries format	91:11
ROUTE-10	Copy Trace dropdown provides summarized and full trace options, both work without errors	97:57
EDGE-3	Missing conversationId param returns 400 with expected error message	100:56
EDGE-5	Legacy builderQueries payload rejected with 400 - Zod schema validation failed as expected	100:57
EDGE-6	Default tenantId fallback works correctly when not provided in query params	100:58
EDGE-7	Conversation with all activity types renders correctly with proper icons and styling	104:11
EDGE-8	Search filter and span filters work correctly with chQueries format	91:31
EDGE-9	Large 182-day time range query completes without timeout errors	93:39
EDGE-10	getBool helper correctly handles truthy/falsy values, error activities show ERROR status	104:27
ADV-1	Query without tenant_id rejected with 400 - tenant isolation enforcement working	105:48
ADV-2	Spoofed tenant_id in variables is overwritten by server - anti-spoofing working	107:27
ADV-4	SQL injection prevention confirmed - pagination works correctly without SQL errors	107:25
ADV-5	Payload with builderQueries and chQueries - builderQueries ignored, chQueries validated correctly	107:29
ADV-6	Span detail endpoint always includes tenant filter in SQL query - verified in source code	104:46
MANUAL-1	All 11 enforceQuerySecurity unit tests passed	108:53
MANUAL-2	pnpm typecheck passed - 14 tasks successful, 0 failures	109:36
MANUAL-3	Conversation API response format verified with activities array and metadata fields	105:12
MANUAL-4	tenant_id in comment passed string check validation - known defense-in-depth limitation documented	107:30
MANUAL-5	Performance metrics verified - Stats page API calls average 62ms, batch mode 37ms	111:27

❌ Failed (2)

Test Case	Summary	Timestamp	Screenshot
EDGE-2	Non-existent span returns 500 instead of expected 404	101:30
EDGE-4	SigNoz unavailable returns 500 instead of expected 503	102:12

Non-existent span returns 404 – Failed

Where: Span detail API endpoint /api/signoz/spans/[spanId]
Steps to reproduce:
1. Call GET /api/signoz/spans/nonexistent-span-id?conversationId=test-conv&tenantId=default
2. Observe the HTTP response status code
What failed: Expected 404 "Span not found" but received 500 Internal Server Error. When SigNoz is unreachable or returns an error, the endpoint catches all errors and returns 500 without distinguishing between "span not found" and "service unavailable" scenarios.
Code analysis: The span detail endpoint at agents-manage-ui/src/app/api/signoz/spans/[spanId]/route.ts has proper 404 handling when the query succeeds but returns no results (lines 114-121). However, the catch block at lines 143-145 returns 500 for ALL errors without distinguishing connection errors (which should be 503) from other failures.

Relevant code:

agents-manage-ui/src/app/api/signoz/spans/[spanId]/route.ts (lines 113-146)

  const series = result?.series;

  if (!series || series.length === 0) {
    return NextResponse.json({ error: 'Span not found' }, { status: 404 });
  }
  // ... span parsing code ...

} catch (error) {
  const errorMessage = error instanceof Error ? error.message : 'Failed to fetch span details';
  return NextResponse.json({ error: errorMessage }, { status: 500 });
}

Why this is likely a bug: The catch block does not differentiate between connection errors (ECONNREFUSED should return 503), authentication errors (should return 502), and actual failures. The conversation endpoint at conversations/[conversationId]/route.ts:1151-1177 correctly handles these cases with proper status codes, but the spans endpoint does not follow the same pattern.
Introduced by this PR: No – pre-existing bug (the spans endpoint error handling was not modified in this PR, only the tenant_id filter was added at line 73)
Timestamp: 101:30

SigNoz unavailable returns 503 not 500 – Failed

Where: SigNoz proxy endpoint /api/signoz (POST)
Steps to reproduce:
1. Ensure SigNoz is not running (port 3080 unreachable)
2. Navigate to /default/stats and observe network requests
3. Check the HTTP status codes returned by POST /api/signoz requests
What failed: Expected 503 Service Unavailable when SigNoz is unreachable, but received 500 Internal Server Error. The error handling does not distinguish between a downstream service being unreachable and an internal application error.
Code analysis: The agents-api correctly returns 503 for ECONNREFUSED errors (see agents-api/src/domains/manage/routes/signoz.ts:103-111). However, when SigNoz returns 401/403 authentication errors, the agents-api returns 500 (lines 113-122), and the manage-ui proxy at agents-manage-ui/src/app/api/signoz/route.ts:68-86 passes through the 500 status without mapping it to the appropriate service unavailability code.

Relevant code:

agents-api/src/domains/manage/routes/signoz.ts (lines 101-143)

} catch (error) {
  if (axios.isAxiosError(error)) {
    if (error.code === 'ECONNREFUSED' || error.code === 'ENOTFOUND') {
      logger.error({ error: error.message }, 'SigNoz service unavailable');
      return c.json(
        { error: 'Service Unavailable', message: 'SigNoz service is unavailable' },
        503
      );
    }
    if (error.response?.status === 401 || error.response?.status === 403) {
      logger.error({ status: error.response.status }, 'SigNoz authentication failed');
      return c.json(
        { error: 'Internal Server Error', message: 'SigNoz authentication failed' },
        500  // Should be 502 Bad Gateway for upstream auth failure
      );
    }
  }
  // Falls through to 500 for all other errors

agents-manage-ui/src/app/api/signoz/route.ts (lines 68-86)

function handleProxyError(error: unknown, logger: ReturnType<typeof getLogger>) {
  // ...
  if (axios.isAxiosError(error)) {
    const status = error.response?.status || 500;
    const message = error.response?.data?.message || error.message;
    return NextResponse.json({ error: 'Failed to query SigNoz', details: message }, { status });
  }
  // Does not distinguish service unavailability from internal errors

Why this is likely a bug: When SigNoz authentication fails (401/403), the code returns 500 "Internal Server Error" which is semantically incorrect. An authentication failure with an upstream service should return 502 Bad Gateway. Additionally, the manage-ui proxy doesn't handle ECONNREFUSED and other connection errors to map them to 503, relying on the agents-api to do so. The inconsistent status codes make it harder for clients to understand whether the error is recoverable.
Introduced by this PR: No – pre-existing bug (the error handling logic in signoz.ts and route.ts was not changed in this PR)
Timestamp: 102:12

📋 View Recording

Screen Recording

vercel Bot deployed to Preview – agents-api March 1, 2026 22:15 View deployment

vercel Bot had a problem deploying to Preview – agents-manage-ui March 1, 2026 22:16 Failure

This comment was marked as outdated.

Sign in to view

github-actions Bot deleted a comment from claude Bot Mar 1, 2026

vercel Bot deployed to Preview – agents-api March 2, 2026 15:15 View deployment

vercel Bot deployed to Preview – agents-docs March 2, 2026 15:15 View deployment

vercel Bot deployed to Preview – agents-manage-ui March 2, 2026 15:17 View deployment

claude Bot reviewed Mar 2, 2026

View reviewed changes

github-actions Bot deleted a comment from claude Bot Mar 2, 2026

shagun-singh-inkeep added 2 commits March 2, 2026 11:24

trace api to clickhouse

a109424

fixes

ee2c3b0

shagun-singh-inkeep force-pushed the clickhouse branch from 6c90338 to ee2c3b0 Compare March 2, 2026 16:27

vercel Bot deployed to Preview – agents-api March 2, 2026 16:28 View deployment

vercel Bot deployed to Preview – agents-manage-ui March 2, 2026 16:29 View deployment

vercel Bot deployed to Preview – agents-docs March 2, 2026 16:29 View deployment

claude Bot reviewed Mar 2, 2026

View reviewed changes

github-actions Bot deleted a comment from claude Bot Mar 2, 2026

shagun-singh-inkeep marked this pull request as draft March 2, 2026 21:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trace api to clickhouse#2465

trace api to clickhouse#2465
shagun-singh-inkeep wants to merge 2 commits intomainfrom
clickhouse

shagun-singh-inkeep commented Mar 1, 2026

Uh oh!

changeset-bot Bot commented Mar 1, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Mar 1, 2026 •

edited

Loading

Uh oh!

pullfrog Bot commented Mar 1, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

itoqa Bot commented Mar 1, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot Mar 2, 2026

Uh oh!

itoqa Bot commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shagun-singh-inkeep commented Mar 1, 2026

Uh oh!

changeset-bot Bot commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

vercel Bot commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pullfrog Bot commented Mar 1, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

itoqa Bot commented Mar 1, 2026

Ito Test Report ❌

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

PR Review Summary

✅ Addressed from Prior Review

💭 Consider (2) 💭

🧹 While You're Here (1) 🧹

🕐 Pending Recommendations (4)

✅ APPROVE

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

PR Review Summary

✅ Addressed from Prior Reviews

🟡 Minor (1) 🟡

💭 Consider (2) 💭

🧹 While You're Here (1) 🧹

🕐 Pending Recommendations (4)

💡 APPROVE WITH SUGGESTIONS

Uh oh!

claude Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

itoqa Bot commented Mar 2, 2026

Ito Test Report ❌

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented Mar 1, 2026 •

edited

Loading

vercel Bot commented Mar 1, 2026 •

edited

Loading