Skip to content

trace api to clickhouse#2465

Draft
shagun-singh-inkeep wants to merge 2 commits intomainfrom
clickhouse
Draft

trace api to clickhouse#2465
shagun-singh-inkeep wants to merge 2 commits intomainfrom
clickhouse

Conversation

@shagun-singh-inkeep
Copy link
Copy Markdown
Collaborator

  • removes 500/502 errors
  • improve performance speed by a lot
    *Better performance for large datasets
  • Single query for all data

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Mar 1, 2026

⚠️ No Changeset found

Latest commit: ee2c3b0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Mar 2, 2026 4:29pm
agents-docs Ready Ready Preview, Comment Mar 2, 2026 4:29pm
agents-manage-ui Ready Ready Preview, Comment Mar 2, 2026 4:29pm

Request Review

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog Bot commented Mar 1, 2026

Replaces the SigNoz Builder Query API with direct ClickHouse SQL to fetch all conversation trace spans in a single query, then filters client-side — eliminating ~1,000 lines of verbose per-span-type builder query definitions.

  • agents-manage-ui/src/app/api/signoz/conversations/[conversationId]/route.ts — replaces 15+ individual builderQueries with one clickhouse_sql query against signoz_traces.distributed_signoz_index_v3; parses the flat series[].labels response (with attrs_str/attrs_num/attrs_bool JSON columns) into SpanRow objects; filters span types client-side via .filter() instead of server-side query separation
  • agents-api/src/utils/signozHelpers.tsenforceSecurityFilters (builder filter injection) replaced by enforceQuerySecurity which sets variables.tenant_id / variables.project_id for SigNoz's {{.var}} substitution
  • agents-api/src/domains/manage/routes/signoz.ts — updated callsites and injectConversationIdFilter to operate on chQueries with __CONVERSATION_IDS__ placeholder replacement instead of builder filter manipulation
  • agents-manage-ui/src/app/api/signoz/route.ts — validation schema updated from builderQueries to chQueries
  • agents-manage-ui/src/components/traces/timeline/timeline-item.tsx — minor: drops .toISOString() from datetime formatting
  • agents-manage-ui/src/lib/api/signoz-stats.ts — no functional changes

Pullfrog  | View workflow run | Using Claude Code | Triggered by Pullfrogpullfrog.com𝕏

pullfrog[bot]

This comment was marked as outdated.

claude[bot]

This comment was marked as outdated.

@github-actions github-actions Bot deleted a comment from claude Bot Mar 1, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 1, 2026

Ito Test Report ❌

25 test cases ran. 22 passed, 3 failed.

This verification run tested the SigNoz query migration from builder format to raw ClickHouse SQL (PR #2465). Core functionality works well: traces overview pages load correctly, conversation detail pages render, batch queries return paginated data, and tenant/project isolation via variable substitution is enforced. However, three issues were identified: a timestamp format regression in the timeline component, and a missing schema validation that allows the deprecated builder-format queries to bypass the new ClickHouse SQL security model.

✅ Passed (22)
Test Case Summary Timestamp Screenshot
ROUTE-1 Traces overview page loads with conversations per day chart, stats cards, filters, and empty conversation list 2:51 ROUTE-1_2-51.png
ROUTE-2 Conversation detail page loads with heading, metrics cards, and activity timeline 15:28 ROUTE-2_15-28.png
ROUTE-3 Batch query endpoint returns paginationResponse with empty series and detailResponse null 12:48 ROUTE-3_12-48.png
ROUTE-4 AI calls, tool calls, and org-level statistics pages render correctly with zero-data states 19:31 ROUTE-4_19-31.png
ROUTE-5 SigNoz health check returns HTTP 200 with status ok, confirming proxy chain works 22:13 ROUTE-5_22-13.png
EDGE-1 Tenant isolation verified: SQL queries use {{.tenant_id}} variable substitution via baseWhere() 23:27 EDGE-1_23-27.png
EDGE-2 Empty/nonexistent conversation handled gracefully with empty activities array and zero token counts 17:10 EDGE-2_17-10.png
EDGE-4 Batch query with zero matching conversations returns empty series and null detailResponse without errors 12:49 EDGE-4_12-49.png
EDGE-5 Large conversation query capped at LIMIT 10000 spans, page renders without performance issues 17:29 EDGE-5_17-29.png
EDGE-6 Malformed timestamps handled gracefully with parseInt fallback to epoch and Number.isNaN() guards 17:47 EDGE-6_17-47.png
EDGE-7 Conversation ID injection properly escapes single quotes via doubling for ClickHouse SQL 28:12 EDGE-7_28-12.png
EDGE-8 Search and filter functionality works with ClickHouse SQL queries, filters apply correctly 51:05 EDGE-8_51-05.png
ADV-1 Tenant spoofing prevented: enforceQuerySecurity overrides client-provided tenant_id 24:11 ADV-1_24-11.png
ADV-2 Project spoofing prevented: enforceQuerySecurity overrides client-provided project_id 24:29 ADV-2_24-29.png
ADV-4 SQL injection via conversation ID mitigated by server-derived IDs and quote escaping 30:40 ADV-4_30-40.png
ADV-5 Variable overwrite for tenant_id prevented by enforceQuerySecurity (time boundary variables pass through as expected) 57:00 ADV-5_57-00.png
ADV-6 Batch template manipulation: enforceQuerySecurity applied, detailResponse null when no matching IDs 57:20 ADV-6_57-20.png
ADV-7 Unbounded time range accepted (no max range enforcement), 30-second timeout provides backstop 57:40 ADV-7_57-40.png
LOGIC-1 getBool helper correctly parses boolean span attributes (true, 'true', 1) 18:04 LOGIC-1_18-04.png
LOGIC-2 Token usage stats page displays correct zero aggregates with breakdown sections 20:49 LOGIC-2_20-49.png
LOGIC-3 Parent-child activity hierarchy resolves correctly with tree building and indented rendering 18:34 LOGIC-3_18-34.png
LOGIC-4 Conversations per day chart renders with correct date labels and updates on time range filter changes 21:30 LOGIC-4_21-30.png
❌ Failed (3)
Test Case Summary Timestamp Screenshot
ROUTE-6 Timeline dateTime attribute regression: uses Date object instead of ISO string 16:35 ROUTE-6_16-35.png
EDGE-3 Schema validation missing: old builderQueries format accepted instead of rejected 25:07 EDGE-3_25-07.png
ADV-3 Builder-format bypass: no Zod validation enforces chQueries-only format, security risk 25:27 ADV-3_25-27.png
ROUTE-6 – Failed
  • Where: Timeline item component in conversation detail view

  • Steps to reproduce:

    1. Navigate to a conversation detail page at /{tenantId}/projects/{projectId}/traces/conversations/{conversationId}
    2. Inspect any timeline item's <time> element
    3. Check the dateTime attribute value
  • What failed: The <time dateTime={isoDateTime}> attribute contains a non-ISO-8601 string (e.g., 'Sun Mar 01 2026 22:30:00 GMT+0000') instead of ISO format (e.g., '2026-03-01T22:30:00.000Z'). This breaks accessibility and SEO machine-readability requirements for the dateTime attribute.

  • Code analysis: Examined timeline-item.tsx where the timestamp is formatted. Line 228 creates a Date object that is passed directly to the dateTime attribute, which React converts to a non-ISO string via toString().

  • Relevant code:

    agents-manage-ui/src/components/traces/timeline/timeline-item.tsx (lines 227–228)

    const formattedDateTime = formatDateTime(activity.timestamp, { local: true });
    const isoDateTime = new Date(activity.timestamp);

    agents-manage-ui/src/components/traces/timeline/timeline-item.tsx (lines 600–606)

    <time
      className="text-xs mb-2 inline-block text-gray-500 dark:text-white/50"
      dateTime={isoDateTime}
      title={formattedDateTime}
    >
      {formattedDateTime}
    </time>
  • Why this is likely a bug: The PR changed line 228 from .toISOString() to just new Date(). The HTML <time> element's dateTime attribute must contain a machine-readable ISO 8601 string for accessibility tools and search engines to parse correctly. Passing a Date object causes React to call .toString() which produces a locale-dependent, non-ISO format.

  • Introduced by this PR: Yes – this PR modified the relevant code (changed new Date(activity.timestamp).toISOString() to new Date(activity.timestamp))

  • Timestamp: 16:35

EDGE-3 – Failed
  • Where: SigNoz proxy route in agents-manage-ui

  • Steps to reproduce:

    1. Send a POST request to /api/signoz?tenantId=default with a payload containing compositeQuery.queryType: 'builder' and compositeQuery.builderQueries instead of chQueries
    2. Observe the response status code
  • What failed: The old builderQueries format was accepted with HTTP 200 instead of being rejected with HTTP 400. The Zod schema in route.ts requires chQueries, but this validation only applies at the Next.js layer. The agents-api signoz.ts route has no Zod validation and passes any payload directly to SigNoz.

  • Code analysis: Examined agents-manage-ui/src/app/api/signoz/route.ts (Zod schema at lines 14-18) and agents-api/src/domains/manage/routes/signoz.ts (no schema validation). The Next.js route validates payloads, but requests can potentially reach agents-api through other paths or the schema can be bypassed.

  • Relevant code:

    agents-manage-ui/src/app/api/signoz/route.ts (lines 14–18)

    const compositeQuerySchema = z.object({
      queryType: z.string(),
      panelType: z.string(),
      chQueries: z.record(z.string(), z.any()),
    });

    agents-api/src/domains/manage/routes/signoz.ts (lines 14–16)

    app.post('/query', async (c) => {
      let payload = await c.req.json();
      const requestedProjectId = payload.projectId;
  • Why this is likely a bug: The PR migrated from builderQueries to chQueries format but did not add validation at the agents-api layer to reject the old format. This creates inconsistent behavior where the old format may still work depending on the request path.

  • Introduced by this PR: Yes – this PR introduced the chQueries migration but did not add schema enforcement at agents-api

  • Timestamp: 25:07

ADV-3 – Failed
  • Where: SigNoz proxy route in agents-api

  • Steps to reproduce:

    1. Send a POST request to /api/signoz?tenantId=default with compositeQuery.queryType: 'builder' and compositeQuery.builderQueries
    2. Observe that the request is accepted (200 OK)
  • What failed: Builder-format payloads are accepted without validation. While enforceQuerySecurity injects variables.tenant_id, builder-format queries in SigNoz do not use {{.variable}} substitution syntax. This means the injected tenant_id variable may have no effect on query scoping for builder-format queries, potentially allowing tenant isolation bypass.

  • Code analysis: Examined agents-api/src/utils/signozHelpers.ts which only injects variables. It does not validate the query format or reject builder queries. Builder queries use a different filtering mechanism (filters.items array) that the old enforceSecurityFilters function handled, but the new enforceQuerySecurity function does not.

  • Relevant code:

    agents-api/src/utils/signozHelpers.ts (lines 1–18)

    /**
     * Enforce tenant/project security on ClickHouse SQL queries by injecting
     * server-side variables. SQL queries use {{.tenant_id}} / {{.project_id}}
     * via SigNoz variable substitution, preventing SQL injection.
     *
     * Always overrides client-provided values to prevent spoofing.
     */
    export function enforceQuerySecurity(payload: any, tenantId: string, projectId?: string): any {
      const modifiedPayload = JSON.parse(JSON.stringify(payload));
      if (!modifiedPayload.variables) {
        modifiedPayload.variables = {};
      }
      modifiedPayload.variables.tenant_id = tenantId;
      if (projectId) {
        modifiedPayload.variables.project_id = projectId;
      }
      return modifiedPayload;
    }
  • Why this is likely a bug: The new enforceQuerySecurity function assumes all queries use ClickHouse SQL with {{.tenant_id}} variable substitution. Builder-format queries do not use this mechanism and are not filtered. An attacker could craft builder-format queries that bypass tenant isolation. Defense-in-depth via SigNoz API key scoping provides partial mitigation, but this is a security regression from the previous implementation which handled both formats.

  • Introduced by this PR: Yes – this PR removed the builder-format filter logic from enforceSecurityFilters and replaced it with enforceQuerySecurity which only handles ClickHouse SQL queries

  • Timestamp: 25:27

📋 View Recording

Screen Recording

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: Low

Re-review scoped to delta since 5e5c134c9d80 — Prior critical issues have been addressed.

✅ Addressed from Prior Review

The author has successfully addressed all critical and major issues from the prior review:

Issue Fix
🔴 Tenant isolation bypass (missing {{.tenant_id}} validation) enforceQuerySecurity now rejects queries missing {{.tenant_id}} placeholder
🔴 Conversation detail SQL missing tenant filter ✅ Added WHERE attributes_string['tenant.id'] = {{.tenant_id}}
🔴 Span detail SQL missing tenant filter ✅ Added tenant filter to span query
🔴 toISOString() missing on timeline dateTime ✅ Fixed — new Date().toISOString()
🟠 No tests for enforceQuerySecurity ✅ 12 comprehensive tests added covering anti-spoofing, validation, edge cases
🟡 replace vs replaceAll for placeholders ✅ Changed to replaceAll

💭 Consider (2) 💭

💭 1) signozHelpers.ts String inclusion check for tenant filter is syntactically bypassable

Issue: The validation checks if the query string contains {{.tenant_id}} using a simple query.includes(). A technically malicious query could include the literal string in a comment or string literal (e.g., SELECT * FROM t WHERE 1=1 -- {{.tenant_id}}) and pass validation without actually filtering.

Why: Currently low risk since queries are server-constructed, not client-provided. But as a defense-in-depth measure, the validation could be strengthened.

Fix: Consider a regex that validates {{.tenant_id}} appears in a WHERE clause context, e.g., /WHERE[^;]*\{\{\.tenant_id\}\}/i. Or document that queries must always be server-constructed.

Refs: signozHelpers.ts:26


💭 2) signozHelpers.test.ts Missing test for non-string query values

Issue: The test suite doesn't cover the behavior when a chQuery entry has a non-string query value (e.g., null, object, number). The current implementation silently accepts these (the typeof query === 'string' check skips non-strings).

Why: Documents expected behavior and prevents regressions if the validation logic changes.

Fix: Add a test documenting that non-string query values are silently accepted:

it('skips validation for non-string query values', () => {
  const payload = {
    compositeQuery: {
      chQueries: {
        A: { query: null },
        B: { query: { nested: 'object' } },
      },
    },
  };
  const result = enforceQuerySecurity(payload, 'tenant-1');
  expect(result.error).toBeUndefined();
});

Refs: signozHelpers.test.ts

🧹 While You're Here (1) 🧹

🧹 1) conversations/[conversationId]/route.ts Direct SigNoz call path accepts untrusted tenant_id from query params

Issue: When shouldCallSigNozDirectly() returns true (no cookies but SIGNOZ_URL and SIGNOZ_API_KEY env vars set), the tenantId is taken directly from the query string (url.searchParams.get('tenantId') || 'default'). This code path bypasses the authenticated API proxy and doesn't validate that the caller is authorized for the requested tenant.

Why: An attacker could call /api/signoz/conversations/{id}?tenantId=victim-tenant without cookies to access another tenant's trace data, assuming the server has direct SigNoz access configured.

Fix: Either:

  1. Remove the direct SigNoz call path entirely (always go through authenticated API)
  2. Require a service-to-service auth token when cookies are missing
  3. Validate tenant access before the direct call

Refs: route.ts:59-61, route.ts:241

🕐 Pending Recommendations (4)

These items were raised in the prior review and remain relevant but are outside the delta scope:


✅ APPROVE

Summary: The critical security issues from the prior review have been addressed effectively. The enforceQuerySecurity function now validates that all ClickHouse queries include the {{.tenant_id}} placeholder and returns clear error messages when they don't. The new test suite provides good coverage of the security-critical validation logic. The remaining items are either minor considerations or pre-existing issues outside the delta scope. Ship it! 🚀

Discarded (3)
Location Issue Reason Discarded
signozHelpers.test.ts Missing test for absent compositeQuery Code handles this correctly via optional chaining; low risk edge case
signozHelpers.test.ts Missing test for empty chQueries object Code handles this correctly; empty object means no queries to validate
signoz-stats.ts Dead constants after migration Outside delta scope; pre-existing
Reviewers (3)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-security-iam 2 0 1 1 0 0 0
pr-review-tests 3 0 1 0 0 0 2
pr-review-standards 0 0 0 0 0 0 0
Total 5 0 2 1 0 0 2

Note: Delta review — reviewers scoped to changes since 5e5c134c9d80. Standards reviewer found no issues in the delta code quality.

@github-actions github-actions Bot deleted a comment from claude Bot Mar 2, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(3) Total Issues | Risk: Low

This is a full scope review after the pull_request:synchronize event. Prior critical issues (tenant isolation, toISOString() fix, missing tests for security function) have been successfully addressed.

✅ Addressed from Prior Reviews

The author has successfully addressed all critical and major issues from prior reviews:

Issue Status
🔴 Tenant isolation bypass (missing {{.tenant_id}} validation) enforceQuerySecurity now rejects queries missing tenant filter
🔴 Conversation detail SQL missing tenant filter ✅ Added WHERE attributes_string['tenant.id'] = {{.tenant_id}}
🔴 toISOString() missing on timeline dateTime ✅ Fixed — new Date().toISOString() at line 228
🟠 No tests for enforceQuerySecurity ✅ 12 comprehensive tests added
🟡 replace vs replaceAll for placeholders ✅ Changed to replaceAll

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: conversations/[conversationId]/route.ts:218 Hardcoded LIMIT 10000 without truncation warning

💭 Consider (2) 💭

💭 1) signoz.ts Backend lacks retry logic for SigNoz calls

Issue: The backend signoz.ts proxy makes direct axios calls to SigNoz without retry configuration (lines 90, 218, 243). While the frontend has axios-retry configured, transient failures from the backend will immediately fail.

Why: Given that the PR description mentions fixing 500/502 errors, adding retry logic here would improve resilience against transient SigNoz failures.

Fix: Configure axios-retry on the agents-api axios instance for SigNoz calls:

import axiosRetry from 'axios-retry';
axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

Refs: signoz.ts:90


💭 2) signozHelpers.test.ts Missing test for injectConversationIdFilter

Issue: The injectConversationIdFilter() function (signoz.ts lines 279-292) performs SQL placeholder replacement with conversation IDs but lacks test coverage.

Why: While IDs are server-generated and the escaping logic appears correct, tests would document expected behavior and catch regressions if the replacement logic changes.

Fix: Add unit tests verifying:

  • Normal case: ["conv-1", "conv-2"]'conv-1','conv-2'
  • Edge case: IDs with single quotes are properly escaped
  • Edge case: Empty array behavior

Refs: signoz.ts:279-292

🧹 While You're Here (1) 🧹

🧹 1) conversations/[conversationId]/route.ts Direct SigNoz call path accepts untrusted tenant_id

Issue: When shouldCallSigNozDirectly() returns true (no cookies but SIGNOZ_URL and SIGNOZ_API_KEY env vars set), the tenantId is sourced from the URL query parameter (line 243). This code path bypasses the authenticated API proxy.

Why: This is a pre-existing pattern that was not introduced by this PR, but it's worth noting: an attacker could potentially call /api/signoz/conversations/{id}?tenantId=victim-tenant without cookies to query another tenant's traces, assuming the server has direct SigNoz access configured.

Fix: Either:

  1. Remove the direct SigNoz call path entirely (always go through authenticated API)
  2. Require a service-to-service auth token when cookies are missing
  3. Document this as an intentional internal-only code path

Refs: route.ts:59-61, route.ts:243

🕐 Pending Recommendations (4)

These items were raised in prior reviews and remain relevant:


💡 APPROVE WITH SUGGESTIONS

Summary: This PR successfully migrates the SigNoz trace API from builder-query format to direct ClickHouse SQL, achieving a net reduction of ~2,500 lines while maintaining robust security controls. The critical tenant isolation issues from prior reviews have been properly addressed with enforceQuerySecurity() now validating tenant filters and the comprehensive test suite. The remaining items are either minor improvements, pre-existing patterns worth addressing separately, or pending recommendations from prior reviews. Ship it! 🚀

Discarded (8)
Location Issue Reason Discarded
signoz-stats.ts No test coverage for 1857-line file Pre-existing code; tests would be valuable but out of PR scope
conversations/route.ts No test coverage for 1180-line route Significant but out of scope for this migration PR
signoz-stats.ts Silent error swallowing Already in Pending Recommendations from prior review
signoz-stats.ts 429 rate limit handling Already in Pending Recommendations from prior review
signoz-stats.ts:1203 100,000 row limit Already in Pending Recommendations from prior review
signoz-stats.ts:105 Duplicated esc() function Already in Pending Recommendations from prior review
conversations/route.ts Cross-tenant data exposure via direct SigNoz Pre-existing pattern, moved to While You're Here
spans/route.ts Cross-tenant data exposure via direct SigNoz Same pre-existing pattern as above
Reviewers (6)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-security-iam 2 0 0 1 0 0 1
pr-review-sre 5 0 1 0 1 3 0
pr-review-tests 5 0 1 0 0 0 4
pr-review-errors 1 0 0 0 0 1 0
pr-review-consistency 6 0 0 0 0 0 6
pr-review-standards 0 0 0 0 0 0 0
Total 19 0 2 1 1 4 11

Note: Consistency reviewer returned 6 INFO-level findings confirming patterns are correct (no issues). Standards reviewer found no issues.

AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
AND ts_bucket_start BETWEEN {{.start_timestamp}} - 1800 AND {{.end_timestamp}}
ORDER BY timestamp ASC
LIMIT 10000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Hardcoded LIMIT without truncation warning

Issue: The query uses LIMIT 10000 but there's no indication to callers when results are truncated.

Why: Conversations with heavy tool usage could exceed 10,000 spans. When truncated, the timeline silently shows incomplete data with no warning.

Fix: Consider logging a warning when the limit is reached:

// After line 259
if (allSpans.length === 10000) {
  logger.warn({ conversationId }, 'Conversation spans truncated at 10000 limit');
}

Refs:

@github-actions github-actions Bot deleted a comment from claude Bot Mar 2, 2026
@itoqa
Copy link
Copy Markdown

itoqa Bot commented Mar 2, 2026

Ito Test Report ❌

28 test cases ran. 26 passed, 2 failed.

This test run verified the SigNoz Trace API migration from builder-based queries to ClickHouse SQL queries (PR #2465). The majority of tests passed successfully, confirming that the new ClickHouse SQL query format works correctly for stats dashboards, traces overview, conversation detail timeline, batch query mode, search/filter functionality, and security enforcement (tenant isolation, anti-spoofing). Two edge case tests revealed error handling issues in the span detail endpoint and SigNoz proxy that return incorrect HTTP status codes.

✅ Passed (26)
Test Case Summary Timestamp Screenshot
ROUTE-1 Stats dashboard loads correctly with ClickHouse SQL queries, all API requests use chQueries format 85:47 ROUTE-1_85-47.png
ROUTE-2 Traces overview page loads with filter controls, search, stat cards, and time range filter works correctly 89:53 ROUTE-2_89-53.png
ROUTE-3 Conversation detail timeline renders with diverse activity types including tool calls, AI generations, user messages 95:16 ROUTE-3_95-16.png
ROUTE-4 Span detail endpoint includes tenant_id filter in SQL query (verified in source code at route.ts:73) 96:07 ROUTE-4_96-07.png
ROUTE-5 AI Calls Breakdown page loads with stat cards and breakdown tables, time range filter works 84:48 ROUTE-5_84-48.png
ROUTE-6 Tool Calls Breakdown page loads with stat cards and MCP Calls by Project table, time range filter works 85:22 ROUTE-6_85-22.png
ROUTE-7 SigNoz health API returns proper structured response with status and configured fields 83:03 ROUTE-7_83-03.png
ROUTE-8 Batch query endpoint uses paginationPayload and detailPayloadTemplate with chQueries format 91:11 ROUTE-8_91-11.png
ROUTE-10 Copy Trace dropdown provides summarized and full trace options, both work without errors 97:57 ROUTE-10_97-57.png
EDGE-3 Missing conversationId param returns 400 with expected error message 100:56 EDGE-3_100-56.png
EDGE-5 Legacy builderQueries payload rejected with 400 - Zod schema validation failed as expected 100:57 EDGE-5_100-57.png
EDGE-6 Default tenantId fallback works correctly when not provided in query params 100:58 EDGE-6_100-58.png
EDGE-7 Conversation with all activity types renders correctly with proper icons and styling 104:11 EDGE-7_104-11.png
EDGE-8 Search filter and span filters work correctly with chQueries format 91:31 EDGE-8_91-31.png
EDGE-9 Large 182-day time range query completes without timeout errors 93:39 EDGE-9_93-39.png
EDGE-10 getBool helper correctly handles truthy/falsy values, error activities show ERROR status 104:27 EDGE-10_104-27.png
ADV-1 Query without tenant_id rejected with 400 - tenant isolation enforcement working 105:48 ADV-1_105-48.png
ADV-2 Spoofed tenant_id in variables is overwritten by server - anti-spoofing working 107:27 ADV-2_107-27.png
ADV-4 SQL injection prevention confirmed - pagination works correctly without SQL errors 107:25 ADV-4_107-25.png
ADV-5 Payload with builderQueries and chQueries - builderQueries ignored, chQueries validated correctly 107:29 ADV-5_107-29.png
ADV-6 Span detail endpoint always includes tenant filter in SQL query - verified in source code 104:46 ADV-6_104-46.png
MANUAL-1 All 11 enforceQuerySecurity unit tests passed 108:53 MANUAL-1_108-53.png
MANUAL-2 pnpm typecheck passed - 14 tasks successful, 0 failures 109:36 MANUAL-2_109-36.png
MANUAL-3 Conversation API response format verified with activities array and metadata fields 105:12 MANUAL-3_105-12.png
MANUAL-4 tenant_id in comment passed string check validation - known defense-in-depth limitation documented 107:30 MANUAL-4_107-30.png
MANUAL-5 Performance metrics verified - Stats page API calls average 62ms, batch mode 37ms 111:27 MANUAL-5_111-27.png
❌ Failed (2)
Test Case Summary Timestamp Screenshot
EDGE-2 Non-existent span returns 500 instead of expected 404 101:30 EDGE-2_101-30.png
EDGE-4 SigNoz unavailable returns 500 instead of expected 503 102:12 EDGE-4_102-12.png
Non-existent span returns 404 – Failed
  • Where: Span detail API endpoint /api/signoz/spans/[spanId]

  • Steps to reproduce:

    1. Call GET /api/signoz/spans/nonexistent-span-id?conversationId=test-conv&tenantId=default
    2. Observe the HTTP response status code
  • What failed: Expected 404 "Span not found" but received 500 Internal Server Error. When SigNoz is unreachable or returns an error, the endpoint catches all errors and returns 500 without distinguishing between "span not found" and "service unavailable" scenarios.

  • Code analysis: The span detail endpoint at agents-manage-ui/src/app/api/signoz/spans/[spanId]/route.ts has proper 404 handling when the query succeeds but returns no results (lines 114-121). However, the catch block at lines 143-145 returns 500 for ALL errors without distinguishing connection errors (which should be 503) from other failures.

  • Relevant code:

    agents-manage-ui/src/app/api/signoz/spans/[spanId]/route.ts (lines 113-146)

      const series = result?.series;
    
      if (!series || series.length === 0) {
        return NextResponse.json({ error: 'Span not found' }, { status: 404 });
      }
      // ... span parsing code ...
    
    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : 'Failed to fetch span details';
      return NextResponse.json({ error: errorMessage }, { status: 500 });
    }
  • Why this is likely a bug: The catch block does not differentiate between connection errors (ECONNREFUSED should return 503), authentication errors (should return 502), and actual failures. The conversation endpoint at conversations/[conversationId]/route.ts:1151-1177 correctly handles these cases with proper status codes, but the spans endpoint does not follow the same pattern.

  • Introduced by this PR: No – pre-existing bug (the spans endpoint error handling was not modified in this PR, only the tenant_id filter was added at line 73)

  • Timestamp: 101:30

SigNoz unavailable returns 503 not 500 – Failed
  • Where: SigNoz proxy endpoint /api/signoz (POST)

  • Steps to reproduce:

    1. Ensure SigNoz is not running (port 3080 unreachable)
    2. Navigate to /default/stats and observe network requests
    3. Check the HTTP status codes returned by POST /api/signoz requests
  • What failed: Expected 503 Service Unavailable when SigNoz is unreachable, but received 500 Internal Server Error. The error handling does not distinguish between a downstream service being unreachable and an internal application error.

  • Code analysis: The agents-api correctly returns 503 for ECONNREFUSED errors (see agents-api/src/domains/manage/routes/signoz.ts:103-111). However, when SigNoz returns 401/403 authentication errors, the agents-api returns 500 (lines 113-122), and the manage-ui proxy at agents-manage-ui/src/app/api/signoz/route.ts:68-86 passes through the 500 status without mapping it to the appropriate service unavailability code.

  • Relevant code:

    agents-api/src/domains/manage/routes/signoz.ts (lines 101-143)

    } catch (error) {
      if (axios.isAxiosError(error)) {
        if (error.code === 'ECONNREFUSED' || error.code === 'ENOTFOUND') {
          logger.error({ error: error.message }, 'SigNoz service unavailable');
          return c.json(
            { error: 'Service Unavailable', message: 'SigNoz service is unavailable' },
            503
          );
        }
        if (error.response?.status === 401 || error.response?.status === 403) {
          logger.error({ status: error.response.status }, 'SigNoz authentication failed');
          return c.json(
            { error: 'Internal Server Error', message: 'SigNoz authentication failed' },
            500  // Should be 502 Bad Gateway for upstream auth failure
          );
        }
      }
      // Falls through to 500 for all other errors

    agents-manage-ui/src/app/api/signoz/route.ts (lines 68-86)

    function handleProxyError(error: unknown, logger: ReturnType<typeof getLogger>) {
      // ...
      if (axios.isAxiosError(error)) {
        const status = error.response?.status || 500;
        const message = error.response?.data?.message || error.message;
        return NextResponse.json({ error: 'Failed to query SigNoz', details: message }, { status });
      }
      // Does not distinguish service unavailability from internal errors
  • Why this is likely a bug: When SigNoz authentication fails (401/403), the code returns 500 "Internal Server Error" which is semantically incorrect. An authentication failure with an upstream service should return 502 Bad Gateway. Additionally, the manage-ui proxy doesn't handle ECONNREFUSED and other connection errors to map them to 503, relying on the agents-api to do so. The inconsistent status codes make it harder for clients to understand whether the error is recoverable.

  • Introduced by this PR: No – pre-existing bug (the error handling logic in signoz.ts and route.ts was not changed in this PR)

  • Timestamp: 102:12

📋 View Recording

Screen Recording

@shagun-singh-inkeep shagun-singh-inkeep marked this pull request as draft March 2, 2026 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant