trace api to clickhouse#2465
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Replaces the SigNoz Builder Query API with direct ClickHouse SQL to fetch all conversation trace spans in a single query, then filters client-side — eliminating ~1,000 lines of verbose per-span-type builder query definitions.
|
Ito Test Report ❌25 test cases ran. 22 passed, 3 failed. This verification run tested the SigNoz query migration from builder format to raw ClickHouse SQL (PR #2465). Core functionality works well: traces overview pages load correctly, conversation detail pages render, batch queries return paginated data, and tenant/project isolation via variable substitution is enforced. However, three issues were identified: a timestamp format regression in the timeline component, and a missing schema validation that allows the deprecated builder-format queries to bypass the new ClickHouse SQL security model. ✅ Passed (22)❌ Failed (3)ROUTE-6 – Failed
EDGE-3 – Failed
ADV-3 – Failed
📋 View Recording |
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Low
Re-review scoped to delta since
5e5c134c9d80— Prior critical issues have been addressed.
✅ Addressed from Prior Review
The author has successfully addressed all critical and major issues from the prior review:
| Issue | Fix |
|---|---|
🔴 Tenant isolation bypass (missing {{.tenant_id}} validation) |
✅ enforceQuerySecurity now rejects queries missing {{.tenant_id}} placeholder |
| 🔴 Conversation detail SQL missing tenant filter | ✅ Added WHERE attributes_string['tenant.id'] = {{.tenant_id}} |
| 🔴 Span detail SQL missing tenant filter | ✅ Added tenant filter to span query |
🔴 toISOString() missing on timeline dateTime |
✅ Fixed — new Date().toISOString() |
🟠 No tests for enforceQuerySecurity |
✅ 12 comprehensive tests added covering anti-spoofing, validation, edge cases |
🟡 replace vs replaceAll for placeholders |
✅ Changed to replaceAll |
💭 Consider (2) 💭
💭 1) signozHelpers.ts String inclusion check for tenant filter is syntactically bypassable
Issue: The validation checks if the query string contains {{.tenant_id}} using a simple query.includes(). A technically malicious query could include the literal string in a comment or string literal (e.g., SELECT * FROM t WHERE 1=1 -- {{.tenant_id}}) and pass validation without actually filtering.
Why: Currently low risk since queries are server-constructed, not client-provided. But as a defense-in-depth measure, the validation could be strengthened.
Fix: Consider a regex that validates {{.tenant_id}} appears in a WHERE clause context, e.g., /WHERE[^;]*\{\{\.tenant_id\}\}/i. Or document that queries must always be server-constructed.
Refs: signozHelpers.ts:26
💭 2) signozHelpers.test.ts Missing test for non-string query values
Issue: The test suite doesn't cover the behavior when a chQuery entry has a non-string query value (e.g., null, object, number). The current implementation silently accepts these (the typeof query === 'string' check skips non-strings).
Why: Documents expected behavior and prevents regressions if the validation logic changes.
Fix: Add a test documenting that non-string query values are silently accepted:
it('skips validation for non-string query values', () => {
const payload = {
compositeQuery: {
chQueries: {
A: { query: null },
B: { query: { nested: 'object' } },
},
},
};
const result = enforceQuerySecurity(payload, 'tenant-1');
expect(result.error).toBeUndefined();
});Refs: signozHelpers.test.ts
🧹 While You're Here (1) 🧹
🧹 1) conversations/[conversationId]/route.ts Direct SigNoz call path accepts untrusted tenant_id from query params
Issue: When shouldCallSigNozDirectly() returns true (no cookies but SIGNOZ_URL and SIGNOZ_API_KEY env vars set), the tenantId is taken directly from the query string (url.searchParams.get('tenantId') || 'default'). This code path bypasses the authenticated API proxy and doesn't validate that the caller is authorized for the requested tenant.
Why: An attacker could call /api/signoz/conversations/{id}?tenantId=victim-tenant without cookies to access another tenant's trace data, assuming the server has direct SigNoz access configured.
Fix: Either:
- Remove the direct SigNoz call path entirely (always go through authenticated API)
- Require a service-to-service auth token when cookies are missing
- Validate tenant access before the direct call
Refs: route.ts:59-61, route.ts:241
🕐 Pending Recommendations (4)
These items were raised in the prior review and remain relevant but are outside the delta scope:
- 🟠
signoz-stats.tsSilent error swallowing — catch blocks return empty data without surfacing errors to UI - 🟠
signoz-stats.ts:1203Unbounded 100,000 row pagination limit risks memory exhaustion - 🟠
signoz-stats.tsMissing 429 rate limit handling in axios retry config - 🟡
signoz-stats.ts:105Duplicatedesc()SQL escape function
✅ APPROVE
Summary: The critical security issues from the prior review have been addressed effectively. The enforceQuerySecurity function now validates that all ClickHouse queries include the {{.tenant_id}} placeholder and returns clear error messages when they don't. The new test suite provides good coverage of the security-critical validation logic. The remaining items are either minor considerations or pre-existing issues outside the delta scope. Ship it! 🚀
Discarded (3)
| Location | Issue | Reason Discarded |
|---|---|---|
signozHelpers.test.ts |
Missing test for absent compositeQuery | Code handles this correctly via optional chaining; low risk edge case |
signozHelpers.test.ts |
Missing test for empty chQueries object | Code handles this correctly; empty object means no queries to validate |
signoz-stats.ts |
Dead constants after migration | Outside delta scope; pre-existing |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-security-iam |
2 | 0 | 1 | 1 | 0 | 0 | 0 |
pr-review-tests |
3 | 0 | 1 | 0 | 0 | 0 | 2 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 5 | 0 | 2 | 1 | 0 | 0 | 2 |
Note: Delta review — reviewers scoped to changes since 5e5c134c9d80. Standards reviewer found no issues in the delta code quality.
6c90338 to
ee2c3b0
Compare
There was a problem hiding this comment.
PR Review Summary
(3) Total Issues | Risk: Low
This is a full scope review after the
pull_request:synchronizeevent. Prior critical issues (tenant isolation,toISOString()fix, missing tests for security function) have been successfully addressed.
✅ Addressed from Prior Reviews
The author has successfully addressed all critical and major issues from prior reviews:
| Issue | Status |
|---|---|
🔴 Tenant isolation bypass (missing {{.tenant_id}} validation) |
✅ enforceQuerySecurity now rejects queries missing tenant filter |
| 🔴 Conversation detail SQL missing tenant filter | ✅ Added WHERE attributes_string['tenant.id'] = {{.tenant_id}} |
🔴 toISOString() missing on timeline dateTime |
✅ Fixed — new Date().toISOString() at line 228 |
🟠 No tests for enforceQuerySecurity |
✅ 12 comprehensive tests added |
🟡 replace vs replaceAll for placeholders |
✅ Changed to replaceAll |
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
conversations/[conversationId]/route.ts:218Hardcoded LIMIT 10000 without truncation warning
💭 Consider (2) 💭
💭 1) signoz.ts Backend lacks retry logic for SigNoz calls
Issue: The backend signoz.ts proxy makes direct axios calls to SigNoz without retry configuration (lines 90, 218, 243). While the frontend has axios-retry configured, transient failures from the backend will immediately fail.
Why: Given that the PR description mentions fixing 500/502 errors, adding retry logic here would improve resilience against transient SigNoz failures.
Fix: Configure axios-retry on the agents-api axios instance for SigNoz calls:
import axiosRetry from 'axios-retry';
axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });Refs: signoz.ts:90
💭 2) signozHelpers.test.ts Missing test for injectConversationIdFilter
Issue: The injectConversationIdFilter() function (signoz.ts lines 279-292) performs SQL placeholder replacement with conversation IDs but lacks test coverage.
Why: While IDs are server-generated and the escaping logic appears correct, tests would document expected behavior and catch regressions if the replacement logic changes.
Fix: Add unit tests verifying:
- Normal case:
["conv-1", "conv-2"]→'conv-1','conv-2' - Edge case: IDs with single quotes are properly escaped
- Edge case: Empty array behavior
Refs: signoz.ts:279-292
🧹 While You're Here (1) 🧹
🧹 1) conversations/[conversationId]/route.ts Direct SigNoz call path accepts untrusted tenant_id
Issue: When shouldCallSigNozDirectly() returns true (no cookies but SIGNOZ_URL and SIGNOZ_API_KEY env vars set), the tenantId is sourced from the URL query parameter (line 243). This code path bypasses the authenticated API proxy.
Why: This is a pre-existing pattern that was not introduced by this PR, but it's worth noting: an attacker could potentially call /api/signoz/conversations/{id}?tenantId=victim-tenant without cookies to query another tenant's traces, assuming the server has direct SigNoz access configured.
Fix: Either:
- Remove the direct SigNoz call path entirely (always go through authenticated API)
- Require a service-to-service auth token when cookies are missing
- Document this as an intentional internal-only code path
Refs: route.ts:59-61, route.ts:243
🕐 Pending Recommendations (4)
These items were raised in prior reviews and remain relevant:
- 🟠
signoz-stats.tsSilent error swallowing — catch blocks return empty data without surfacing errors to UI - 🟠
signoz-stats.tsMissing 429 rate limit handling in axios retry config - 🟠
signoz-stats.ts:1203Unbounded 100,000 row pagination limit risks memory exhaustion - 🟡
signoz-stats.ts:105Duplicatedesc()SQL escape function
💡 APPROVE WITH SUGGESTIONS
Summary: This PR successfully migrates the SigNoz trace API from builder-query format to direct ClickHouse SQL, achieving a net reduction of ~2,500 lines while maintaining robust security controls. The critical tenant isolation issues from prior reviews have been properly addressed with enforceQuerySecurity() now validating tenant filters and the comprehensive test suite. The remaining items are either minor improvements, pre-existing patterns worth addressing separately, or pending recommendations from prior reviews. Ship it! 🚀
Discarded (8)
| Location | Issue | Reason Discarded |
|---|---|---|
signoz-stats.ts |
No test coverage for 1857-line file | Pre-existing code; tests would be valuable but out of PR scope |
conversations/route.ts |
No test coverage for 1180-line route | Significant but out of scope for this migration PR |
signoz-stats.ts |
Silent error swallowing | Already in Pending Recommendations from prior review |
signoz-stats.ts |
429 rate limit handling | Already in Pending Recommendations from prior review |
signoz-stats.ts:1203 |
100,000 row limit | Already in Pending Recommendations from prior review |
signoz-stats.ts:105 |
Duplicated esc() function | Already in Pending Recommendations from prior review |
conversations/route.ts |
Cross-tenant data exposure via direct SigNoz | Pre-existing pattern, moved to While You're Here |
spans/route.ts |
Cross-tenant data exposure via direct SigNoz | Same pre-existing pattern as above |
Reviewers (6)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-security-iam |
2 | 0 | 0 | 1 | 0 | 0 | 1 |
pr-review-sre |
5 | 0 | 1 | 0 | 1 | 3 | 0 |
pr-review-tests |
5 | 0 | 1 | 0 | 0 | 0 | 4 |
pr-review-errors |
1 | 0 | 0 | 0 | 0 | 1 | 0 |
pr-review-consistency |
6 | 0 | 0 | 0 | 0 | 0 | 6 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 19 | 0 | 2 | 1 | 1 | 4 | 11 |
Note: Consistency reviewer returned 6 INFO-level findings confirming patterns are correct (no issues). Standards reviewer found no issues.
| AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} | ||
| AND ts_bucket_start BETWEEN {{.start_timestamp}} - 1800 AND {{.end_timestamp}} | ||
| ORDER BY timestamp ASC | ||
| LIMIT 10000 |
There was a problem hiding this comment.
🟡 Minor: Hardcoded LIMIT without truncation warning
Issue: The query uses LIMIT 10000 but there's no indication to callers when results are truncated.
Why: Conversations with heavy tool usage could exceed 10,000 spans. When truncated, the timeline silently shows incomplete data with no warning.
Fix: Consider logging a warning when the limit is reached:
// After line 259
if (allSpans.length === 10000) {
logger.warn({ conversationId }, 'Conversation spans truncated at 10000 limit');
}Refs:
Ito Test Report ❌28 test cases ran. 26 passed, 2 failed. This test run verified the SigNoz Trace API migration from builder-based queries to ClickHouse SQL queries (PR #2465). The majority of tests passed successfully, confirming that the new ClickHouse SQL query format works correctly for stats dashboards, traces overview, conversation detail timeline, batch query mode, search/filter functionality, and security enforcement (tenant isolation, anti-spoofing). Two edge case tests revealed error handling issues in the span detail endpoint and SigNoz proxy that return incorrect HTTP status codes. ✅ Passed (26)❌ Failed (2)
Non-existent span returns 404 – Failed
SigNoz unavailable returns 503 not 500 – Failed
📋 View Recording |
*Better performance for large datasets