log 500 signoz errors and reduce timerange query#2469
log 500 signoz errors and reduce timerange query#2469shagun-singh-inkeep merged 5 commits intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Narrows SigNoz trace queries to the conversation's actual time window (with 3-hour padding) instead of the previous 180-day lookback, reducing query load and eliminating 500 errors from overly broad time ranges.
|
There was a problem hiding this comment.
The backend changes look solid — the new /bounds endpoint follows existing authorization, schema, and error-handling patterns correctly, and the SigNoz logging improvements are well-guarded. The frontend utility has two potential bugs worth addressing around timestamp parsing and NaN propagation.
| const bounds = await makeManagementApiRequest<{ | ||
| data: { createdAt: string; updatedAt: string }; | ||
| }>(`tenants/${tenantId}/projects/${projectId}/conversations/${conversationId}/bounds`); | ||
| const createdMs = new Date(`${bounds.data.createdAt}Z`).getTime(); |
There was a problem hiding this comment.
Appending Z unconditionally is fragile. The runtime schema uses timestamp without withTimezone, so Drizzle returns strings like 2024-01-01 00:00:00 — appending Z works today. But if the column is ever changed to timestamptz, or if the driver behavior changes, the API could return 2024-01-01T00:00:00+00:00 and appending Z would produce an invalid string.
Safer approach:
const ensureUtc = (s: string) => {
if (/Z$|[+-]\d{2}:\d{2}$/.test(s)) return s;
return `${s}Z`;
};
const createdMs = new Date(ensureUtc(bounds.data.createdAt)).getTime();
const updatedMs = new Date(ensureUtc(bounds.data.updatedAt)).getTime();| if (startParam != null && endParam != null) { | ||
| return { start: Number(startParam), end: Number(endParam) }; |
There was a problem hiding this comment.
Number(startParam) on a non-numeric string (e.g. ?start=abc) returns NaN, which would propagate into the SigNoz query payload silently. Consider validating:
const start = Number(startParam);
const end = Number(endParam);
if (Number.isNaN(start) || Number.isNaN(end)) {
// fall through to bounds fetch
} else {
return { start, end };
}| } catch { | ||
| return { start: now - DEFAULT_LOOKBACK_MS, end: now }; |
There was a problem hiding this comment.
The silent catch swallows all errors — including auth failures or network timeouts — and falls back to a 180-day window. Consider logging a warning so debugging is possible:
} catch (err) {
console.warn('Failed to fetch conversation bounds, using default lookback', err);
return { start: now - DEFAULT_LOOKBACK_MS, end: now };
}| const { start, end } = await getConversationTimeRange({ | ||
| startParam: null, | ||
| endParam: null, | ||
| projectId, | ||
| tenantId, | ||
| conversationId, | ||
| }); |
There was a problem hiding this comment.
Hardcoding startParam: null, endParam: null means every span fetch now makes an extra round-trip to the bounds API (or falls back to 180-day lookback if projectId is missing). The old code used now - DEFAULT_LOOKBACK_MS directly with no network call. If the goal is to narrow the ClickHouse scan window for span lookups too, this trade-off is worth a brief comment explaining the intent.
🦋 Changeset detectedLatest commit: debf745 The changes in this PR will be included in the next version bump. This PR includes changesets to release 9 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
There was a problem hiding this comment.
PR Review Summary
(3) Total Issues | Risk: Low
🟠⚠️ Major (1) 🟠⚠️
🟠 1) signoz-conversation-time-range.ts Missing test coverage for new utility function
Issue: The getConversationTimeRange function has no unit tests despite containing date/time calculation logic with multiple code paths: (1) explicit start/end params, (2) no projectId fallback, (3) successful bounds fetch with padding calculations, and (4) API error fallback.
Why: Without tests, regressions in time range calculation could cause SigNoz queries to search incorrect time windows. The date parsing logic (appending Z to timestamps) and the Math.min/Math.max clamping could produce unexpected results with edge cases (NaN from invalid dates, inverted ranges if timestamps are in the future).
Fix: Add unit tests in agents-manage-ui/src/lib/api/__tests__/signoz-conversation-time-range.test.ts:
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { getConversationTimeRange, BOUNDS_PADDING_MS, DEFAULT_LOOKBACK_MS } from '../signoz-conversation-time-range';
vi.mock('../api-config', () => ({ makeManagementApiRequest: vi.fn() }));
describe('getConversationTimeRange', () => {
it('should return explicit params when both provided', async () => {
const result = await getConversationTimeRange({
startParam: '1000', endParam: '2000',
projectId: 'proj', tenantId: 'tenant', conversationId: 'conv'
});
expect(result).toEqual({ start: 1000, end: 2000 });
});
it('should fallback to default when projectId missing', async () => {
const now = Date.now();
const result = await getConversationTimeRange({
startParam: null, endParam: null,
projectId: undefined, tenantId: 'tenant', conversationId: 'conv'
});
expect(result.end).toBeCloseTo(now, -2);
expect(result.start).toBeCloseTo(now - DEFAULT_LOOKBACK_MS, -2);
});
it('should fallback on API error', async () => {
const { makeManagementApiRequest } = await import('../api-config');
vi.mocked(makeManagementApiRequest).mockRejectedValue(new Error('Network error'));
// ... verify fallback behavior
});
});Refs:
Inline Comments:
- 🟠 Major:
signoz-conversation-time-range.ts:33-35Silent error catch without logging
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
signoz-conversation-time-range.ts:28-29Timestamp parsing assumes format without timezone suffix
💭 Consider (1) 💭
Inline Comments:
- 💭 Consider:
signoz-conversation-time-range.ts:16-17Validate parsed numbers for NaN
💡 APPROVE WITH SUGGESTIONS
Summary: This PR is a solid improvement - the conversation bounds optimization will meaningfully reduce SigNoz query times, and the enhanced error logging will help debug 500 errors. The main gap is the lack of test coverage for the new getConversationTimeRange utility, which has multiple code paths and date parsing logic that would benefit from unit tests. The silent error catch should also be updated to log when the fallback is used for observability.
Discarded (5)
| Location | Issue | Reason Discarded |
|---|---|---|
nine-turtles-lay.md:6 |
Changeset message doesn't follow "Fix X" style | Nitpick - message is descriptive enough, not blocking |
signoz.ts:131-137 |
Error logging could include tenantId/projectId | Already an improvement over existing logging, additional context is nice-to-have |
api-config.ts:73 |
Global API requests lack timeout | Pre-existing issue, outside PR scope |
conversations.ts:98-140 |
Endpoint lacks route-level tests | Endpoint is simple and delegates to tested getConversation; acceptable for now |
spans/[spanId]/route.ts:147 |
Error handling loses context from bounds failure | The bounds function catches internally and falls back, so this is moot |
Reviewers (7)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
2 | 0 | 0 | 0 | 1 | 0 | 1 |
pr-review-consistency |
5 | 0 | 0 | 0 | 0 | 0 | 5 |
pr-review-errors |
2 | 0 | 0 | 0 | 1 | 0 | 1 |
pr-review-frontend |
1 | 0 | 0 | 0 | 0 | 0 | 1 |
pr-review-sre |
5 | 0 | 0 | 0 | 0 | 0 | 5 |
pr-review-tests |
2 | 1 | 0 | 0 | 0 | 0 | 1 |
pr-review-types |
3 | 0 | 1 | 0 | 0 | 0 | 2 |
| Total | 20 | 1 | 1 | 0 | 2 | 0 | 16 |
Note: High discard rate due to significant deduplication across reviewers (timestamp parsing, silent catch, NaN validation all raised by 3+ reviewers).
| } catch { | ||
| return { start: now - DEFAULT_LOOKBACK_MS, end: now }; | ||
| } |
There was a problem hiding this comment.
🟠 MAJOR: Silent error catch swallows all failures without logging
Issue: The catch block silently returns the fallback time range without any logging or visibility into what failed.
Why: When the bounds API call fails (network issues, auth errors, service down), users will silently get the slower 180-day time range query. This makes it impossible to detect systemic issues - if the bounds endpoint is consistently failing, all trace queries degrade to the expensive path with zero visibility.
Fix: Add logging to track when fallback is used:
} catch (error) {
console.warn('[getConversationTimeRange] Failed to fetch bounds, using fallback', {
conversationId,
projectId,
tenantId,
error: error instanceof Error ? error.message : String(error),
});
return { start: now - DEFAULT_LOOKBACK_MS, end: now };
}Refs:
| const createdMs = new Date(`${bounds.data.createdAt}Z`).getTime(); | ||
| const updatedMs = new Date(`${bounds.data.updatedAt}Z`).getTime(); |
There was a problem hiding this comment.
🟡 Minor: Timestamp parsing assumes format without timezone suffix
Issue: Appending Z to timestamps assumes the API returns timestamps without timezone info. If the backend ever returns timestamps with timezone (e.g., 2024-01-01T00:00:00.000Z), this produces invalid dates like ...ZZ.
Why: While the current Drizzle schema uses mode: 'string' which returns timestamps without Z, this is fragile if the format changes. Invalid dates produce NaN which propagates silently.
Fix: Add validation that parsed dates are valid:
| const createdMs = new Date(`${bounds.data.createdAt}Z`).getTime(); | |
| const updatedMs = new Date(`${bounds.data.updatedAt}Z`).getTime(); | |
| const createdMs = new Date(`${bounds.data.createdAt}Z`).getTime(); | |
| const updatedMs = new Date(`${bounds.data.updatedAt}Z`).getTime(); | |
| if (!Number.isFinite(createdMs) || !Number.isFinite(updatedMs)) { | |
| return { start: now - DEFAULT_LOOKBACK_MS, end: now }; | |
| } |
Refs:
| if (startParam != null && endParam != null) { | ||
| return { start: Number(startParam), end: Number(endParam) }; |
There was a problem hiding this comment.
💭 Consider: Validate parsed numbers for NaN
Issue: Number() conversion from query params doesn't validate for NaN. If params are malformed strings like ?start=invalid, this returns { start: NaN, end: NaN }.
Why: NaN values would propagate to SigNoz queries causing unclear failures. The current code assumes callers provide valid numeric strings.
Fix: Consider validating before return:
| if (startParam != null && endParam != null) { | |
| return { start: Number(startParam), end: Number(endParam) }; | |
| if (startParam != null && endParam != null) { | |
| const start = Number(startParam); | |
| const end = Number(endParam); | |
| if (Number.isFinite(start) && Number.isFinite(end)) { | |
| return { start, end }; | |
| } | |
| } |
There was a problem hiding this comment.
PR Review Summary
(3) Total Issues | Risk: Low
🟠⚠️ Major (1) 🟠⚠️
Inline Comments:
- 🟠 Major:
signoz-conversation-time-range.ts:34External API call without explicit timeout
🟡 Minor (0) 🟡
No new minor issues.
💭 Consider (1) 💭
💭 1) signoz-conversation-time-range.ts:44-52 Fallback metric for operational visibility
Issue: The fallback to 180-day lookback is now properly logged with structured context, which is good. However, if the bounds API starts failing frequently, this would silently degrade query performance without triggering alerts.
Why: Logging alone may not be easily alertable in high-volume environments.
Fix: If operational alerting on fallback frequency is desired, consider emitting a counter metric (e.g., signoz.conversation_bounds.fallback) alongside the log. Otherwise, the current logging is acceptable for debugging.
Refs:
🕐 Pending Recommendations (2)
- 🟠
signoz-conversation-time-range.tsMissing unit tests forgetConversationTimeRange(prior review finding, still unaddressed) - 🟡
signoz-conversation-time-range.ts:35-36Timestamp parsing assumes format without timezone suffix (prior Pullfrog finding — NaN validation mitigates but doesn't fully resolve)
💡 APPROVE WITH SUGGESTIONS
Summary: Good improvement! This PR successfully addresses the 500 errors and query optimization goal. The delta since the last review shows meaningful improvements: NaN validation was added for timestamp parsing, and the catch block now logs with proper context instead of being silent. The main remaining gap is the lack of a timeout on the bounds API call (Major) and the still-unaddressed test coverage from the prior review. The API endpoint and error logging enhancements are solid.
Discarded (4)
| Location | Issue | Reason Discarded |
|---|---|---|
spans/[spanId]/route.ts:14 |
Duplicate DEFAULT_LOOKBACK_MS constant |
Pre-existing issue — this file was NOT changed in the PR |
signoz.ts:131-137 |
Error logging could include more context | Already an improvement over existing logging; additional context is nice-to-have |
conversations.ts:98-140 |
New endpoint lacks route-level tests | Endpoint is simple, delegates to tested getConversation, acceptable for now |
nine-turtles-lay.md:6 |
Changeset message style | Nitpick — message is descriptive enough |
Reviewers (6)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-consistency |
1 | 0 | 0 | 0 | 0 | 0 | 1 |
pr-review-errors |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-tests |
1 | 0 | 0 | 0 | 0 | 1 | 0 |
pr-review-types |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-sre |
2 | 0 | 1 | 0 | 1 | 0 | 0 |
| Total | 4 | 0 | 1 | 0 | 1 | 1 | 1 |
Note: Most reviewers found no new issues because the delta addressed prior review feedback (NaN validation, logging). The test coverage finding was already raised in the prior review.
| try { | ||
| const bounds = await makeManagementApiRequest<{ | ||
| data: { createdAt: string; updatedAt: string }; | ||
| }>(`tenants/${tenantId}/projects/${projectId}/conversations/${conversationId}/bounds`); |
There was a problem hiding this comment.
🟠 MAJOR: External API call without explicit timeout
Issue: The makeManagementApiRequest call to fetch conversation bounds does not have a timeout configured. If the management API is slow or unresponsive, this will hang indefinitely, adding significant latency to trace timeline loads.
Why: While the catch block handles total failures, a slow (but not failed) request will still block the calling code. This could cause the trace timeline to appear frozen while waiting for bounds.
Fix: Consider wrapping with a timeout using Promise.race:
const BOUNDS_TIMEOUT_MS = 3000; // 3 seconds
const boundsPromise = makeManagementApiRequest<{
data: { createdAt: string; updatedAt: string };
}>(`tenants/${tenantId}/projects/${projectId}/conversations/${conversationId}/bounds`);
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(() => reject(new Error('Bounds fetch timeout')), BOUNDS_TIMEOUT_MS)
);
const bounds = await Promise.race([boundsPromise, timeoutPromise]);Refs:
- api-logging-guidelines skill — recommends logging slow operations
There was a problem hiding this comment.
PR Review Summary
(0) New Issues | Risk: Low
Delta Review Summary
This re-review is scoped to changes since the last automated review. The delta consists of 1 file with 4 lines changed:
timeline-wrapper.tsx — Removed projectId from span fetch URL and useEffect dependency array.
Verification:
- ✅ The
/api/signoz/spans/[spanId]route does NOT useprojectId(onlyconversationIdandtenantId) — removal is correct - ✅ URL encoding via
encodeURIComponentis properly applied to both dynamic values - ✅ Dependency array correctly reflects actual dependencies per React hooks exhaustive-deps rule
- ✅ No stale closure issues — if
projectIdchanges, the effect doesn't need to re-run since the API doesn't use it
No new issues introduced by this delta.
🕐 Pending Recommendations (3)
These items were raised in prior reviews and remain unaddressed:
- 🟠
signoz-conversation-time-range.ts:34External API call without explicit timeout - 🟠
signoz-conversation-time-range.tsMissing unit tests forgetConversationTimeRange - 🟡
signoz-conversation-time-range.ts:35-36Timestamp parsing assumes format without timezone suffix
💡 APPROVE WITH SUGGESTIONS
Summary: The delta change is clean and correct — it removes an unused projectId parameter from the span fetch call, aligning the client with what the server actually expects. No new issues introduced. The pending recommendations from prior reviews (timeout, tests, timestamp parsing) remain open but don't block this PR. Ship it! 🚀
Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-frontend |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Both reviewers confirmed the delta is correct with no issues to report.
No description provided.