🚀🌐✨ Telemetry, Health Checks, Security & Reliability Improvements #431
Merged
- Add ConfigMap-based state management for first-run detection
- Add anonymous telemetry to track kwatch usage (opt-in)
- Track notified versions to avoid duplicate upgrade notifications
- Add startup manager to clean up main.go
- Add comprehensive tests for new packages
- Update documentation with telemetry config

Features:
- telemetry.enabled: Send cluster ID and version on first run
- notified-version: Track which upgrade version user was notified about
- state configmap: Store cluster-id, version, first-run, telemetry-sent

- Add kwatch user with UID/GID 1000 in Dockerfile
- Switch to non-root user before running kwatch
- Add securityContext to deploy.yaml (runAsNonRoot, readOnlyRootFilesystem)
- Update Helm chart values.yaml to use UID 1000
- Update README documentation

Addresses issue #411
- Add HTTP health check server with /healthz and /health endpoints
- Add configurable port (default: 8060) and enabled flag
- Add readiness and liveness probe configuration to deployments
- Update Helm chart with service port and probe settings
- Add tests for health check server
- Update README documentation

Addresses issue #295
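For reviewers who want a picture of what such a health endpoint amounts to, here is a minimal sketch using only the Go standard library. It is not the code from this PR: `newHealthServer` and `probe` are hypothetical names, and the only details taken from the commit message are the two paths and the default port 8060.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newHealthServer builds an HTTP server that answers 200 OK on /healthz and
// /health. The function name and response body are assumptions made for
// illustration; only the paths and the default port come from the PR text.
func newHealthServer(port int) *http.Server {
	ok := func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", ok)
	mux.HandleFunc("/health", ok)
	return &http.Server{
		Addr:              fmt.Sprintf(":%d", port),
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
	}
}

// probe sends one in-memory GET to the server's handler and returns the
// status code, which is roughly what a kubelet probe would observe.
func probe(srv *http.Server, path string) int {
	rec := httptest.NewRecorder()
	srv.Handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	srv := newHealthServer(8060)
	fmt.Println(probe(srv, "/healthz"), probe(srv, "/health"))
	// A long-running process would instead start the listener, for example:
	//   go func() { _ = srv.ListenAndServe() }()
}
```

Keeping the handler separate from the listener makes the endpoints testable without binding a port.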
- Add NodeName field to event.Event struct
- Update FormatMarkdown, FormatHtml, FormatText methods to include node name
- Populate NodeName from pod spec in executePodFilters and executeContainersFilters
- Update all alert providers to include node name in notifications:
  - Slack, Discord, Teams, Telegram, PagerDuty, OpsGenie
  - Mattermost, Webhook, Email, Zenduty
- Update test files to include NodeName in event structs

Fixes #407
Security:
- Fix JSON injection in PagerDuty provider (escape all user fields)
- Fix JSON injection in DingTalk provider (escape title, msg)
- Fix nil response panic in Teams provider
- Add safe type assertion in Webhook provider
- Mask sensitive tokens in Telegram logging
Reliability:
- Add generic retry helper for ConfigMap updates (state/retry.go)
- Update StateManager to use retry logic for all mutations
- Add graceful shutdown handling in main.go
- Add PVC monitor cleanup to prevent memory leaks
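The retry helper in state/retry.go is not shown in this conversation. As a rough sketch of the idea, assuming "generic" means a type-parameterized helper and assuming a fixed delay between attempts, it might look like the following. The name `retry`, its signature, and `errConflict` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for an update conflict returned by the API server.
// It is a placeholder for this sketch, not an error type from the PR.
var errConflict = errors.New("conflict")

// retry calls fn until it succeeds or attempts are exhausted, sleeping for
// delay between failed attempts. It returns the last error, wrapped.
func retry[T any](attempts int, delay time.Duration, fn func() (T, error)) (T, error) {
	var zero T
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		var v T
		if v, err = fn(); err == nil {
			return v, nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return zero, fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulate a ConfigMap update that conflicts twice, then succeeds.
	v, err := retry(3, 10*time.Millisecond, func() (string, error) {
		calls++
		if calls < 3 {
			return "", errConflict
		}
		return "updated", nil
	})
	fmt.Println(v, err, calls)
}
```

A real ConfigMap update would normally re-read the object inside `fn` before each attempt, so that every retry works against the latest resource version.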
Performance:
- Add generic HTTP client with 30s timeout (util/http.go)
- Update all 12 alert providers to use timeout-enabled client
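The shared client itself is a small amount of code. A minimal sketch, assuming a constructor-style helper (the names `newHTTPClient` and `defaultTimeout` are hypothetical; only the 30s value comes from the commit message):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// defaultTimeout bounds the whole request: connect, redirects, and reading
// the response body.
const defaultTimeout = 30 * time.Second

// newHTTPClient returns a client that gives up after defaultTimeout instead
// of waiting indefinitely, which is what a zero-value http.Client would do.
func newHTTPClient() *http.Client {
	return &http.Client{Timeout: defaultTimeout}
}

func main() {
	client := newHTTPClient()
	fmt.Println("provider request timeout:", client.Timeout)
}
```

Without a timeout, a provider endpoint that accepts a connection but never replies could block the sending goroutine forever.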
Minor fixes:
- Add Content-Type header to Zenduty requests
- Fix Webhook error message ('teams' -> 'webhook')
- Fix Telegram log message ('initializing with' -> 'initializing webhook with')
- Add mutex to PVC monitor for thread safety
Events can exceed Discord's 6000 char message limit. Added chunking to split events into 1024-char segments, similar to Slack provider. Also added test coverage for the chunks function.
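The chunking described above can be sketched as follows. This is an illustration rather than the function from the PR: the signature of `chunks` is assumed, and splitting on runes (not bytes) is a choice made here so that multi-byte characters are never cut in half.

```go
package main

import (
	"fmt"
	"strings"
)

// chunks splits s into consecutive segments of at most size runes.
// An empty input yields no segments; a non-positive size returns s whole.
func chunks(s string, size int) []string {
	if size <= 0 {
		return []string{s}
	}
	runes := []rune(s)
	var out []string
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		out = append(out, string(runes[start:end]))
	}
	return out
}

func main() {
	logs := strings.Repeat("x", 2500)
	for i, part := range chunks(logs, 1024) {
		fmt.Printf("segment %d: %d chars\n", i, len([]rune(part)))
	}
}
```

Each segment can then be sent as its own field or message, so that no single piece exceeds the per-segment limit.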
The checkUsage() function was reading and writing to notifiedPvc map without acquiring the mutex, causing a race condition with the cleanup() function. Added mutex lock/unlock to protect all notifiedPvc access.
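The shape of that fix can be illustrated with a small sketch. The `notifiedPvc` map name is taken from the commit message; the struct layout and the `markNotified`, `cleanup`, and `size` methods are assumptions made for this example and do not mirror the real monitor.

```go
package main

import (
	"fmt"
	"sync"
)

// pvcMonitor guards its notifiedPvc map with a mutex so that concurrent
// readers and writers never touch the map at the same time.
type pvcMonitor struct {
	mu          sync.Mutex
	notifiedPvc map[string]bool
}

func newPvcMonitor() *pvcMonitor {
	return &pvcMonitor{notifiedPvc: make(map[string]bool)}
}

// markNotified records name and reports whether this was the first time.
func (m *pvcMonitor) markNotified(name string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.notifiedPvc[name] {
		return false
	}
	m.notifiedPvc[name] = true
	return true
}

// cleanup drops entries for PVCs that are no longer active, which keeps the
// map from growing without bound.
func (m *pvcMonitor) cleanup(active map[string]bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for name := range m.notifiedPvc {
		if !active[name] {
			delete(m.notifiedPvc, name)
		}
	}
}

// size returns the number of tracked PVCs.
func (m *pvcMonitor) size() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.notifiedPvc)
}

func main() {
	m := newPvcMonitor()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.markNotified(fmt.Sprintf("pvc-%d", i%5))
			m.cleanup(map[string]bool{"pvc-0": true})
		}(i)
	}
	wg.Wait()
	fmt.Println("entries left:", m.size())
}
```

Running a program like this with `go run -race` is a quick way to confirm that no unguarded map access remains.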
CodeQL flags json.Marshal usage as potentially unsafe quoting. These are false positives since json.Marshal properly escapes all special characters (quotes, newlines, backslashes, etc). Added //nolint:gosec comments to suppress warnings after manual security review confirms the code is safe.
Refactor SendEvent and SendMessage to use structs with json.Marshal instead of fmt.Sprintf with JsonEscape. This satisfies CodeQL security checks and is the recommended approach for building JSON payloads.
Refactor PagerDuty and Telegram providers to use structs with json.Marshal instead of fmt.Sprintf with JsonEscape. This satisfies CodeQL security checks.

Changes:
- PagerDuty: Added pagerdutyPayload struct and refactored buildRequestBodyPagerDuty to return (string, error)
- Telegram: Added telegramPayload struct and refactored buildRequestBodyTelegram
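To make the pattern concrete, here is a minimal sketch of building a request body from a struct. The `examplePayload` type, its fields, and `buildRequestBody` are hypothetical and do not reproduce the pagerdutyPayload or telegramPayload structs from this PR; only the (string, error) return shape follows the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// examplePayload is a stand-in for a provider payload struct.
type examplePayload struct {
	Title string `json:"title"`
	Text  string `json:"text"`
}

// buildRequestBody serializes the payload with json.Marshal, which escapes
// quotes, newlines, and backslashes in the field values, so user-controlled
// text cannot break out of its JSON string.
func buildRequestBody(title, text string) (string, error) {
	b, err := json.Marshal(examplePayload{Title: title, Text: text})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := buildRequestBody(`pod "api" crashed`, "line one\nline two")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(body)
}
```

Compared with fmt.Sprintf templates, the struct approach removes the need to remember an escape call for every interpolated field.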
✨ Summary
- HTTP health check server (/healthz, /health)
🔧 Changes
🆕 New packages
- state/ - ConfigMap-based state persistence
- telemetry/ - Anonymous usage tracking (opt-in)
- startup/ - First-run detection and upgrade notifications
- health/ - HTTP health check server
🛡️ Security fixes
🔧 Reliability fixes
⚡ Performance fixes
💬 Provider fixes
- 'attachment' -> 'attachments'
🐛 Fixes
✅ Testing
All tests pass ✅