🚀🌐✨ Telemetry, Health Checks, Security & Reliability Improvements#431

Merged
abahmed merged 13 commits into main from feature/telemetry-and-state
Mar 25, 2026

@abahmed commented on Mar 24, 2026

✨ Summary

  • 📊 ConfigMap-based first-run detection and state management
  • 📈 Anonymous opt-in telemetry for usage tracking
  • ❤️ HTTP health check endpoints (/healthz, /health)
  • 🔒 Run as non-root user (UID 1000) for security
  • 📍 Add node name to all alert provider notifications
  • 🛡️ Security fixes: JSON injection, type assertions, nil checks
  • 🔧 Reliability fixes: race conditions, memory leaks, HTTP timeouts
  • 💬 Discord events chunking to prevent API limits

🔧 Changes

🆕 New packages

  • state/ - ConfigMap-based state persistence
  • telemetry/ - Anonymous usage tracking (opt-in)
  • startup/ - First-run detection and upgrade notifications
  • health/ - HTTP health check server

🛡️ Security fixes

  • Fix JSON injection in PagerDuty, DingTalk (use json.Marshal)
  • Fix nil response panic in Teams provider
  • Add safe type assertion in Webhook provider
  • Mask sensitive tokens in Telegram logging
  • CodeQL: Refactor alert providers to use struct + json.Marshal

🔧 Reliability fixes

  • Add generic retry helper for ConfigMap updates
  • Add mutex to PVC monitor for thread safety
  • Add graceful shutdown handling
  • Add PVC monitor cleanup to prevent memory leaks

⚡ Performance fixes

  • Add generic HTTP client with 30s timeout
  • Update all 12 alert providers to use timeout client

💬 Provider fixes

  • Discord: Chunk long events to prevent API limits
  • Teams: Fix JSON key from attachment to attachments

🐛 Fixes

✅ Testing

All tests pass ✅

abahmed added 3 commits March 24, 2026 20:44
- Add ConfigMap-based state management for first-run detection
- Add anonymous telemetry to track kwatch usage (opt-in)
- Track notified versions to avoid duplicate upgrade notifications
- Add startup manager to clean up main.go
- Add comprehensive tests for new packages
- Update documentation with telemetry config

Features:
- telemetry.enabled: Send cluster ID and version on first run
- notified-version: Track which upgrade version user was notified about
- state configmap: Store cluster-id, version, first-run, telemetry-sent

- Add kwatch user with UID/GID 1000 in Dockerfile
- Switch to non-root user before running kwatch
- Add securityContext to deploy.yaml (runAsNonRoot, readOnlyRootFilesystem)
- Update Helm chart values.yaml to use UID 1000
- Update README documentation

Addresses issue #411

- Add HTTP health check server with /healthz and /health endpoints
- Add configurable port (default: 8060) and enabled flag
- Add readiness and liveness probe configuration to deployments
- Update Helm chart with service port and probe settings
- Add tests for health check server
- Update README documentation

Addresses issue #295
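
The health endpoints described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the `health/` package: the names `newHealthMux` and `probe` and the `"ok"` response body are assumptions, and the PR itself only states the two paths and the default port 8060.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newHealthMux registers the two health paths named in the commit message.
// The "ok" body is an assumption; the PR does not show the response format.
func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	ok := func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}
	mux.HandleFunc("/healthz", ok)
	mux.HandleFunc("/health", ok)
	return mux
}

// probe sends an in-memory GET to the mux and returns the status code,
// so the sketch can be exercised without opening a real port.
func probe(path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newHealthMux().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// A real server would listen on the configured port, for example
	// http.ListenAndServe(":8060", newHealthMux()).
	for _, p := range []string{"/healthz", "/health", "/missing"} {
		fmt.Println(p, probe(p))
	}
}
```

Serving both paths from one handler keeps liveness and readiness probes interchangeable; a stricter readiness check would need its own handler.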

@abahmed changed the title from "🚀 add telemetry and state management" to "🚀🌐✨ kwatch v0.11.0: Telemetry, Security & Health Checks" on Mar 24, 2026
@abahmed changed the title from "🚀🌐✨ kwatch v0.11.0: Telemetry, Security & Health Checks" to "🚀🌐✨ Telemetry, Security & Health Checks" on Mar 24, 2026
abahmed added 2 commits March 24, 2026 21:14
- Add NodeName field to event.Event struct
- Update FormatMarkdown, FormatHtml, FormatText methods to include node name
- Populate NodeName from pod spec in executePodFilters and executeContainersFilters
- Update all alert providers to include node name in notifications:
  - Slack, Discord, Teams, Telegram, PagerDuty, OpsGenie
  - Mattermost, Webhook, Email, Zenduty
- Update test files to include NodeName in event structs

Fixes #407

@abahmed changed the title from "🚀🌐✨ Telemetry, Security & Health Checks" to "🚀🌐✨ Telemetry, Health Checks, Non-Root Security, Node Name in Alerts" on Mar 24, 2026

Security:
- Fix JSON injection in PagerDuty provider (escape all user fields)
- Fix JSON injection in DingTalk provider (escape title, msg)
- Fix nil response panic in Teams provider
- Add safe type assertion in Webhook provider
- Mask sensitive tokens in Telegram logging
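
Two of the security items above, the safe type assertion and token masking, follow a common Go pattern that might look like the sketch below. The helper names `stringField` and `maskToken`, the config map shape, and the choice to keep the last four characters are all assumptions for illustration; the PR does not show the actual Webhook or Telegram code.

```go
package main

import (
	"errors"
	"fmt"
)

// stringField reads a string from a loosely typed config map using the
// comma-ok assertion, so a missing key or a non-string value yields an
// error instead of a panic. (Hypothetical helper.)
func stringField(cfg map[string]interface{}, key string) (string, error) {
	v, ok := cfg[key].(string)
	if !ok || v == "" {
		return "", errors.New("missing or invalid " + key)
	}
	return v, nil
}

// maskToken hides all but the last four bytes of a secret before logging.
// It assumes ASCII tokens; short tokens are masked entirely.
func maskToken(tok string) string {
	const keep = 4
	if len(tok) <= keep {
		return "****"
	}
	return "****" + tok[len(tok)-keep:]
}

func main() {
	cfg := map[string]interface{}{"url": 42, "token": "123456:ABCdef"}

	if _, err := stringField(cfg, "url"); err != nil {
		fmt.Println("config error:", err)
	}
	if tok, err := stringField(cfg, "token"); err == nil {
		fmt.Println("initializing with token", maskToken(tok))
	}
}
```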

Reliability:
- Add generic retry helper for ConfigMap updates (state/retry.go)
- Update StateManager to use retry logic for all mutations
- Add graceful shutdown handling in main.go
- Add PVC monitor cleanup to prevent memory leaks
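
A retry helper of the kind mentioned for ConfigMap updates could be sketched as below. This assumes "generic" means Go type parameters and uses a fixed delay between attempts; the real `state/retry.go` may differ. `retry`, `simulateUpdate`, and `errConflict` are hypothetical names, and the simulated conflict stands in for a Kubernetes update conflict.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for an update conflict returned by the API server.
var errConflict = errors.New("conflict: object was modified")

// retry calls fn up to attempts times, sleeping delay between failures,
// and returns the first successful result or the last error, wrapped.
func retry[T any](attempts int, delay time.Duration, fn func() (T, error)) (T, error) {
	var zero T
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := fn()
		if err == nil {
			return v, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return zero, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// simulateUpdate fails with errConflict the first `failures` times and
// then succeeds. It returns how many calls were made and the final error.
func simulateUpdate(failures, attempts int) (int, error) {
	calls := 0
	_, err := retry(attempts, time.Millisecond, func() (string, error) {
		calls++
		if calls <= failures {
			return "", errConflict
		}
		return "updated", nil
	})
	return calls, err
}

func main() {
	calls, err := simulateUpdate(2, 3)
	fmt.Println("calls:", calls, "err:", err)

	calls, err = simulateUpdate(5, 3)
	fmt.Println("calls:", calls, "conflict:", errors.Is(err, errConflict))
}
```

Wrapping the last error with `%w` lets callers still match the underlying cause with `errors.Is`.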

Performance:
- Add generic HTTP client with 30s timeout (util/http.go)
- Update all 12 alert providers to use timeout-enabled client
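
The effect of a client-level timeout can be illustrated with the standard library alone. The sketch below is an assumption about how such a shared client might be built, not the contents of `util/http.go`; `newClient`, `timesOut`, and `defaultTimeout` are hypothetical names, and the demonstration uses millisecond timeouts so it finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// defaultTimeout mirrors the 30s value stated in the commit message.
const defaultTimeout = 30 * time.Second

// newClient returns an http.Client whose Timeout bounds the whole request,
// including connection, redirects, and reading the response body.
func newClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

// timesOut starts a local test server that waits delayMs before replying
// and reports whether a client with timeoutMs gives up with a timeout error.
func timesOut(timeoutMs, delayMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	resp, err := newClient(time.Duration(timeoutMs) * time.Millisecond).Get(srv.URL)
	if err != nil {
		var ne net.Error
		return errors.As(err, &ne) && ne.Timeout()
	}
	resp.Body.Close()
	return false
}

func main() {
	fmt.Println("default timeout:", newClient(defaultTimeout).Timeout)
	fmt.Println("slow server times out:", timesOut(20, 300))
	fmt.Println("fast server times out:", timesOut(2000, 0))
}
```

Without a timeout, `http.DefaultClient` waits indefinitely, so one unresponsive alert endpoint could block a provider's send call.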

Minor fixes:
- Add Content-Type header to Zenduty requests
- Fix Webhook error message ('teams' -> 'webhook')
- Fix Telegram log message ('initializing  with' -> 'initializing webhook with')
- Add mutex to PVC monitor for thread safety

Comment threads marked Fixed:
- alertmanager/dingtalk/dingtalk.go (3 threads)
- alertmanager/pagerduty/pagerduty.go (6 threads)
- alertmanager/telegram/telegram.go (1 thread)

abahmed added 6 commits March 24, 2026 22:42
Events can exceed Discord's 6000 char message limit. Added chunking
to split events into 1024-char segments, similar to Slack provider.
Also added test coverage for the chunks function.
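
The chunking described above can be sketched as a simple fixed-size split. This is an illustration under assumptions: `chunkString` is a hypothetical name, and splitting on runes (rather than bytes or UTF-16 units) is a choice made here to avoid cutting a multi-byte character; the PR does not show how its chunks function counts length.

```go
package main

import "fmt"

// chunkString splits s into consecutive pieces of at most size runes.
// It returns nil for an empty string or a non-positive size.
func chunkString(s string, size int) []string {
	if size <= 0 || s == "" {
		return nil
	}
	runes := []rune(s)
	var out []string
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		out = append(out, string(runes[start:end]))
	}
	return out
}

func main() {
	// 2500 single-byte characters split into 1024-character segments.
	events := string(make([]byte, 2500))
	for i, c := range chunkString(events, 1024) {
		fmt.Printf("chunk %d: %d chars\n", i, len([]rune(c)))
	}
}
```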

The checkUsage() function was reading and writing to notifiedPvc map
without acquiring the mutex, causing a race condition with the cleanup()
function. Added mutex lock/unlock to protect all notifiedPvc access.
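
The race described in this commit is the usual unsynchronized-map problem, and the fix pattern can be sketched as follows. The type `pvcTracker` and its methods are hypothetical and only loosely modeled on the notifiedPvc map, checkUsage(), and cleanup() named in the message.

```go
package main

import (
	"fmt"
	"sync"
)

// pvcTracker guards its map with a mutex so that marking and cleanup
// can run from different goroutines without a data race.
type pvcTracker struct {
	mu       sync.Mutex
	notified map[string]bool
}

func newPvcTracker() *pvcTracker {
	return &pvcTracker{notified: make(map[string]bool)}
}

// markNotified records name and reports whether this is the first time.
func (t *pvcTracker) markNotified(name string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.notified[name] {
		return false
	}
	t.notified[name] = true
	return true
}

// cleanup removes entries that are no longer active, bounding map growth.
func (t *pvcTracker) cleanup(active map[string]bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for name := range t.notified {
		if !active[name] {
			delete(t.notified, name)
		}
	}
}

// size returns the number of tracked entries.
func (t *pvcTracker) size() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.notified)
}

// firstNotifications runs workers goroutines that all mark the same PVC
// and then run cleanup, returning how many saw a first notification.
func firstNotifications(workers int) int {
	t := newPvcTracker()
	active := map[string]bool{"data-pvc": true} // read-only, safe to share
	var wg sync.WaitGroup
	var countMu sync.Mutex
	first := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if t.markNotified("data-pvc") {
				countMu.Lock()
				first++
				countMu.Unlock()
			}
			t.cleanup(active)
		}()
	}
	wg.Wait()
	return first
}

func main() {
	fmt.Println("first notifications:", firstNotifications(50))
}
```

Running such code with `go run -race` is a quick way to confirm that no unguarded access remains.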

CodeQL flags json.Marshal usage as potentially unsafe quoting.
These are false positives since json.Marshal properly escapes all
special characters (quotes, newlines, backslashes, etc).

Added //nolint:gosec comments to suppress warnings after manual
security review confirms the code is safe.

Refactor SendEvent and SendMessage to use structs with json.Marshal
instead of fmt.Sprintf with JsonEscape. This satisfies CodeQL
security checks and is the recommended approach for building JSON payloads.

Refactor PagerDuty and Telegram providers to use structs with
json.Marshal instead of fmt.Sprintf with JsonEscape. This satisfies
CodeQL security checks.

Changes:
- PagerDuty: Added pagerdutyPayload struct and refactored
  buildRequestBodyPagerDuty to return (string, error)
- Telegram: Added telegramPayload struct and refactored
  buildRequestBodyTelegram
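
The difference between the two approaches can be shown with a small example. The struct and field names below (`alertPayload`, `summary`, `node_name`) are hypothetical and do not reflect the real PagerDuty or Telegram schemas or the `pagerdutyPayload` and `telegramPayload` structs in this PR; the point is only that `json.Marshal` escapes field values while string formatting does not.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alertPayload is an illustrative payload, not a real provider schema.
type alertPayload struct {
	Summary  string `json:"summary"`
	NodeName string `json:"node_name"`
}

// buildBody encodes the payload with json.Marshal, which escapes quotes,
// backslashes, and control characters in the field values.
func buildBody(summary, node string) (string, error) {
	b, err := json.Marshal(alertPayload{Summary: summary, NodeName: node})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// buildBodyUnsafe shows the injection-prone pattern: values are pasted
// into the JSON text without any escaping.
func buildBodyUnsafe(summary, node string) string {
	return fmt.Sprintf(`{"summary":"%s","node_name":"%s"}`, summary, node)
}

// roundTrips reports whether the marshaled body decodes back to the inputs.
func roundTrips(summary, node string) bool {
	body, err := buildBody(summary, node)
	if err != nil {
		return false
	}
	var got alertPayload
	if err := json.Unmarshal([]byte(body), &got); err != nil {
		return false
	}
	return got.Summary == summary && got.NodeName == node
}

// unsafeIsValid reports whether the formatted body is valid JSON at all.
func unsafeIsValid(summary, node string) bool {
	return json.Valid([]byte(buildBodyUnsafe(summary, node)))
}

func main() {
	msg := `pod "web" crashed`
	fmt.Println("marshal round-trips:", roundTrips(msg, "node-1"))
	fmt.Println("sprintf output valid:", unsafeIsValid(msg, "node-1"))
}
```

Returning `(string, error)` from the builder, as the PagerDuty change above does, lets callers surface encoding failures instead of sending a malformed body.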

@abahmed changed the title from "🚀🌐✨ Telemetry, Health Checks, Non-Root Security, Node Name in Alerts" to "🚀🌐✨ Telemetry, Health Checks, Security & Reliability Improvements" on Mar 24, 2026
@abahmed merged commit e730122 into main on Mar 25, 2026
4 checks passed