
feat: ShutdownTelemetry, debug endpoint, goroutine-leak tests, and deployment docs#4038

Closed
zishanjawed wants to merge 1 commit into gofiber:main from zishanjawed:feat/shutdown-telemetry-debug-endpoint

Conversation

@zishanjawed

Summary

  • ShutdownTelemetry struct — captures per-phase wall-clock timing (PreHooksDuration, GracefulCloseDuration, DrainDuration, PostHooksDuration, TotalDuration) and connection-disposition counts (InitialConns, DrainedConns, ForcedConns, WebSocketsClosed, SSEsClosed, TimedOut); a field-level sketch follows this list.
  • app.LastShutdownTelemetry() — lock-free accessor via atomic.Pointer; returns nil before the first ShutdownWithConfig completes, or a fully-populated snapshot afterward.
  • app.ShutdownDebugHandler() — returns a Handler that responds with JSON containing status (running / shutting_down / shutdown), activeConnections, and the last telemetry snapshot with human-readable duration strings. Register at any path: app.Get("/debug/shutdown", app.ShutdownDebugHandler()).
  • GracefulCloseTyped — extended to accept *ShutdownTelemetry and increment WebSocketsClosed / SSEsClosed counters sequentially inside the sync.Map.Range (no additional synchronization needed).
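For orientation, here is the struct shape implied by the first bullet. Field names come from the summary above; the concrete types (time.Duration for durations, int for counts, bool for TimedOut) are inferred rather than copied from the diff.

package fiber // sketch only, not the PR's actual declaration

import "time"

type ShutdownTelemetry struct {
	// Per-phase wall-clock timing.
	PreHooksDuration      time.Duration
	GracefulCloseDuration time.Duration
	DrainDuration         time.Duration
	PostHooksDuration     time.Duration
	TotalDuration         time.Duration

	// Connection disposition.
	InitialConns     int
	DrainedConns     int
	ForcedConns      int
	WebSocketsClosed int
	SSEsClosed       int
	TimedOut         bool
}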

Tests (9 new)

| # | Test | Verifies |
| --- | --- | --- |
| 1 | Test_Telemetry_CleanDrain_PopulatesAllFields | Clean shutdown populates all fields; DrainedConns == InitialConns; ForcedConns == 0; TimedOut == false |
| 2 | Test_Telemetry_ForcedClose_CountsCorrectly | Slow handler + short deadline; ForcedConns >= 1; TimedOut == true |
| 3 | Test_Telemetry_HookDurations_Measured | Pre-hook sleeps 50 ms; PreHooksDuration >= 50 ms |
| 4 | Test_Telemetry_WebSocketSSE_Counters | 1 WS + 1 SSE marked; after shutdown WebSocketsClosed == 1, SSEsClosed == 1 |
| 5 | Test_DebugEndpoint_ReturnsRunningStatus | Before shutdown: status == "running", lastShutdown is null |
| 6 | Test_DebugEndpoint_ReturnsTelemetryAfterShutdown | After shutdown: status == "shutdown", all duration strings parseable by time.ParseDuration |
| 7 | Test_Telemetry_NoGoroutineLeak_CleanShutdown | Clean shutdown does not leak goroutines |
| 8 | Test_Telemetry_NoGoroutineLeak_ForcedClose | Force-close shutdown does not leak goroutines |
| 9 | Test_Telemetry_NoGoroutineLeak_DebugHandler | 10 repeated debug-handler calls do not leak goroutines |
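The three leak tests (7–9) presumably follow a baseline-vs-after goroutine-count pattern. A generic sketch of such a helper — not the PR's actual test code; the scenario callback (start an app, shut it down) is supplied by the caller:

package fiber_test // sketch only, not the PR's shutdown_test.go

import (
	"runtime"
	"testing"
	"time"
)

// assertNoGoroutineLeak runs the given scenario and fails the test if the
// goroutine count does not return to the pre-scenario baseline within a
// short settling window.
func assertNoGoroutineLeak(t *testing.T, scenario func()) {
	t.Helper()
	before := runtime.NumGoroutine()

	scenario()

	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= before {
			return // background goroutines (drain monitor, tickers) have exited
		}
		time.Sleep(50 * time.Millisecond)
	}
	t.Fatalf("goroutine leak: before=%d after=%d", before, runtime.NumGoroutine())
}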

Documentation (docs/guide/shutdown.md)

Full reference covering:

  • Quick-start signal-handling example (a rough sketch of this shape follows the list below)
  • ShutdownConfig field reference with defaults
  • ShutdownTelemetry struct and accessor
  • Debug-endpoint registration and JSON response shapes for all three states
  • Lifecycle phase order table (phases 1–9) mapped to telemetry fields
  • Hook usage: OnPreShutdown, OnPostShutdown, per-connection SetCleanupHook, multi-hook registration order
  • Kubernetes integration: annotated Deployment YAML (terminationGracePeriodSeconds, preStop sleep, liveness/readiness probes), full Go application wiring, alignment rules table
  • Load-balancer integration: AWS ALB/NLB deregistration delay, GCP connection draining, generic architecture diagram
  • Troubleshooting: connections-never-drain, hooks firing zero/twice, telemetry nil, WS/SSE counters stuck at zero, timeout mismatches, debug endpoint stuck at shutting_down
  • Concurrency-safety rationale table
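To make the outline concrete, here is a rough reconstruction of the quick-start shape named in the first item — not the actual docs content. The fiber/v3 import path, the ShutdownWithConfig signature, and the empty ShutdownConfig literal are assumptions based on this PR's description.

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/gofiber/fiber/v3" // assumed import path for the v3 main branch
)

func main() {
	app := fiber.New()
	app.Get("/", func(c fiber.Ctx) error { return c.SendString("ok") })

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		if err := app.Listen(":3000"); err != nil {
			log.Println("listen:", err)
		}
	}()

	<-quit // wait for SIGINT/SIGTERM

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Signature assumed from the PR description; defaults used for all fields.
	if err := app.ShutdownWithConfig(ctx, fiber.ShutdownConfig{}); err != nil {
		log.Println("shutdown:", err)
	}

	// Telemetry is only populated once ShutdownWithConfig has completed.
	if tel := app.LastShutdownTelemetry(); tel != nil {
		log.Printf("shutdown completed in %s (drained=%d, forced=%d)",
			tel.TotalDuration, tel.DrainedConns, tel.ForcedConns)
	}
}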

Race-detector note

The race detector flags app.server.IdleTimeout = time.Nanosecond (phase 2 in both ShutdownWithContext and ShutdownWithConfig) racing with fasthttp's serveConn read of the same field. This is a pre-existing race in the upstream codebase — confirmed by running -race against Test_ShutdownWithConfig_OnShutdownStart (a test we did not modify). Our telemetry instrumentation introduces zero new data races.

Test plan

# Build
go build ./...

# Targeted feature + leak tests
go test -v -run "Test_Telemetry_|Test_DebugEndpoint_" -count=1 -timeout 60s .

# Full shutdown regression
go test -run "Test_.*[Ss]hutdown|Test_ActiveConnections|Test_IsShuttingDown|Test_ConnTracking|Test_PostShutdownHooks|Test_GracefulShutdown|Test_MultipleShutdown|Test_ShutdownWithC|Test_RequestContext_|Test_GracefulClose_|Test_CleanShutdown|Test_DrainTime|Test_ContextCancellation|Test_RepeatedShutdown|Test_HookOrder|Test_Integration_|Test_Telemetry_|Test_DebugEndpoint_" -count=1 -timeout 120s .

# Linting
go vet ./...
gofmt -l app.go listen.go shutdown_test.go

🤖 Generated with Claude Code

Adds per-phase timing and connection-disposition metrics collected during
ShutdownWithConfig, exposed through both a programmatic accessor and a
JSON debug handler.  Includes comprehensive documentation covering
configuration, Kubernetes deployment patterns, load-balancer integration,
hook ordering, and a troubleshooting guide.

Key additions:
- ShutdownTelemetry struct capturing timing (drain, hooks, graceful-close)
  and counts (drained, forced, WS/SSE closed)
- app.LastShutdownTelemetry() accessor (atomic read of last snapshot)
- app.ShutdownDebugHandler() returning JSON with status + telemetry
- GracefulCloseTyped now increments WS/SSE counters on the telemetry
- 6 feature tests + 3 goroutine-leak tests
- docs/guide/shutdown.md: full reference including K8s manifests,
  LB drain patterns, hook examples, and troubleshooting

Co-Authored-By: Claude (penguin) <[email protected]>
@coderabbitai
Contributor

coderabbitai Bot commented Jan 28, 2026

Walkthrough

Introduces a comprehensive graceful shutdown system with connection tracking, protocol-aware close logic (WebSocket/SSE), configurable lifecycle hooks, drain workflow, and telemetry collection. Adds new types for tracking connections, shutdown configuration, and shutdown metrics. Updates context methods to propagate shutdown state and adds three new error sentinels for shutdown and protocol scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core Shutdown Lifecycle — src/app.go | Adds atomic live-connection counter, dedicated shutdown context, ShutdownWithContext/ShutdownWithConfig methods, shutdown state tracking, and configurable hooks (OnShutdownStart, OnDrainProgress, OnForceClose, OnPreShutdown, OnPostShutdown). Introduces ShutdownDebug endpoint and telemetry accessors (IsShuttingDown, ActiveConnections, LastShutdownTelemetry). |
| Connection Tracking & Protocol Handling — src/listen.go | Introduces ConnType enum, TrackedConn interface, and ShutdownConfig/ShutdownTelemetry types. Implements connection registry with per-connection cleanup hooks. Adds protocol-aware graceful close for WebSocket (close frame + ack timeout) and SSE (shutdown event + wait). Wraps listener with connTrackingListener for active connection management and force-close behavior. |
| Context Propagation — src/ctx.go, src/ctx_interface_gen.go | Updates DefaultCtx methods (Deadline, Done, Err, Context) to reflect the live application shutdown context instead of cached/no-op behavior. Adds TrackedConn() method to access connection-specific lifecycle controls. Makes Err() return ErrRequestShutdown on cancellation. |
| Error Definitions — src/error.go | Adds three new sentinel errors: ErrRequestShutdown (shutdown signal), ErrWebSocketCloseTimeout (WS handshake timeout), ErrSSECloseWriteFailed (SSE write failure). |
| Documentation — docs/guide/shutdown.md | New comprehensive guide covering shutdown lifecycle phases, three entry points, ShutdownConfig structure, telemetry model, JSON debug endpoint, Kubernetes/load-balancer integration, troubleshooting scenarios, and concurrency notes. |
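For readers skimming the table, a rough sketch of the connection-tracking surface the listen.go row names. The method set shown is a guess assembled from the PR description (only ConnType and SetCleanupHook are named in the text), not the actual definitions.

package fiber // sketch only

// ConnType classifies tracked connections for protocol-aware graceful close.
type ConnType int

const (
	ConnTypeHTTP ConnType = iota
	ConnTypeWebSocket
	ConnTypeSSE
)

// TrackedConn exposes per-connection lifecycle controls to handlers; the real
// interface may carry additional methods omitted here.
type TrackedConn interface {
	ConnType() ConnType
	SetCleanupHook(func() error)
}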

Sequence Diagram

sequenceDiagram
    autonumber
    participant App as App
    participant Listener as Listener
    participant Drain as Drain Monitor
    participant Conns as Active Connections
    participant WS as WebSocket
    participant SSE as SSE

    App->>App: ShutdownWithContext() called
    App->>App: Emit OnShutdownStart(activeConns)
    App->>Drain: Start drain phase with DrainInterval
    
    Drain->>Conns: Poll remaining connections
    Drain->>App: Emit OnDrainProgress(remaining, elapsed)
    Conns-->>Drain: Return count

    rect rgba(100, 200, 100, 0.5)
    Note over App,SSE: Graceful Close Phase
    App->>WS: Close WebSocket connections
    WS->>WS: Send close frame
    WS->>WS: Wait for client ack (timeout)
    WS-->>App: OnWebSocketClose callback
    
    App->>SSE: Close SSE connections
    SSE->>SSE: Write shutdown event
    SSE->>SSE: Wait for disconnect
    SSE-->>App: OnSSEClose callback
    end

    App->>App: OnForceClose if timeout exceeded
    App->>Conns: CloseAll remaining
    
    rect rgba(100, 100, 200, 0.5)
    Note over App: Post-Shutdown Hooks
    App->>App: Execute OnPostShutdown
    App->>App: Populate ShutdownTelemetry
    end

    App-->>Listener: Graceful shutdown complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes introduce intricate logic across multiple domains: atomic connection tracking with protocol-aware graceful close sequences, context propagation with lifecycle awareness, configurable multi-phase shutdown with hooks and telemetry collection, and handler-specific cleanup for WebSocket/SSE. The heterogeneous nature (spanning app lifecycle, connection management, context semantics, and error handling) and density of new types/methods (ShutdownConfig, ShutdownTelemetry, TrackedConn, protocol enums) demand careful verification of correctness, edge cases, and integration points.

Suggested reviewers

  • sixcolors
  • ReneWerner87

Poem

🐰 A hop, a drain, a graceful flow,
Connections close with gentle show,
WebSockets bid farewell with care,
Telemetry tracks each shutdown prayer,
Fiber shuts down, clean and fair!

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main additions: ShutdownTelemetry, debug endpoint, goroutine-leak tests, and deployment docs. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering summary, tests, documentation, race-detector notes, and test plan. It follows the template structure with detailed information on all major changes. |



@gemini-code-assist
Contributor

Summary of Changes

Hello @zishanjawed, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive graceful shutdown mechanism with detailed telemetry, a debug endpoint, and extensive documentation. It enhances the application's ability to shut down gracefully, providing insights into the process and ensuring no resource leaks. The changes include a new ShutdownTelemetry struct, updates to connection handling, and new tests to validate the shutdown process.

Highlights

  • Shutdown Telemetry: Introduces a ShutdownTelemetry struct to capture per-phase timing and connection-disposition counts during shutdown, providing insights into the shutdown process.
  • Debug Endpoint: Adds a /debug/shutdown endpoint that returns JSON containing the server's status, active connections, and the last shutdown telemetry snapshot, aiding in monitoring and debugging.
  • Graceful Close Extension: GracefulCloseTyped is extended to handle WebSocket and SSE connections, ensuring they are gracefully closed during shutdown, with counters for tracking.
  • Goroutine Leak Tests: Adds nine new tests, including goroutine-leak checks verifying that clean and forced shutdowns, as well as repeated debug-handler calls, do not leak goroutines, ensuring stability.
  • Documentation: Comprehensive documentation (docs/guide/shutdown.md) covers various aspects of graceful shutdown, including configuration, telemetry, debug endpoint, lifecycle phases, hooks, Kubernetes integration, load balancer integration, and troubleshooting.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@app.go`:
- Around line 1200-1207: The write/read of app.shutdownCtx must be made atomic:
change the field type from context.Context to atomic.Pointer[context.Context]
(importing sync/atomic and unsafe), update the writer in the init/swap site
(where cfg.RequestContext is assigned under app.mutex.Lock()) to use
(*atomic.Pointer[context.Context]).Store((*context.Context)(unsafe.Pointer(&cfg.RequestContext)))
or equivalent atomic.StorePointer wrapping the cfg.RequestContext value, and
update reads (e.g., in ctx.go where code does "if parent := c.app.shutdownCtx;
parent != nil") to use atomic.LoadPointer and convert back to *context.Context
for nil-checking and use; ensure all places that previously accessed
app.shutdownCtx directly are replaced with atomic loads/stores so concurrent
access is race-free.

In `@ctx.go`:
- Around line 149-154: The read of c.app.shutdownCtx in ctx.go is a potential
data race against the write in ShutdownWithConfig (where RequestContext replaces
it); make shutdownCtx an atomic pointer (e.g., atomic.Pointer[context.Context]
or atomic.Value) on the App struct and update all accesses: replace direct reads
of c.app.shutdownCtx with an atomic load (e.g., shutdownCtx.Load()) and replace
the write in ShutdownWithConfig with an atomic store
(shutdownCtx.Store(newCtx)), keeping the same nil-check semantics so the
function returns the loaded context if non-nil or context.Background()
otherwise.
- Around line 177-185: DefaultCtx.Err currently replaces any context error with
ErrRequestShutdown, losing deadline/cancellation details; change Err to fetch
ctxErr := c.Context().Err() and return ctxErr unless the cancellation is
specifically due to server shutdown—i.e., detect shutdown (use the existing
shutdown signal/state available on DefaultCtx such as a shutdown flag, shutdown
channel or shutdown context) and only then return ErrRequestShutdown; reference
DefaultCtx.Err, c.Context().Err(), ErrRequestShutdown and RequestDeadline when
making the conditional mapping.

In `@docs/guide/shutdown.md`:
- Around line 50-54: The example reads telemetry too early: calling
app.LastShutdownTelemetry() before ShutdownWithConfig runs always returns nil;
move or remove that block so telemetry is only accessed after ShutdownWithConfig
completes (i.e., call LastShutdownTelemetry() after the ShutdownWithConfig call
and check for nil before logging), referencing the LastShutdownTelemetry() and
ShutdownWithConfig functions to locate and update the example.
🧹 Nitpick comments (6)

docs/guide/shutdown.md (3)

266-268: Add language specifier to fenced code block.

The static analysis tool flagged this code block as missing a language specifier (MD040). Consider adding `text` or an appropriate language identifier.

Suggested fix:
-```
+```text
 OnShutdownStart → pre-hooks → GracefulCloseTyped → drain → force-close → post-hooks → return

487-495: Add language specifier to fenced code block.

This ASCII diagram code block is missing a language specifier (MD040). Consider adding `text` for plain text content.

Suggested fix:
-```
+```text
 ┌─────────────┐   stop routing   ┌────────────────┐  drain timeout  ┌──────────┐
 │ Load Balancer│ ─────────────>   │  Fiber Instance │ ──────────────> │  Done    │

560-562: Add language specifier to fenced code block.

This code block showing the timeout relationship is missing a language specifier (MD040).

Suggested fix:
-```
+```text
 terminationGracePeriodSeconds  <  preStop + ShutdownTimeout + hook_time

app.go (1)

`1376-1394`: **Minor: Double atomic load of `lastTelemetry`.**

The handler loads `lastTelemetry` twice (lines 1380 and 1390). While functionally correct due to atomic guarantees, a single load would be slightly more efficient.

Suggested optimization:
 func (app *App) ShutdownDebugHandler() Handler {
 	return func(c Ctx) error {
 		status := "running"
+		tel := app.lastTelemetry.Load()
 		if app.IsShuttingDown() {
-			if app.lastTelemetry.Load() != nil {
+			if tel != nil {
 				status = "shutdown"
 			} else {
 				status = "shutting_down"
 			}
 		}
 
 		resp := shutdownDebugResponse{
 			Status:            status,
 			ActiveConnections: app.ActiveConnections(),
-			LastShutdown:      telemetryToJSON(app.lastTelemetry.Load()),
+			LastShutdown:      telemetryToJSON(tel),
 		}
 		return c.JSON(resp)
 	}
 }
listen.go (2)

380-389: Close() may call underlying Conn.Close() multiple times.

When CloseAll() wins the CAS race, it calls tc.Conn.Close(). If the handler subsequently calls tc.Close(), the CAS fails but c.Conn.Close() is still called (line 388). While double-close on net.Conn is generally safe (returns an error on second call), this behavior should be documented or guarded.

Optional: Guard the underlying close
 func (c *connTrackingConn) Close() error {
 	if atomic.CompareAndSwapInt32(&c.closed, 0, 1) {
 		atomic.AddInt64(c.activeConns, -1)
 		c.registry.Delete(c.id)
+		return c.Conn.Close()
 	}
-	return c.Conn.Close()
+	return nil // Already closed by CloseAll or previous Close call
 }

420-467: WebSocket close handshake reads only 2 bytes, which may not fully consume the client's close frame.

A WebSocket close frame from the client is at minimum 2 bytes (opcode + length) but typically 4+ bytes (with status code). Reading only 2 bytes leaves data in the buffer. While this doesn't affect correctness (the connection is about to close), reading a larger buffer would be more robust.

Suggested improvement
-	var buf [2]byte
+	var buf [128]byte // Accommodate full close frame with reason
 	_, readErr := tc.Conn.Read(buf[:])

Comment thread app.go
Comment on lines +1200 to +1207
// --- phase 3c: swap request-level parent context ---------------------------
// If the caller supplied a RequestContext, replace the app's shutdown
// context so that every subsequent c.Context() derivation inherits the
// caller's deadline. This lets handlers observe a concrete deadline via
// c.Deadline() in addition to the immediate Done() signal.
if cfg.RequestContext != nil {
app.shutdownCtx = cfg.RequestContext
}
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's check the current shutdownCtx usage in ctx.go
rg -n 'shutdownCtx' ctx.go -C 5

Repository: gofiber/fiber

Length of output: 1044


🏁 Script executed:

#!/bin/bash
# Check how Context() method is implemented in ctx.go
rg -n -A 10 'func.*Context\(\)' ctx.go

Repository: gofiber/fiber

Length of output: 481


🏁 Script executed:

#!/bin/bash
# Search for all shutdownCtx accesses across the codebase
rg -n 'shutdownCtx' --type go

Repository: gofiber/fiber

Length of output: 1493


🏁 Script executed:

#!/bin/bash
# Check mutex usage patterns around shutdownCtx
rg -n 'mu.*Lock\|shutdownCtx' app.go | head -50

Repository: gofiber/fiber

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Get more context around line 1200-1207 in app.go to see the function and mutex usage
sed -n '1180,1220p' app.go

Repository: gofiber/fiber

Length of output: 1850


🏁 Script executed:

#!/bin/bash
# Look for the function containing line 1206 and check for mutex locks
rg -n -B 30 'app.shutdownCtx = cfg.RequestContext' app.go | head -60

Repository: gofiber/fiber

Length of output: 1638


🏁 Script executed:

#!/bin/bash
# Find the function definition containing line 1206
sed -n '1140,1210p' app.go | head -80

Repository: gofiber/fiber

Length of output: 3073


🏁 Script executed:

#!/bin/bash
# Search for the function name around line 1206
rg -n -B 50 'app.shutdownCtx = cfg.RequestContext' app.go | grep 'func ' | tail -1

Repository: gofiber/fiber

Length of output: 145


Use atomic.Pointer[context.Context] for shutdownCtx to prevent data race.

The assignment at line 1206 is protected by app.mutex.Lock(), but the read in ctx.go line 151 (if parent := c.app.shutdownCtx; parent != nil) accesses shutdownCtx without acquiring the mutex. This creates a data race under Go's memory model: a guarded write paired with an unguarded read. Although the unsynchronized access is intentional (to make the swap immediately visible during shutdown), it should be protected using atomic.Pointer[context.Context] to ensure safe concurrent access.

🤖 Prompt for AI Agents
In `@app.go` around lines 1200 - 1207, The write/read of app.shutdownCtx must be
made atomic: change the field type from context.Context to
atomic.Pointer[context.Context] (importing sync/atomic and unsafe), update the
writer in the init/swap site (where cfg.RequestContext is assigned under
app.mutex.Lock()) to use
(*atomic.Pointer[context.Context]).Store((*context.Context)(unsafe.Pointer(&cfg.RequestContext)))
or equivalent atomic.StorePointer wrapping the cfg.RequestContext value, and
update reads (e.g., in ctx.go where code does "if parent := c.app.shutdownCtx;
parent != nil") to use atomic.LoadPointer and convert back to *context.Context
for nil-checking and use; ensure all places that previously accessed
app.shutdownCtx directly are replaced with atomic loads/stores so concurrent
access is race-free.
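A minimal, self-contained sketch of the suggested shape. atomic.Pointer[context.Context] stores a *context.Context, which sidesteps atomic.Value's same-concrete-type restriction; the field and method names below are illustrative, not the PR's API.

package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// server stands in for fiber's App, showing only the field under discussion.
type server struct {
	shutdownCtx atomic.Pointer[context.Context] // holds a *context.Context
}

// swapShutdownCtx is the writer side (ShutdownWithConfig phase 3c in the PR).
func (s *server) swapShutdownCtx(parent context.Context) {
	s.shutdownCtx.Store(&parent)
}

// currentShutdownCtx is the reader side (DefaultCtx.Context in the PR).
func (s *server) currentShutdownCtx() context.Context {
	if p := s.shutdownCtx.Load(); p != nil {
		return *p
	}
	return context.Background()
}

func main() {
	var s server
	fmt.Println(s.currentShutdownCtx() == context.Background()) // true: nothing stored yet

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s.swapShutdownCtx(ctx)
	fmt.Println(s.currentShutdownCtx() == ctx) // true: swap visible without a mutex on the read path
}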

Comment thread ctx.go
Comment on lines +149 to +154
// Return the live shutdown context without caching so that a
// RequestContext swap during shutdown is immediately visible.
if parent := c.app.shutdownCtx; parent != nil {
return parent
}
return context.Background()
Contributor


⚠️ Potential issue | 🟠 Major

Potential data race reading app.shutdownCtx without synchronization.

This read of app.shutdownCtx is unsynchronized with the write in app.go line 1206 (ShutdownWithConfig where RequestContext replaces it). While the mutex is held during the write, this read does not acquire the mutex.

This relates to the issue flagged in app.go — consider using an atomic.Pointer for shutdownCtx.

🤖 Prompt for AI Agents
In `@ctx.go` around lines 149 - 154, The read of c.app.shutdownCtx in ctx.go is a
potential data race against the write in ShutdownWithConfig (where
RequestContext replaces it); make shutdownCtx an atomic pointer (e.g.,
atomic.Pointer[context.Context] or atomic.Value) on the App struct and update
all accesses: replace direct reads of c.app.shutdownCtx with an atomic load
(e.g., shutdownCtx.Load()) and replace the write in ShutdownWithConfig with an
atomic store (shutdownCtx.Store(newCtx)), keeping the same nil-check semantics
so the function returns the loaded context if non-nil or context.Background()
otherwise.

Comment thread ctx.go
Comment on lines +177 to 185
// Err returns nil while the request context is still active. Once shutdown
// begins and the context is cancelled, Err returns ErrRequestShutdown so
// handlers can distinguish a shutdown cancellation from other errors.
func (c *DefaultCtx) Err() error {
if err := c.Context().Err(); err != nil {
return ErrRequestShutdown
}
return nil
}
Contributor


⚠️ Potential issue | 🟡 Minor

Err() masks the original context error, losing deadline information.

When Context().Err() returns context.DeadlineExceeded (e.g., from RequestDeadline), this method still returns ErrRequestShutdown. Handlers cannot distinguish between shutdown cancellation and per-request deadline expiration.

Consider preserving the original error or checking specifically for shutdown:

Suggested approach
 func (c *DefaultCtx) Err() error {
-	if err := c.Context().Err(); err != nil {
-		return ErrRequestShutdown
-	}
+	err := c.Context().Err()
+	if err == nil {
+		return nil
+	}
+	// If the app is shutting down, return the shutdown sentinel.
+	// Otherwise, return the actual context error (e.g., DeadlineExceeded).
+	if c.app.IsShuttingDown() {
+		return ErrRequestShutdown
+	}
+	return err
-	return nil
 }
🤖 Prompt for AI Agents
In `@ctx.go` around lines 177 - 185, DefaultCtx.Err currently replaces any context
error with ErrRequestShutdown, losing deadline/cancellation details; change Err
to fetch ctxErr := c.Context().Err() and return ctxErr unless the cancellation
is specifically due to server shutdown—i.e., detect shutdown (use the existing
shutdown signal/state available on DefaultCtx such as a shutdown flag, shutdown
channel or shutdown context) and only then return ErrRequestShutdown; reference
DefaultCtx.Err, c.Context().Err(), ErrRequestShutdown and RequestDeadline when
making the conditional mapping.

Comment thread docs/guide/shutdown.md
Comment on lines +50 to +54
tel := app.LastShutdownTelemetry()
if tel != nil {
log.Printf("Shutdown completed in %s (drained=%d, forced=%d)",
tel.TotalDuration, tel.DrainedConns, tel.ForcedConns)
}
Contributor


⚠️ Potential issue | 🟡 Minor

Telemetry access before shutdown completes will always return nil.

The quick start example accesses app.LastShutdownTelemetry() before ShutdownWithConfig is called. At this point, telemetry is always nil because it's only populated after shutdown completes. This code block should be removed or moved after the shutdown call.

Suggested fix
     go func() {
         <-quit
 
-        tel := app.LastShutdownTelemetry()
-        if tel != nil {
-            log.Printf("Shutdown completed in %s (drained=%d, forced=%d)",
-                tel.TotalDuration, tel.DrainedConns, tel.ForcedConns)
-        }
-
         ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
         defer cancel()
🤖 Prompt for AI Agents
In `@docs/guide/shutdown.md` around lines 50 - 54, The example reads telemetry too
early: calling app.LastShutdownTelemetry() before ShutdownWithConfig runs always
returns nil; move or remove that block so telemetry is only accessed after
ShutdownWithConfig completes (i.e., call LastShutdownTelemetry() after the
ShutdownWithConfig call and check for nil before logging), referencing the
LastShutdownTelemetry() and ShutdownWithConfig functions to locate and update
the example.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive and well-designed graceful shutdown mechanism, complete with detailed telemetry, a debug endpoint, and extensive documentation. While the implementation aims to improve server robustness, critical security and stability issues were identified. Specifically, there is a potential Denial of Service (DoS) in the shutdown process due to sequential connection closing in GracefulCloseTyped, and concurrency issues in ShutdownWithConfig that allow for multiple hook executions and a data race on the application's shutdown context. Additionally, there are a few suggestions to enhance API consistency and code clarity.

Comment thread listen.go
Comment on lines +397 to +418
func (l *connTrackingListener) GracefulCloseTyped(ctx context.Context, cfg *ShutdownConfig, tel *ShutdownTelemetry) {
l.conns.Range(func(_, value any) bool {
tc, ok := value.(*connTrackingConn)
if !ok {
return true
}

switch tc.ConnType() {
case ConnTypeWebSocket:
l.closeWebSocket(ctx, tc, cfg)
tel.WebSocketsClosed++
case ConnTypeSSE:
l.closeSSE(ctx, tc, cfg)
tel.SSEsClosed++
default:
// Plain HTTP — run cleanup hook if registered, but no
// protocol-specific close logic.
_ = tc.runCleanup()
}
return true
})
}
Contributor


security-high high

The GracefulCloseTyped function iterates over all tracked connections sequentially using sync.Map.Range. For each WebSocket or SSE connection, it performs network I/O and waits for a client response or a timeout (defaulting to 5s for WebSockets and 2s for SSE). If a server has a large number of such connections, the shutdown process will be extremely slow, potentially exceeding the system's termination grace period (e.g., in Kubernetes) and preventing post-shutdown hooks from running. An attacker can exploit this by opening many long-lived connections to delay or disrupt the shutdown process, leading to a Denial of Service (DoS) on the shutdown lifecycle.
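One possible mitigation, sketched here rather than proposed in the review: bound the number of concurrent close handshakes and fold the counters back in after waiting. Names follow the PR's unexported types and helpers (connTrackingListener, connTrackingConn, closeWebSocket, closeSSE, runCleanup), so this is illustrative only; it would sit in listen.go with sync and sync/atomic imported.

// Sketch: bounded-concurrency variant of GracefulCloseTyped. Each WS/SSE close
// handshake runs in its own goroutine, capped by a semaphore channel, so total
// shutdown time is roughly ceil(connections / maxParallel) * per-conn timeout
// instead of connections * per-conn timeout.
func (l *connTrackingListener) gracefulCloseTypedBounded(ctx context.Context, cfg *ShutdownConfig, tel *ShutdownTelemetry) {
	const maxParallel = 64
	sem := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	var wsClosed, sseClosed atomic.Int64 // local atomics; folded into tel after Wait

	l.conns.Range(func(_, value any) bool {
		tc, ok := value.(*connTrackingConn)
		if !ok {
			return true
		}
		wg.Add(1)
		sem <- struct{}{} // blocks while maxParallel closes are in flight
		go func() {
			defer wg.Done()
			defer func() { <-sem }()

			switch tc.ConnType() {
			case ConnTypeWebSocket:
				l.closeWebSocket(ctx, tc, cfg)
				wsClosed.Add(1)
			case ConnTypeSSE:
				l.closeSSE(ctx, tc, cfg)
				sseClosed.Add(1)
			default:
				_ = tc.runCleanup()
			}
		}()
		return true
	})

	wg.Wait()
	tel.WebSocketsClosed += int(wsClosed.Load())
	tel.SSEsClosed += int(sseClosed.Load())
}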

Comment thread app.go
Comment on lines +1161 to +1254
app.mutex.Lock()

if app.server == nil {
app.mutex.Unlock()
return ErrNotRunning
}

// Execute the Shutdown hook
tel := &ShutdownTelemetry{}
tel.StartedAt = time.Now()
tel.InitialConns = int(atomic.LoadInt64(&app.activeConns))

// --- phase 1: mark state & signal in-flight requests ----------------------
atomic.StoreInt32(&app.shuttingDown, 1)
if app.shutdownCancel != nil {
app.shutdownCancel()
}

// --- phase 2: evict idle keepalives ----------------------------------------
// A very small IdleTimeout causes fasthttp to close connections that are
// waiting for a new request on the next internal tick, without waiting for
// the (potentially unbounded) user-configured IdleTimeout.
app.server.IdleTimeout = time.Nanosecond

// --- phase 3: stop accepting new connections --------------------------------
// Closing the tracking listener immediately rejects every subsequent Dial /
// Accept. Connections already accepted continue to be served.
if app.trackingListener != nil {
_ = app.trackingListener.Close()
}

// --- phase 3b: apply per-request deadline -----------------------------------
// RequestDeadline adds an additional bound on top of the caller's context.
// The effective deadline is whichever expires first.
if cfg.RequestDeadline > 0 {
var cancel context.CancelFunc
ctx, cancel = context.WithTimeout(ctx, cfg.RequestDeadline)
defer cancel()
}

// --- phase 3c: swap request-level parent context ---------------------------
// If the caller supplied a RequestContext, replace the app's shutdown
// context so that every subsequent c.Context() derivation inherits the
// caller's deadline. This lets handlers observe a concrete deadline via
// c.Deadline() in addition to the immediate Done() signal.
if cfg.RequestContext != nil {
app.shutdownCtx = cfg.RequestContext
}

// --- phase 4: report initial state -----------------------------------------
if cfg.OnShutdownStart != nil {
cfg.OnShutdownStart(int(atomic.LoadInt64(&app.activeConns)))
}

// --- phase 5: pre-shutdown hooks --------------------------------------------
preStart := time.Now()
app.hooks.executeOnPreShutdownHooks()
defer app.hooks.executeOnPostShutdownHooks(err)
tel.PreHooksDuration = time.Since(preStart)

// --- phase 5b: graceful close of typed (WS/SSE) connections ----------------
// Only connections explicitly marked as WebSocket or SSE are touched.
// Plain HTTP connections proceed to the normal drain / force-close path.
gcStart := time.Now()
if app.trackingListener != nil {
app.trackingListener.GracefulCloseTyped(ctx, &cfg, tel)
}
tel.GracefulCloseDuration = time.Since(gcStart)

// --- phase 6: start drain monitor ------------------------------------------
interval := cfg.DrainInterval
if interval <= 0 {
interval = 500 * time.Millisecond
}

drainDone := make(chan struct{})
startedAt := time.Now()

go func() {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
select {
case <-drainDone:
return
case <-ticker.C:
if cfg.OnDrainProgress != nil {
cfg.OnDrainProgress(int(atomic.LoadInt64(&app.activeConns)), time.Since(startedAt))
}
}
}
}()

// Release the mutex before blocking so ActiveConnections / IsShuttingDown
// remain callable from other goroutines (e.g. health checks).
app.mutex.Unlock()
Contributor


security-medium medium

The ShutdownWithConfig function releases the app.mutex before the connection draining process is complete. This allows multiple concurrent calls to ShutdownWithConfig if not externally synchronized. Concurrent calls will result in multiple executions of pre-shutdown hooks, multiple drain monitor goroutines, and multiple executions of post-shutdown hooks. This violates the "exactly once" guarantee for hooks and can lead to unexpected behavior or crashes if hooks are not idempotent, potentially causing a Denial of Service during shutdown.
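A sketch of one way to restore the exactly-once guarantee without holding the mutex across the drain: funnel every caller through sync.Once and a done channel. The fields (shutdownOnce, shutdownDone, shutdownErr) and the renamed inner method doShutdownWithConfig are hypothetical, not part of the PR; shutdownDone is assumed to be initialized to make(chan struct{}) in New().

// Sketch only: single-flight wrapper around the existing shutdown body.
// Later callers block until the first shutdown completes and observe its
// error; pre/post hooks and the drain monitor therefore run exactly once.
func (app *App) shutdownSingleFlight(ctx context.Context, cfg ShutdownConfig) error {
	app.shutdownOnce.Do(func() {
		defer close(app.shutdownDone) // runs after shutdownErr is assigned
		app.shutdownErr = app.doShutdownWithConfig(ctx, cfg)
	})
	<-app.shutdownDone // every caller waits for the one real shutdown
	return app.shutdownErr
}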

Comment thread app.go
Comment on lines +1205 to +1207
if cfg.RequestContext != nil {
app.shutdownCtx = cfg.RequestContext
}
Contributor


security-medium medium

Swapping the application's shutdownCtx mid-shutdown in ShutdownWithConfig creates a data race with request handlers that access this field via DefaultCtx.Context(). Since shutdownCtx is a standard interface and not protected by synchronization primitives during these accesses, concurrent read/write operations can lead to "torn reads," resulting in application panics and Denial of Service.

Comment thread app.go
DrainedConns int `json:"drainedConns"`
ForcedConns int `json:"forcedConns"`
WebSocketsClosed int `json:"webSocketsClosed"`
SSEsClosed int `json:"sseClosed"`
Contributor


medium

The JSON tag sseClosed is inconsistent with the Go field name SSEsClosed and the pattern used for WebSocketsClosed (webSocketsClosed). For consistency, it would be better to change the JSON tag to ssesClosed.

Suggested change
SSEsClosed int `json:"sseClosed"`
SSEsClosed int `json:"ssesClosed"`

Comment thread app.go
Comment on lines +1377 to +1393
return func(c Ctx) error {
status := "running"
if app.IsShuttingDown() {
if app.lastTelemetry.Load() != nil {
status = "shutdown"
} else {
status = "shutting_down"
}
}

resp := shutdownDebugResponse{
Status: status,
ActiveConnections: app.ActiveConnections(),
LastShutdown: telemetryToJSON(app.lastTelemetry.Load()),
}
return c.JSON(resp)
}
Contributor


medium

The app.lastTelemetry.Load() call is performed twice: once to determine the status and again to populate the response. This is slightly inefficient and could lead to a subtle race condition where the telemetry data changes between the two calls. It's better to load the value once into a local variable and reuse it.

	return func(c Ctx) error {
		status := "running"
		lastTel := app.lastTelemetry.Load()
		if app.IsShuttingDown() {
			if lastTel != nil {
				status = "shutdown"
			} else {
				status = "shutting_down"
			}
		}

		resp := shutdownDebugResponse{
			Status:            status,
			ActiveConnections: app.ActiveConnections(),
			LastShutdown:      telemetryToJSON(lastTel),
		}
		return c.JSON(resp)
	}

Comment thread docs/guide/shutdown.md
"drainedConns": 12,
"forcedConns": 3,
"webSocketsClosed": 4,
"sseClosed": 1,
Contributor


medium

To align with the suggested code change for API consistency, please update the JSON key in this example from sseClosed to ssesClosed.

Suggested change
"sseClosed": 1,
"ssesClosed": 1,

@ReneWerner87
Member

The code is too dependent on telemetry.
The core should be free of external tools and run with all tools.

Can you make the adjustment so that it requires few adjustments to the core and is generic, so that other external tools, plugins, etc. can use it?

@gaby
Member

gaby commented Jan 28, 2026

@zishanjawed What issue is this related to? I'm confused as to why this PR was created without first creating an issue and discussing it with the team.

