feat: ShutdownTelemetry, debug endpoint, goroutine-leak tests, and deployment docs#4038
feat: ShutdownTelemetry, debug endpoint, goroutine-leak tests, and deployment docs#4038zishanjawed wants to merge 1 commit intogofiber:mainfrom
Conversation
Adds per-phase timing and connection-disposition metrics collected during ShutdownWithConfig, exposed through both a programmatic accessor and a JSON debug handler. Includes comprehensive documentation covering configuration, Kubernetes deployment patterns, load-balancer integration, hook ordering, and a troubleshooting guide. Key additions: - ShutdownTelemetry struct capturing timing (drain, hooks, graceful-close) and counts (drained, forced, WS/SSE closed) - app.LastShutdownTelemetry() accessor (atomic read of last snapshot) - app.ShutdownDebugHandler() returning JSON with status + telemetry - GracefulCloseTyped now increments WS/SSE counters on the telemetry - 6 feature tests + 3 goroutine-leak tests - docs/guide/shutdown.md: full reference including K8s manifests, LB drain patterns, hook examples, and troubleshooting Co-Authored-By: Claude (penguin) <[email protected]>
WalkthroughIntroduces a comprehensive graceful shutdown system with connection tracking, protocol-aware close logic (WebSocket/SSE), configurable lifecycle hooks, drain workflow, and telemetry collection. Adds new types for tracking connections, shutdown configuration, and shutdown metrics. Updates context methods to propagate shutdown state and adds three new error sentinels for shutdown and protocol scenarios. Changes
Sequence DiagramsequenceDiagram
autonumber
participant App as App
participant Listener as Listener
participant Drain as Drain Monitor
participant Conns as Active Connections
participant WS as WebSocket
participant SSE as SSE
App->>App: ShutdownWithContext() called
App->>App: Emit OnShutdownStart(activeConns)
App->>Drain: Start drain phase with DrainInterval
Drain->>Conns: Poll remaining connections
Drain->>App: Emit OnDrainProgress(remaining, elapsed)
Conns-->>Drain: Return count
rect rgba(100, 200, 100, 0.5)
Note over App,SSE: Graceful Close Phase
App->>WS: Close WebSocket connections
WS->>WS: Send close frame
WS->>WS: Wait for client ack (timeout)
WS-->>App: OnWebSocketClose callback
App->>SSE: Close SSE connections
SSE->>SSE: Write shutdown event
SSE->>SSE: Wait for disconnect
SSE-->>App: OnSSEClose callback
end
App->>App: OnForceClose if timeout exceeded
App->>Conns: CloseAll remaining
rect rgba(100, 100, 200, 0.5)
Note over App: Post-Shutdown Hooks
App->>App: Execute OnPostShutdown
App->>App: Populate ShutdownTelemetry
end
App-->>Listener: Graceful shutdown complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes The changes introduce intricate logic across multiple domains: atomic connection tracking with protocol-aware graceful close sequences, context propagation with lifecycle awareness, configurable multi-phase shutdown with hooks and telemetry collection, and handler-specific cleanup for WebSocket/SSE. The heterogeneous nature (spanning app lifecycle, connection management, context semantics, and error handling) and density of new types/methods (ShutdownConfig, ShutdownTelemetry, TrackedConn, protocol enums) demand careful verification of correctness, edge cases, and integration points. Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @zishanjawed, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a comprehensive graceful shutdown mechanism with detailed telemetry, a debug endpoint, and extensive documentation. It enhances the application's ability to shut down gracefully, providing insights into the process and ensuring no resource leaks. The changes include a new Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@app.go`:
- Around line 1200-1207: The write/read of app.shutdownCtx must be made atomic:
change the field type from context.Context to atomic.Pointer[context.Context]
(importing sync/atomic and unsafe), update the writer in the init/swap site
(where cfg.RequestContext is assigned under app.mutex.Lock()) to use
(*atomic.Pointer[context.Context]).Store((*context.Context)(unsafe.Pointer(&cfg.RequestContext)))
or equivalent atomic.StorePointer wrapping the cfg.RequestContext value, and
update reads (e.g., in ctx.go where code does "if parent := c.app.shutdownCtx;
parent != nil") to use atomic.LoadPointer and convert back to *context.Context
for nil-checking and use; ensure all places that previously accessed
app.shutdownCtx directly are replaced with atomic loads/stores so concurrent
access is race-free.
In `@ctx.go`:
- Around line 149-154: The read of c.app.shutdownCtx in ctx.go is a potential
data race against the write in ShutdownWithConfig (where RequestContext replaces
it); make shutdownCtx an atomic pointer (e.g., atomic.Pointer[context.Context]
or atomic.Value) on the App struct and update all accesses: replace direct reads
of c.app.shutdownCtx with an atomic load (e.g., shutdownCtx.Load()) and replace
the write in ShutdownWithConfig with an atomic store
(shutdownCtx.Store(newCtx)), keeping the same nil-check semantics so the
function returns the loaded context if non-nil or context.Background()
otherwise.
- Around line 177-185: DefaultCtx.Err currently replaces any context error with
ErrRequestShutdown, losing deadline/cancellation details; change Err to fetch
ctxErr := c.Context().Err() and return ctxErr unless the cancellation is
specifically due to server shutdown—i.e., detect shutdown (use the existing
shutdown signal/state available on DefaultCtx such as a shutdown flag, shutdown
channel or shutdown context) and only then return ErrRequestShutdown; reference
DefaultCtx.Err, c.Context().Err(), ErrRequestShutdown and RequestDeadline when
making the conditional mapping.
In `@docs/guide/shutdown.md`:
- Around line 50-54: The example reads telemetry too early: calling
app.LastShutdownTelemetry() before ShutdownWithConfig runs always returns nil;
move or remove that block so telemetry is only accessed after ShutdownWithConfig
completes (i.e., call LastShutdownTelemetry() after the ShutdownWithConfig call
and check for nil before logging), referencing the LastShutdownTelemetry() and
ShutdownWithConfig functions to locate and update the example.
🧹 Nitpick comments (6)
docs/guide/shutdown.md (3)
266-268: Add language specifier to fenced code block.The static analysis tool flagged this code block as missing a language specifier (MD040). Consider adding
textor an appropriate language identifier.Suggested fix
-``` +```text OnShutdownStart → pre-hooks → GracefulCloseTyped → drain → force-close → post-hooks → return</details> --- `487-495`: **Add language specifier to fenced code block.** This ASCII diagram code block is missing a language specifier (MD040). Consider adding `text` for plain text content. <details> <summary>Suggested fix</summary> ```diff -``` +```text ┌─────────────┐ stop routing ┌────────────────┐ drain timeout ┌──────────┐ │ Load Balancer│ ─────────────> │ Fiber Instance │ ──────────────> │ Done │
560-562: Add language specifier to fenced code block.This code block showing the timeout relationship is missing a language specifier (MD040).
Suggested fix
-``` +```text terminationGracePeriodSeconds < preStop + ShutdownTimeout + hook_time</details> </blockquote></details> <details> <summary>app.go (1)</summary><blockquote> `1376-1394`: **Minor: Double atomic load of `lastTelemetry`.** The handler loads `lastTelemetry` twice (lines 1380 and 1390). While functionally correct due to atomic guarantees, a single load would be slightly more efficient. <details> <summary>Suggested optimization</summary> ```diff func (app *App) ShutdownDebugHandler() Handler { return func(c Ctx) error { status := "running" + tel := app.lastTelemetry.Load() if app.IsShuttingDown() { - if app.lastTelemetry.Load() != nil { + if tel != nil { status = "shutdown" } else { status = "shutting_down" } } resp := shutdownDebugResponse{ Status: status, ActiveConnections: app.ActiveConnections(), - LastShutdown: telemetryToJSON(app.lastTelemetry.Load()), + LastShutdown: telemetryToJSON(tel), } return c.JSON(resp) } }listen.go (2)
380-389:Close()may call underlyingConn.Close()multiple times.When
CloseAll()wins the CAS race, it callstc.Conn.Close(). If the handler subsequently callstc.Close(), the CAS fails butc.Conn.Close()is still called (line 388). While double-close onnet.Connis generally safe (returns an error on second call), this behavior should be documented or guarded.Optional: Guard the underlying close
func (c *connTrackingConn) Close() error { if atomic.CompareAndSwapInt32(&c.closed, 0, 1) { atomic.AddInt64(c.activeConns, -1) c.registry.Delete(c.id) + return c.Conn.Close() } - return c.Conn.Close() + return nil // Already closed by CloseAll or previous Close call }
420-467: WebSocket close handshake reads only 2 bytes, which may not fully consume the client's close frame.A WebSocket close frame from the client is at minimum 2 bytes (opcode + length) but typically 4+ bytes (with status code). Reading only 2 bytes leaves data in the buffer. While this doesn't affect correctness (the connection is about to close), reading a larger buffer would be more robust.
Suggested improvement
- var buf [2]byte + var buf [128]byte // Accommodate full close frame with reason _, readErr := tc.Conn.Read(buf[:])
| // --- phase 3c: swap request-level parent context --------------------------- | ||
| // If the caller supplied a RequestContext, replace the app's shutdown | ||
| // context so that every subsequent c.Context() derivation inherits the | ||
| // caller's deadline. This lets handlers observe a concrete deadline via | ||
| // c.Deadline() in addition to the immediate Done() signal. | ||
| if cfg.RequestContext != nil { | ||
| app.shutdownCtx = cfg.RequestContext | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's check the current shutdownCtx usage in ctx.go
rg -n 'shutdownCtx' ctx.go -C 5Repository: gofiber/fiber
Length of output: 1044
🏁 Script executed:
#!/bin/bash
# Check how Context() method is implemented in ctx.go
rg -n -A 10 'func.*Context\(\)' ctx.goRepository: gofiber/fiber
Length of output: 481
🏁 Script executed:
#!/bin/bash
# Search for all shutdownCtx accesses across the codebase
rg -n 'shutdownCtx' --type goRepository: gofiber/fiber
Length of output: 1493
🏁 Script executed:
#!/bin/bash
# Check mutex usage patterns around shutdownCtx
rg -n 'mu.*Lock\|shutdownCtx' app.go | head -50Repository: gofiber/fiber
Length of output: 39
🏁 Script executed:
#!/bin/bash
# Get more context around line 1200-1207 in app.go to see the function and mutex usage
sed -n '1180,1220p' app.goRepository: gofiber/fiber
Length of output: 1850
🏁 Script executed:
#!/bin/bash
# Look for the function containing line 1206 and check for mutex locks
rg -n -B 30 'app.shutdownCtx = cfg.RequestContext' app.go | head -60Repository: gofiber/fiber
Length of output: 1638
🏁 Script executed:
#!/bin/bash
# Find the function definition containing line 1206
sed -n '1140,1210p' app.go | head -80Repository: gofiber/fiber
Length of output: 3073
🏁 Script executed:
#!/bin/bash
# Search for the function name around line 1206
rg -n -B 50 'app.shutdownCtx = cfg.RequestContext' app.go | grep 'func ' | tail -1Repository: gofiber/fiber
Length of output: 145
Use atomic.Pointer[context.Context] for shutdownCtx to prevent data race.
The assignment at line 1206 is protected by app.mutex.Lock(), but the read in ctx.go line 151 (if parent := c.app.shutdownCtx; parent != nil) accesses shutdownCtx without acquiring the mutex. This creates a data race under Go's memory model: a guarded write paired with an unguarded read. Although the unsynchronized access is intentional (to make the swap immediately visible during shutdown), it should be protected using atomic.Pointer[context.Context] to ensure safe concurrent access.
🤖 Prompt for AI Agents
In `@app.go` around lines 1200 - 1207, The write/read of app.shutdownCtx must be
made atomic: change the field type from context.Context to
atomic.Pointer[context.Context] (importing sync/atomic and unsafe), update the
writer in the init/swap site (where cfg.RequestContext is assigned under
app.mutex.Lock()) to use
(*atomic.Pointer[context.Context]).Store((*context.Context)(unsafe.Pointer(&cfg.RequestContext)))
or equivalent atomic.StorePointer wrapping the cfg.RequestContext value, and
update reads (e.g., in ctx.go where code does "if parent := c.app.shutdownCtx;
parent != nil") to use atomic.LoadPointer and convert back to *context.Context
for nil-checking and use; ensure all places that previously accessed
app.shutdownCtx directly are replaced with atomic loads/stores so concurrent
access is race-free.
| // Return the live shutdown context without caching so that a | ||
| // RequestContext swap during shutdown is immediately visible. | ||
| if parent := c.app.shutdownCtx; parent != nil { | ||
| return parent | ||
| } | ||
| return context.Background() |
There was a problem hiding this comment.
Potential data race reading app.shutdownCtx without synchronization.
This read of app.shutdownCtx is unsynchronized with the write in app.go line 1206 (ShutdownWithConfig where RequestContext replaces it). While the mutex is held during the write, this read does not acquire the mutex.
This relates to the issue flagged in app.go — consider using an atomic.Pointer for shutdownCtx.
🤖 Prompt for AI Agents
In `@ctx.go` around lines 149 - 154, The read of c.app.shutdownCtx in ctx.go is a
potential data race against the write in ShutdownWithConfig (where
RequestContext replaces it); make shutdownCtx an atomic pointer (e.g.,
atomic.Pointer[context.Context] or atomic.Value) on the App struct and update
all accesses: replace direct reads of c.app.shutdownCtx with an atomic load
(e.g., shutdownCtx.Load()) and replace the write in ShutdownWithConfig with an
atomic store (shutdownCtx.Store(newCtx)), keeping the same nil-check semantics
so the function returns the loaded context if non-nil or context.Background()
otherwise.
| // Err returns nil while the request context is still active. Once shutdown | ||
| // begins and the context is cancelled, Err returns ErrRequestShutdown so | ||
| // handlers can distinguish a shutdown cancellation from other errors. | ||
| func (c *DefaultCtx) Err() error { | ||
| if err := c.Context().Err(); err != nil { | ||
| return ErrRequestShutdown | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Err() masks the original context error, losing deadline information.
When Context().Err() returns context.DeadlineExceeded (e.g., from RequestDeadline), this method still returns ErrRequestShutdown. Handlers cannot distinguish between shutdown cancellation and per-request deadline expiration.
Consider preserving the original error or checking specifically for shutdown:
Suggested approach
func (c *DefaultCtx) Err() error {
- if err := c.Context().Err(); err != nil {
- return ErrRequestShutdown
- }
+ err := c.Context().Err()
+ if err == nil {
+ return nil
+ }
+ // If the app is shutting down, return the shutdown sentinel.
+ // Otherwise, return the actual context error (e.g., DeadlineExceeded).
+ if c.app.IsShuttingDown() {
+ return ErrRequestShutdown
+ }
+ return err
return nil
}🤖 Prompt for AI Agents
In `@ctx.go` around lines 177 - 185, DefaultCtx.Err currently replaces any context
error with ErrRequestShutdown, losing deadline/cancellation details; change Err
to fetch ctxErr := c.Context().Err() and return ctxErr unless the cancellation
is specifically due to server shutdown—i.e., detect shutdown (use the existing
shutdown signal/state available on DefaultCtx such as a shutdown flag, shutdown
channel or shutdown context) and only then return ErrRequestShutdown; reference
DefaultCtx.Err, c.Context().Err(), ErrRequestShutdown and RequestDeadline when
making the conditional mapping.
| tel := app.LastShutdownTelemetry() | ||
| if tel != nil { | ||
| log.Printf("Shutdown completed in %s (drained=%d, forced=%d)", | ||
| tel.TotalDuration, tel.DrainedConns, tel.ForcedConns) | ||
| } |
There was a problem hiding this comment.
Telemetry access before shutdown completes will always return nil.
The quick start example accesses app.LastShutdownTelemetry() before ShutdownWithConfig is called. At this point, telemetry is always nil because it's only populated after shutdown completes. This code block should be removed or moved after the shutdown call.
Suggested fix
go func() {
<-quit
- tel := app.LastShutdownTelemetry()
- if tel != nil {
- log.Printf("Shutdown completed in %s (drained=%d, forced=%d)",
- tel.TotalDuration, tel.DrainedConns, tel.ForcedConns)
- }
-
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tel := app.LastShutdownTelemetry() | |
| if tel != nil { | |
| log.Printf("Shutdown completed in %s (drained=%d, forced=%d)", | |
| tel.TotalDuration, tel.DrainedConns, tel.ForcedConns) | |
| } | |
| go func() { | |
| <-quit | |
| ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) | |
| defer cancel() |
🤖 Prompt for AI Agents
In `@docs/guide/shutdown.md` around lines 50 - 54, The example reads telemetry too
early: calling app.LastShutdownTelemetry() before ShutdownWithConfig runs always
returns nil; move or remove that block so telemetry is only accessed after
ShutdownWithConfig completes (i.e., call LastShutdownTelemetry() after the
ShutdownWithConfig call and check for nil before logging), referencing the
LastShutdownTelemetry() and ShutdownWithConfig functions to locate and update
the example.
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive and well-designed graceful shutdown mechanism, complete with detailed telemetry, a debug endpoint, and extensive documentation. While the implementation aims to improve server robustness, critical security and stability issues were identified. Specifically, there is a potential Denial of Service (DoS) in the shutdown process due to sequential connection closing in GracefulCloseTyped, and concurrency issues in ShutdownWithConfig that allow for multiple hook executions and a data race on the application's shutdown context. Additionally, there are a few suggestions to enhance API consistency and code clarity.
| func (l *connTrackingListener) GracefulCloseTyped(ctx context.Context, cfg *ShutdownConfig, tel *ShutdownTelemetry) { | ||
| l.conns.Range(func(_, value any) bool { | ||
| tc, ok := value.(*connTrackingConn) | ||
| if !ok { | ||
| return true | ||
| } | ||
|
|
||
| switch tc.ConnType() { | ||
| case ConnTypeWebSocket: | ||
| l.closeWebSocket(ctx, tc, cfg) | ||
| tel.WebSocketsClosed++ | ||
| case ConnTypeSSE: | ||
| l.closeSSE(ctx, tc, cfg) | ||
| tel.SSEsClosed++ | ||
| default: | ||
| // Plain HTTP — run cleanup hook if registered, but no | ||
| // protocol-specific close logic. | ||
| _ = tc.runCleanup() | ||
| } | ||
| return true | ||
| }) | ||
| } |
There was a problem hiding this comment.
The GracefulCloseTyped function iterates over all tracked connections sequentially using sync.Map.Range. For each WebSocket or SSE connection, it performs network I/O and waits for a client response or a timeout (defaulting to 5s for WebSockets and 2s for SSE). If a server has a large number of such connections, the shutdown process will be extremely slow, potentially exceeding the system's termination grace period (e.g., in Kubernetes) and preventing post-shutdown hooks from running. An attacker can exploit this by opening many long-lived connections to delay or disrupt the shutdown process, leading to a Denial of Service (DoS) on the shutdown lifecycle.
| app.mutex.Lock() | ||
|
|
||
| if app.server == nil { | ||
| app.mutex.Unlock() | ||
| return ErrNotRunning | ||
| } | ||
|
|
||
| // Execute the Shutdown hook | ||
| tel := &ShutdownTelemetry{} | ||
| tel.StartedAt = time.Now() | ||
| tel.InitialConns = int(atomic.LoadInt64(&app.activeConns)) | ||
|
|
||
| // --- phase 1: mark state & signal in-flight requests ---------------------- | ||
| atomic.StoreInt32(&app.shuttingDown, 1) | ||
| if app.shutdownCancel != nil { | ||
| app.shutdownCancel() | ||
| } | ||
|
|
||
| // --- phase 2: evict idle keepalives ---------------------------------------- | ||
| // A very small IdleTimeout causes fasthttp to close connections that are | ||
| // waiting for a new request on the next internal tick, without waiting for | ||
| // the (potentially unbounded) user-configured IdleTimeout. | ||
| app.server.IdleTimeout = time.Nanosecond | ||
|
|
||
| // --- phase 3: stop accepting new connections -------------------------------- | ||
| // Closing the tracking listener immediately rejects every subsequent Dial / | ||
| // Accept. Connections already accepted continue to be served. | ||
| if app.trackingListener != nil { | ||
| _ = app.trackingListener.Close() | ||
| } | ||
|
|
||
| // --- phase 3b: apply per-request deadline ----------------------------------- | ||
| // RequestDeadline adds an additional bound on top of the caller's context. | ||
| // The effective deadline is whichever expires first. | ||
| if cfg.RequestDeadline > 0 { | ||
| var cancel context.CancelFunc | ||
| ctx, cancel = context.WithTimeout(ctx, cfg.RequestDeadline) | ||
| defer cancel() | ||
| } | ||
|
|
||
| // --- phase 3c: swap request-level parent context --------------------------- | ||
| // If the caller supplied a RequestContext, replace the app's shutdown | ||
| // context so that every subsequent c.Context() derivation inherits the | ||
| // caller's deadline. This lets handlers observe a concrete deadline via | ||
| // c.Deadline() in addition to the immediate Done() signal. | ||
| if cfg.RequestContext != nil { | ||
| app.shutdownCtx = cfg.RequestContext | ||
| } | ||
|
|
||
| // --- phase 4: report initial state ----------------------------------------- | ||
| if cfg.OnShutdownStart != nil { | ||
| cfg.OnShutdownStart(int(atomic.LoadInt64(&app.activeConns))) | ||
| } | ||
|
|
||
| // --- phase 5: pre-shutdown hooks -------------------------------------------- | ||
| preStart := time.Now() | ||
| app.hooks.executeOnPreShutdownHooks() | ||
| defer app.hooks.executeOnPostShutdownHooks(err) | ||
| tel.PreHooksDuration = time.Since(preStart) | ||
|
|
||
| // --- phase 5b: graceful close of typed (WS/SSE) connections ---------------- | ||
| // Only connections explicitly marked as WebSocket or SSE are touched. | ||
| // Plain HTTP connections proceed to the normal drain / force-close path. | ||
| gcStart := time.Now() | ||
| if app.trackingListener != nil { | ||
| app.trackingListener.GracefulCloseTyped(ctx, &cfg, tel) | ||
| } | ||
| tel.GracefulCloseDuration = time.Since(gcStart) | ||
|
|
||
| // --- phase 6: start drain monitor ------------------------------------------ | ||
| interval := cfg.DrainInterval | ||
| if interval <= 0 { | ||
| interval = 500 * time.Millisecond | ||
| } | ||
|
|
||
| drainDone := make(chan struct{}) | ||
| startedAt := time.Now() | ||
|
|
||
| go func() { | ||
| ticker := time.NewTicker(interval) | ||
| defer ticker.Stop() | ||
| for { | ||
| select { | ||
| case <-drainDone: | ||
| return | ||
| case <-ticker.C: | ||
| if cfg.OnDrainProgress != nil { | ||
| cfg.OnDrainProgress(int(atomic.LoadInt64(&app.activeConns)), time.Since(startedAt)) | ||
| } | ||
| } | ||
| } | ||
| }() | ||
|
|
||
| // Release the mutex before blocking so ActiveConnections / IsShuttingDown | ||
| // remain callable from other goroutines (e.g. health checks). | ||
| app.mutex.Unlock() |
There was a problem hiding this comment.
The ShutdownWithConfig function releases the app.mutex before the connection draining process is complete. This allows multiple concurrent calls to ShutdownWithConfig if not externally synchronized. Concurrent calls will result in multiple executions of pre-shutdown hooks, multiple drain monitor goroutines, and multiple executions of post-shutdown hooks. This violates the "exactly once" guarantee for hooks and can lead to unexpected behavior or crashes if hooks are not idempotent, potentially causing a Denial of Service during shutdown.
| if cfg.RequestContext != nil { | ||
| app.shutdownCtx = cfg.RequestContext | ||
| } |
There was a problem hiding this comment.
Swapping the application's shutdownCtx mid-shutdown in ShutdownWithConfig creates a data race with request handlers that access this field via DefaultCtx.Context(). Since shutdownCtx is a standard interface and not protected by synchronization primitives during these accesses, concurrent read/write operations can lead to "torn reads," resulting in application panics and Denial of Service.
| DrainedConns int `json:"drainedConns"` | ||
| ForcedConns int `json:"forcedConns"` | ||
| WebSocketsClosed int `json:"webSocketsClosed"` | ||
| SSEsClosed int `json:"sseClosed"` |
There was a problem hiding this comment.
| return func(c Ctx) error { | ||
| status := "running" | ||
| if app.IsShuttingDown() { | ||
| if app.lastTelemetry.Load() != nil { | ||
| status = "shutdown" | ||
| } else { | ||
| status = "shutting_down" | ||
| } | ||
| } | ||
|
|
||
| resp := shutdownDebugResponse{ | ||
| Status: status, | ||
| ActiveConnections: app.ActiveConnections(), | ||
| LastShutdown: telemetryToJSON(app.lastTelemetry.Load()), | ||
| } | ||
| return c.JSON(resp) | ||
| } |
There was a problem hiding this comment.
The app.lastTelemetry.Load() call is performed twice: once to determine the status and again to populate the response. This is slightly inefficient and could lead to a subtle race condition where the telemetry data changes between the two calls. It's better to load the value once into a local variable and reuse it.
return func(c Ctx) error {
status := "running"
lastTel := app.lastTelemetry.Load()
if app.IsShuttingDown() {
if lastTel != nil {
status = "shutdown"
} else {
status = "shutting_down"
}
}
resp := shutdownDebugResponse{
Status: status,
ActiveConnections: app.ActiveConnections(),
LastShutdown: telemetryToJSON(lastTel),
}
return c.JSON(resp)
}| "drainedConns": 12, | ||
| "forcedConns": 3, | ||
| "webSocketsClosed": 4, | ||
| "sseClosed": 1, |
|
The code is too dependent on telemetry. Can you make the adjustment so that it requires few adjustments to the core and is generic, so that other external tools, plugins, etc. can use it? |
|
@zishanjawed What issue is this related to? I'm confused as to why this PR was created without first creating an issue and discussing it with the team. |
Summary
ShutdownTelemetrystruct — captures per-phase wall-clock timing (PreHooksDuration,GracefulCloseDuration,DrainDuration,PostHooksDuration,TotalDuration) and connection-disposition counts (InitialConns,DrainedConns,ForcedConns,WebSocketsClosed,SSEsClosed,TimedOut).app.LastShutdownTelemetry()— lock-free accessor viaatomic.Pointer; returnsnilbefore the firstShutdownWithConfigcompletes, or a fully-populated snapshot afterward.app.ShutdownDebugHandler()— returns aHandlerthat responds with JSON containingstatus(running/shutting_down/shutdown),activeConnections, and the last telemetry snapshot with human-readable duration strings. Register at any path:app.Get("/debug/shutdown", app.ShutdownDebugHandler()).GracefulCloseTyped— extended to accept*ShutdownTelemetryand incrementWebSocketsClosed/SSEsClosedcounters sequentially inside thesync.Map.Range(no additional synchronization needed).Tests (9 new)
Test_Telemetry_CleanDrain_PopulatesAllFieldsDrainedConns == InitialConns;ForcedConns == 0;TimedOut == falseTest_Telemetry_ForcedClose_CountsCorrectlyForcedConns >= 1;TimedOut == trueTest_Telemetry_HookDurations_MeasuredPreHooksDuration >= 50 msTest_Telemetry_WebSocketSSE_CountersWebSocketsClosed == 1,SSEsClosed == 1Test_DebugEndpoint_ReturnsRunningStatusstatus == "running",lastShutdownisnullTest_DebugEndpoint_ReturnsTelemetryAfterShutdownstatus == "shutdown", all duration strings parseable bytime.ParseDurationTest_Telemetry_NoGoroutineLeak_CleanShutdownTest_Telemetry_NoGoroutineLeak_ForcedCloseTest_Telemetry_NoGoroutineLeak_DebugHandlerDocumentation (
docs/guide/shutdown.md)Full reference covering:
ShutdownConfigfield reference with defaultsShutdownTelemetrystruct and accessorOnPreShutdown,OnPostShutdown, per-connectionSetCleanupHook, multi-hook registration orderterminationGracePeriodSeconds,preStopsleep, liveness/readiness probes), full Go application wiring, alignment rules tableshutting_downRace-detector note
The race detector flags
app.server.IdleTimeout = time.Nanosecond(phase 2 in bothShutdownWithContextandShutdownWithConfig) racing with fasthttp'sserveConnread of the same field. This is a pre-existing race in the upstream codebase — confirmed by running-raceagainstTest_ShutdownWithConfig_OnShutdownStart(a test we did not modify). Our telemetry instrumentation introduces zero new data races.Test plan
🤖 Generated with Claude Code