Skip to content

Fix: Prevent CrashLoopBackOff during image upgrade with RollingUpdate (#1926)#1956

Merged
alex-zaitsev merged 1 commit intoAltinity:0.27.0from
dashashutosh80:0.26.3
Apr 15, 2026
Merged

Fix: Prevent CrashLoopBackOff during image upgrade with RollingUpdate (#1926)#1956
alex-zaitsev merged 1 commit intoAltinity:0.27.0from
dashashutosh80:0.26.3

Conversation

@dashashutosh80
Copy link
Copy Markdown
Contributor

@dashashutosh80 dashashutosh80 commented Apr 15, 2026

Fix: Prevent CrashLoopBackOff during image upgrade with RollingUpdate

Description

This PR fixes issue #1926 where upgrading ClickHouse version while simultaneously adding version-specific settings causes pods to enter CrashLoopBackOff when using spec.restart: RollingUpdate.

Problem

When upgrading ClickHouse and adding version-specific configuration simultaneously:

  1. Operator updates ConfigMaps with new version-specific settings
  2. Operator detects spec.restart: RollingUpdate and attempts forced restart via SYSTEM SHUTDOWN
  3. Pod with old image restarts with new (incompatible) configuration
  4. Pod crashes → CrashLoopBackOff

Example: Upgrading from 25.3 to 25.8 while adding write_marks_for_substreams_in_compact_parts (only valid in 25.8+) causes 25.3 pods to crash.

Root Cause

In pkg/controller/chi/worker.go, the shouldForceRestartHost() function checks IsRollingUpdate() before checking for image changes:

case host.GetCR().IsRollingUpdate():  // Checked first
    return true  // Always forces restart

case model.IsConfigurationChangeRequiresReboot(host):
    if w.isImageChangeRequested(host) {  // Never reached!
        return false  // The existing fix that gets bypassed
    }
    return true

The v0.26.2 fix (commit e2e8e16de) added image change detection, but it's inside the IsConfigurationChangeRequiresReboot() case, which is never reached when IsRollingUpdate() returns true first.

Solution

Move image change detection before the IsRollingUpdate() check:

// Check for image changes BEFORE checking RollingUpdate
case w.isImageChangeRequested(host):
    w.a.V(1).M(host).F().Info("Image change detected - deferring restart to STS rollout")
    return false

case host.GetCR().IsRollingUpdate():
    return true

case model.IsConfigurationChangeRequiresReboot(host):
    return true

The image change check is now a top-level case (not nested inside IsConfigurationChangeRequiresReboot). This provides universal protection because:

  1. Consistent behavior: Image changes should always defer to StatefulSet, regardless of the restart policy
  2. Race condition prevention: The core issue (old image + new config) can occur in any scenario, not just config-reboot cases
  3. Simpler logic: Easier to reason about - one rule applies everywhere
  4. Follows StatefulSet semantics: When image changes, K8s StatefulSet will handle pod recreation; no need for SYSTEM SHUTDOWN

Now when an image upgrade is happening:

  • Image change is detected first → skip forced restart
  • StatefulSet rolling update handles the upgrade
  • Pod starts with new image + new config together ✅

Testing

Tested with:

  • Operator version: v0.26.2 + this fix
  • CHI configuration: spec.restart: RollingUpdate
  • Upgrade scenario: 25.3.8.23 → 25.8.17.37
  • Version-specific setting: write_marks_for_substreams_in_compact_parts: 0

Result: ✅ Successful upgrade with no CrashLoopBackOff

Operator log evidence:

I0415 04:57:43.792516 shouldForceRestartHost():Image change detected - deferring restart to STS rollout. Host: 0-0

Complete testing documentation available in the issue comment.

Impact

  • Affected users: Anyone using spec.restart: RollingUpdate with simultaneous image upgrades and version-specific settings
  • Risk: Low - only changes the order of checks in switch statement
  • Backward compatibility: ✅ Maintains existing behavior for all other scenarios

Behavior Changes

Scenario Before After Impact
Image change + RollingUpdate + Config reboot Force restart (❌ CrashLoop) Skip restart Fixed
Image change + RollingUpdate + No config reboot Force restart Skip restart Better - no unnecessary restart
Image change + No RollingUpdate + Config reboot Skip restart Skip restart Same - original v0.26.2 fix still works
No image change + RollingUpdate + Config reboot Force restart Force restart Same
No image change + No RollingUpdate + Config reboot Force restart Force restart Same

Key insight: Making image change check universal provides broader protection without breaking existing functionality.

Release Notes

Bug Fix: Prevent CrashLoopBackOff when upgrading ClickHouse version with spec.restart: RollingUpdate and version-specific settings. The operator now correctly defers restart to StatefulSet rolling update when image changes are detected.

Checklist

  • All commits in the PR are squashed. More info
  • The PR is made into dedicated next-release branch, not into master branch1. More info
  • The PR is signed. More info

Suggested Target Branch

This is a bug fix suitable for a patch release (0.26.3) as it:

  • Fixes a critical issue causing CrashLoopBackOff
  • Has minimal risk (only reorders existing checks)
  • Doesn't add new features
  • Is backward compatible

If 0.26.3 branch doesn't exist yet, this can be:

  1. Merged to a branch where 0.26.3 will be cut from, OR
  2. Backported to 0.26.x after merging to main development branch

Please advise on the preferred target branch.

Additional Notes

During testing, we also discovered a separate issue where the onUpdateFailure: abort reconcile policy (introduced to protect Keeper clusters) conflicts with mutating admission webhooks. This is documented separately and not part of this fix. See the issue comment for details.


Fixes #1926

Move image change detection before IsRollingUpdate check to prevent
forced restart (SYSTEM SHUTDOWN) when upgrading ClickHouse version
while simultaneously adding version-specific settings.
Fixes Altinity#1926

Signed-off-by: dashashutosh80 <[email protected]>
@alex-zaitsev alex-zaitsev merged commit b45c85e into Altinity:0.27.0 Apr 15, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Image upgrade with version-specific settings causes CrashLoopBackOff due to ConfigMap update before StatefulSet reconciliation

2 participants