dev: prioritize shard recovery and guard unsafe host disruption by shankeleven · Pull Request #1967 · Altinity/clickhouse-operator

shankeleven · 2026-04-24T09:31:22Z

Problem
During interrupted rollouts (e.g. operator restart while a replica is down), reconciliation could continue and temporarily take all replicas in a shard offline, causing query failures or partial results.

Root cause

Reconcile was skipped when generation was unchanged, even if some hosts were unhealthy.
Host processing didn’t prioritize recovery and allowed disruptive actions without checking shard state.

Changes

Health-aware skip: only skip reconcile if generation is unchanged and all hosts are healthy. Otherwise, run recovery.
Recovery-first ordering: unhealthy hosts are reconciled before rollout.
Shard safety guard: block disruptive actions (restart/exclude) if no healthy peer exists in the shard.
Deferral + retry: unsafe actions are deferred; if nothing progresses, return ErrCRUDAbort to retry via normal controller flow.
Added unit + regression tests for the above behavior.
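
The list above can be illustrated with a minimal, self-contained sketch. It is not the operator's actual code: the `Host` type and the `canSkipReconcile`, `recoveryFirst`, `hasHealthyPeer` and `reconcilePass` helpers are hypothetical names chosen for illustration, `ErrCRUDAbort` is redefined locally so the example compiles on its own, and "recovering" a host is simulated by flipping a flag.

```go
package main

import (
	"errors"
	"sort"
)

// ErrCRUDAbort stands in for the operator's error of the same name.
// Defining it locally is an assumption made so the sketch is self-contained.
var ErrCRUDAbort = errors.New("crud abort")

// Host is a hypothetical, simplified view of one replica.
type Host struct {
	Name    string
	Shard   int
	Healthy bool
	Pending bool // the rollout still has to be applied to this host
}

// canSkipReconcile implements the health-aware skip: reconcile may be
// skipped only if the generation is unchanged AND every host is healthy.
func canSkipReconcile(generationChanged bool, hosts []Host) bool {
	if generationChanged {
		return false
	}
	for _, h := range hosts {
		if !h.Healthy {
			return false
		}
	}
	return true
}

// recoveryFirst returns a copy of hosts with unhealthy hosts moved to the
// front; the relative order inside each group is preserved.
func recoveryFirst(hosts []Host) []Host {
	out := append([]Host(nil), hosts...)
	sort.SliceStable(out, func(i, j int) bool {
		return !out[i].Healthy && out[j].Healthy
	})
	return out
}

// hasHealthyPeer reports whether another healthy host exists in target's shard.
func hasHealthyPeer(target Host, hosts []Host) bool {
	for _, h := range hosts {
		if h.Shard == target.Shard && h.Name != target.Name && h.Healthy {
			return true
		}
	}
	return false
}

// reconcilePass runs one pass over the hosts. Unhealthy hosts are recovered
// first; a healthy host is disrupted only if a healthy peer remains in its
// shard, otherwise it is deferred. If hosts were deferred and nothing
// progressed, ErrCRUDAbort is returned so the caller can retry later.
func reconcilePass(hosts []Host) (processed, deferred []string, err error) {
	work := recoveryFirst(hosts)
	for i := range work {
		h := &work[i]
		switch {
		case !h.Healthy:
			// Assumption: the host is already down, so acting on it
			// does not reduce shard availability any further.
			h.Healthy, h.Pending = true, false
			processed = append(processed, h.Name)
		case !h.Pending:
			// Healthy and up to date: nothing to do.
		case hasHealthyPeer(*h, work):
			h.Pending = false
			processed = append(processed, h.Name)
		default:
			deferred = append(deferred, h.Name)
		}
	}
	if len(deferred) > 0 && len(processed) == 0 {
		return processed, deferred, ErrCRUDAbort
	}
	return processed, deferred, nil
}
```

In this sketch, a two-replica shard with one unhealthy host is handled in a single pass: the unhealthy host is recovered first, after which its peer can be rolled safely. A single-replica shard with a pending rollout has no healthy peer, so the host is deferred and the pass returns `ErrCRUDAbort`.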

Validation

go test ./pkg/controller/chi/... passes.

shankeleven · 2026-04-24T09:32:24Z

@sunsingerus please review this PR and let me know if there are any changes required

dev: prioritize shard recovery and guard unsafe host disruption

4639ccb

shankeleven wants to merge 1 commit into Altinity:0.26.2 from shankeleven:unsafe_host_disruption