dev: prioritize shard recovery and guard unsafe host disruption #1967

Open
shankeleven wants to merge 1 commit into Altinity:0.26.2 from shankeleven:unsafe_host_disruption

@shankeleven

Solves #1704

Problem
During interrupted rollouts (e.g. operator restart while a replica is down), reconciliation could continue and temporarily take all replicas in a shard offline, causing query failures or partial results.

Root cause

  • Reconcile was skipped when generation was unchanged, even if some hosts were unhealthy.
  • Host processing didn’t prioritize recovery and allowed disruptive actions without checking shard state.

Changes

  • Health-aware skip: only skip reconcile if generation is unchanged and all hosts are healthy. Otherwise, run recovery.
  • Recovery-first ordering: unhealthy hosts are reconciled before rollout.
  • Shard safety guard: block disruptive actions (restart, exclude) on a host when no other healthy replica exists in its shard.
  • Deferral + retry: unsafe actions are deferred; if no host makes progress, reconcile returns ErrCRUDAbort so the normal controller flow retries it.
  • Added unit + regression tests for the above behavior.

Validation

go test ./pkg/controller/chi/... passes.

@shankeleven

@sunsingerus, please review this PR and let me know if any changes are required.
