Skip to content

Antalya 26.1: Fix rescheduleTasksFromReplica#1568

Merged
zvonand merged 3 commits intoantalya-26.1from
bugfix/antalya-26.1/1554_fix_crash
Apr 23, 2026
Merged

Antalya 26.1: Fix rescheduleTasksFromReplica#1568
zvonand merged 3 commits intoantalya-26.1from
bugfix/antalya-26.1/1554_fix_crash

Conversation

@ianton-ru
Copy link
Copy Markdown

@ianton-ru ianton-ru commented Mar 23, 2026

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix rescheduleTasksFromReplica

Documentation entry for user-facing changes

Fix incorrect change from c523f29
getReplicaForFile uses replica_to_files_to_be_processed to find best replica for file, With removing lost replica after getReplicaForFile call, getReplicaForFile chooses the same replica, so rescheduling makes no sense, files will be choosen only in getAnyUnprocessedFile and executed on random replicas.
This PR fixes the order, now files are matched with new best replicas.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@ianton-ru
Copy link
Copy Markdown
Author

@codex review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Workflow [PR], commit [31baade]

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the record


AI audit note: This review comment was generated by AI.

Audit update for PR #1568 (Fix rescheduleTasksFromReplica):

Confirmed defects

No confirmed defects in reviewed scope.

The change is correct:

  • std::move(processed_file_list_ptr->second) safely transfers list ownership to a local variable; std::list move leaves the source empty and well-defined.
  • processed_file_list_ptr (a map iterator) is invalidated by the subsequent erase, but is never dereferenced after that point.
  • getReplicaForFile is called within the mutex-held region, iterating over replica_to_files_to_be_processed which now correctly excludes the lost replica, so rendezvous hashing picks the best surviving replica.
  • The replica_to_files_to_be_processed.size() < 2 guard on line 307 still executes before the erase, correctly verifying at least one other replica remains.

Coverage summary

  • Scope reviewed: rescheduleTasksFromReplica (changed function), getReplicaForFile (callee), getNextTask / getPreQueuedFile / getMatchingFileFromIterator / getAnyUnprocessedFile (concurrent accessors), RemoteQueryExecutor::processPacket (caller on ConnectionLost), class header and all member data structures.
  • Categories passed: move-semantics safety, iterator invalidation, logical correctness of erase-before-hash, RAII/resource leaks, integer types, exception safety within the function, mutex coverage within rescheduleTasksFromReplica, rollback/partial-update (single-function scope, all mutations under one lock).
  • Categories failed: none.
  • Pre-existing issue noted (not introduced by this PR): getNextTask accesses replica_to_files_to_be_processed at lines 52 and 70 without the mutex, while rescheduleTasksFromReplica mutates it under the mutex. getMatchingFileFromIterator calls getReplicaForFile (which iterates the same map) at line 197 without the mutex. These are data races that can lead to use-after-free if a replica's in-flight getNextTask overlaps with rescheduleTasksFromReplica for the same replica. This PR does not introduce or worsen this race in any meaningful way.
  • Assumptions/limits: audit is static-only; no test for rescheduleTasksFromReplica was found in the test suite (gtest_rendezvous_hashing.cpp covers getNextTask only).

@ianton-ru ianton-ru changed the title Fix rescheduleTasksFromReplica Antalya 26.1: Fix rescheduleTasksFromReplica Mar 24, 2026
@ianton-ru ianton-ru added the port-antalya PRs to be ported to all new Antalya releases label Apr 22, 2026
@CarlosFelipeOR
Copy link
Copy Markdown
Collaborator

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1568 (Fix rescheduleTasksFromReplica):

Confirmed defects

No confirmed defects in reviewed scope.

Coverage summary

  • Scope reviewed: src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp (rescheduleTasksFromReplica) plus its call path through StorageObjectStorageCluster::TaskDistributor and RemoteQueryExecutor connection-loss rerun flow.
  • Categories failed: none.
  • Categories passed: call-graph/transition consistency, branch outcome checks (normal/lost-replica/all-replicas-lost), fault-injection reasoning (missing replica, no active replicas, duplicate requeue key behavior), multithreaded interleaving sanity (shared-state guarded by distributor mutex in reschedule path), rollback/partial-update review, and C++ bug-class sweep (lifetime, iterator invalidation, race/deadlock, exception safety, overflow/signedness, RAII/UB) within changed scope.
  • Assumptions/limits: static audit only (no runtime fault injection/TSAN), limited to this PR delta and directly connected code paths.

@CarlosFelipeOR
Copy link
Copy Markdown
Collaborator

CarlosFelipeOR commented Apr 22, 2026

QA Verification

1. Code review

Two independent AI audit reviews conducted on the changed function and adjacent call paths (rescheduleTasksFromReplica, getReplicaForFile, concurrent accessors, and the connection-loss rerun flow through RemoteQueryExecutor). Both reviews reported no confirmed defects.

References: #1568 (review), #1568 (comment).

2. CI

All CI builds and tests passed. The only regression failure is /swarms/feature/node failure/initiator out of disk space, which is a known pre-existing failure already being addressed by the QA team. The CVEs reported by the grype scan are also known and being tracked separately.

3. Related regression coverage

A swarms regression scenario covering rescheduleTasksFromReplica in the bucket-split path (kill + replica failure) was added in Altinity/clickhouse-regression#118, originally targeting #1486 / PR #1493. Since this PR touches the same function, that scenario will also exercise the code path changed here once it runs against builds containing both fixes.

Verdict

Considering this is a small, focused change confirmed by the audit reviews and a green CI, this PR can be considered safe to merge. Adding verified label.

@CarlosFelipeOR CarlosFelipeOR added the verified Approved for release label Apr 22, 2026
@zvonand zvonand merged commit b19fb22 into antalya-26.1 Apr 23, 2026
765 of 786 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.1 antalya-26.1.11.20001 bugfix port-antalya PRs to be ported to all new Antalya releases verified Approved for release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants