Implement version-specific default chunk and shard sizes in iohub utils#403

Merged
srivarra merged 5 commits into main from feat/sharding_in_utils on Apr 25, 2026

Conversation

Contributor
@ieivanov ieivanov commented Apr 22, 2026

Closes #401.

What this changes

Version-specific chunk and shard defaults for iohub.ngff.utils.create_empty_plate, plus the test coverage to pin them down.

src/iohub/ngff/utils.py

  • v0.5 (Zarr v3): default chunks = (1, 1, 16, 256, 256) clamped to shape; default shards_ratio yields a shard of (1, 1, Z, Y, X) — one shard per (T, C) slot, per the DCA spec referenced in Set version-specific default chunk and shard sizes in create_empty_plate #401.
  • v0.4 (Zarr v2): default chunks = (1, 1, Z, Y, X) with Z halved until the single-chunk byte size fits under _V04_MAX_CHUNK_SIZE_BYTES (500 MB). No sharding, as Zarr v2 has no sharding codec.
  • Extracted the branching into _default_chunks(shape, dtype, version) and _default_shards_ratio(shape, chunks) helpers, and promoted the two tunables to module-level constants _V05_DEFAULT_ZYX_CHUNKS and _V04_MAX_CHUNK_SIZE_BYTES.
  • Removed the max_chunk_size_bytes parameter from create_empty_plate; the 500 MB cap is now the module-level constant _V04_MAX_CHUNK_SIZE_BYTES and applies only to v0.4 chunks (v0.5 chunks are driven by the DCA spec, not a byte cap).
  • Updated the docstring to describe the version-specific defaults.

Breaking change?

Technically yes, but low blast radius. create_empty_plate(..., max_chunk_size_bytes=...) is no longer accepted — callers who passed it by keyword will get a TypeError. I confirmed that waveorder and biahub — the two primary downstream consumers of this function — never set the parameter explicitly, so this change is a no-op for them. No other in-tree callers pass it either (grep -R max_chunk_size_bytes in iohub turns up no hits outside utils.py).

If we want to be strict about API stability we can keep max_chunk_size_bytes as a deprecated passthrough that overrides the constant for v0.4, but given the known downstream callsites don't touch it, removing it is cleaner.

tests/ngff/test_ngff_utils.py

Added explicit tests that pin the defaults so regressions fail deterministically instead of depending on which examples hypothesis happens to draw:

  • test_v05_default_chunks (parametrized) — asserts the DCA-aligned chunk shape, clamped to each axis.
  • test_v05_default_shards_cover_zyx — one shard per (T, C) for divisible dims.
  • test_v05_default_shards_with_non_divisible_zyx — shard still covers Z/Y/X when they're not multiples of the chunk size.
  • test_v05_explicit_shards_ratio_is_honored — caller override wins.
  • test_v04_default_chunks_cover_full_zyx / test_v04_default_chunks_capped_by_byte_limit — full-ZYX chunks; Z halves when the chunk would exceed the byte cap.
  • test_v04_default_has_no_sharding — arr.shards is None on v0.4.
  • test_v04_rejects_explicit_shards_ratio — raises on v0.4.
  • test_process_single_position_on_sharded_v05_store — end-to-end round-trip through the default-sharded v0.5 layout.
  • test_apply_transform_to_tczyx_on_multi_time_shard — direct oindex write with shard_t > 1. No xfail: if the write ever stops succeeding, this fires.

Tightened the hypothesis strategies (apply_transform_czyx_setup, process_single_position_setup) so integer time_indices / channel_indices are drawn unique=True and sorted via .map(sorted). See the "Why hypothesis needed tightening" section below.
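
The tightened strategy boils down to this pattern (sorted_unique_indices is an illustrative name, not the strategy helper in the test file):

```python
from hypothesis import strategies as st


def sorted_unique_indices(max_index: int, max_size: int = 4):
    """Integer index lists drawn unique, then sorted, matching real callers."""
    return st.lists(
        st.integers(min_value=0, max_value=max_index),
        min_size=1,
        max_size=max_size,
        unique=True,
    ).map(sorted)
```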

Why hypothesis needed tightening

Before this change, the existing hypothesis-based tests (test_apply_transform_to_tczyx_and_save et al.) passed in CI under the new defaults but only by luck: @settings(max_examples=5) on each test meant CI drew 5 random examples per run, none of which happened to trip the failure. At max_examples=200 the same tests failed reliably.

Root cause: the new v0.5 defaults make stores sharded, and the write path goes through arr.oindex[...]. iohub's active codec pipeline is ZarrsCodecPipeline (zarrs is a required dep), which handles most oindex patterns correctly — but it raises UnsupportedVIndexingError and falls back to the default BatchedCodecPipeline when an indexer is duplicate or unsorted. The fallback pipeline has an upstream bug in its sharding codec merge step: zarr-developers/zarr-python#2834.

Direct repro on a sharded v0.5 store under iohub's default config:

Indexer pattern → result:

  • oindex[[0, 1], [0]] — unique, sorted: ✅ OK
  • oindex[[0, 2], [0]] — unique, non-contiguous but sorted: ✅ OK
  • oindex[[0, 0], [0]] — duplicates: ❌ fallback → zarr#2834
  • oindex[[1, 0, 2], [0]] — unsorted: ❌ fallback → zarr#2834

The hypothesis strategy was generating duplicate / unsorted integer lists like [0, 0] and [1, 0, 2]. Real callers (biahub, the hybrid_scheme processing pipelines, etc.) always build sorted unique index lists, so the duplicate/unsorted draws weren't exercising a real use case — they were exercising a zarrs fallback path that wouldn't otherwise fire. Enforcing unique=True + .map(sorted) at the strategy level matches what callers pass and keeps the tests on the zarrs fast path.
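
The fast-path condition from the table above reduces to "strictly increasing integer selector"; a sketch of the predicate, for reference (this is my summary of the observed behavior, not zarrs code):

```python
def stays_on_zarrs_fast_path(indexer: list[int]) -> bool:
    # An integer array selector avoids the UnsupportedVIndexingError fallback
    # when it is unique and sorted, i.e. strictly increasing.
    return all(a < b for a, b in zip(indexer, indexer[1:]))
```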

Verification

Full file:

pytest tests/ngff/test_ngff_utils.py
19 passed in 3.49s

Stress run (all three hypothesis-based tests, max_examples=200 each, generate-only so the 600 draws are all fresh):

3 passed in 140.54s

No flakes, no xfail, and the new explicit tests fail loudly if any of the pinned defaults regress.

Test plan

  • tests/ngff/test_ngff_utils.py passes at default settings
  • Hypothesis-based tests pass at max_examples=200 (no duplicate/unsorted fallbacks)
  • v0.5 sharded write path round-trips correctly via process_single_position
  • v0.5 multi-time oindex write into a shared shard succeeds
  • v0.4 rejects shards_ratio and has shards is None by default
  • Confirmed no downstream callers (waveorder, biahub) pass max_chunk_size_bytes
  • Test in biahub processing pipeline on mantis v2

Follow-up

Should we also update Position.create_zeros (and therefore create_image, which calls create_zeros) with better chunk and shard defaults? Currently chunking defaults to (1, 1, Z, Y, X) even for v0.5, and sharding is not used by default for v0.5.

Transferred to #409

    def create_zeros(
        self,
        name: str,
        shape: tuple[int, ...],
        dtype: DTypeLike,
        chunks: tuple[int, ...] | None = None,
        shards_ratio: tuple[int, ...] | None = None,
        transform: list[TransformationMeta] | None = None,
        check_shape: bool = True,
    ):
        """Create a new zero-filled image array in the position.
        ...
        """
        if not chunks:
            chunks = self._default_chunks(shape, 3)

        if check_shape:
            self._check_shape(shape)
        arr_handle = self._create_zarr_array(name, shape, dtype, chunks, shards_ratio)
        img_arr = ImageArray.from_handle(arr_handle, self._impl)
        self._create_image_meta(name, transform=transform)
        return img_arr

    @staticmethod
    def _default_chunks(shape, last_data_dims: int):
        chunks = shape[-min(last_data_dims, len(shape)) :]
        return pad_shape(chunks, target=len(shape))

ieivanov and others added 2 commits April 22, 2026 15:47
apply_transform_czyx_setup and process_single_position_setup now draw
integer time_indices / channel_indices with unique=True and .map(sorted).
That matches the pattern real callers pass and keeps the oindex selectors
on the zarrs fast path -- unsorted or duplicate integer arrays cause
ZarrsCodecPipeline to raise UnsupportedVIndexingError and fall back to
BatchedCodecPipeline, which hits zarr-developers/zarr-python#2834 on
sharded v0.5 stores (tracked in iohub#404).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@ieivanov ieivanov marked this pull request as ready for review April 22, 2026 23:25
Collaborator

@srivarra srivarra left a comment


Changes look good overall, just a question about what we are testing.

Comment thread on src/iohub/ngff/utils.py (outdated)
Comment thread on tests/ngff/test_ngff_utils.py (outdated)
Contributor

@talonchandler talonchandler left a comment


All looks good to me.

Friendly reminder not to delete this branch when you merge (waveorder's main now depends on it); then release, then update waveorder's dependency. Thank you!

@ieivanov ieivanov requested a review from srivarra April 24, 2026 22:59
Collaborator

@srivarra srivarra left a comment


Looks good!

@srivarra srivarra merged commit f281ac3 into main Apr 25, 2026
7 checks passed