
feat(ozx): pin zip metadata so pack_ozx is byte-reproducible#411

Open
alxndrkalinin wants to merge 1 commit into ngff/rfc-9/zip from ngff/rfc-9/zip-deterministic-pack

Conversation

@alxndrkalinin

Stacked on top of #408 (`feat: Added RFC-9 Zipped OME-Zarr`).

Problem

Two `pack_ozx` runs over the same OME-Zarr source produce different `.ozx` bytes — and therefore different sha256s — because Python's `zipfile` module stamps every Local File Header with `time.localtime()[:6]` when called as `zout.open(arcname, mode="w")`. Two more variable fields piggyback on that path: `create_system` (3 on Unix, 0 on Windows) and `external_attr` (file-mode bits sniffed from the OS).

This breaks any artifact-integrity workflow downstream — the ML Commons Croissant 1.1 spec embeds `cr:FileObject.sha256` per archive, and per-dataset MANIFEST.json files (e.g. dynacell-paper's distribution layer) record the sha256 that gets published to AWS Open Data. Without byte reproducibility, the verification chain breaks the moment someone re-packs a source zarr that has not actually changed.
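For context, the downstream check being protected is just a digest comparison: hash the archive, compare against the recorded value. A hedged sketch (the manifest layout here is an assumption, not the actual Croissant or MANIFEST.json schema):

```python
# Streaming sha256 of an archive, as a downstream verifier would compute it.
# Chunked reads keep memory flat for multi-GB .ozx files.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex sha256 of a file without loading it into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()
```

If re-packing an unchanged source can change this digest, every manifest that recorded it goes stale for no real reason; that is the failure mode this PR removes.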

Fix

Build a `ZipInfo` per entry with every variable field pinned, then pass it to `zout.open(zinfo, mode="w", force_zip64=True)` instead of the bare arcname. New helper `_reproducible_zip_info` sets:

| Field | Pinned to | Why |
| --- | --- | --- |
| `date_time` | `(1980, 1, 1, 0, 0, 0)` | zip epoch, the earliest representable timestamp |
| `create_system` | `3` (Unix) | regardless of host OS |
| `external_attr` | `(S_IFREG \| 0o644) << 16` | varies otherwise with umask + filesystem ACLs |
| `compress_type` | `ZIP_STORED` | already enforced at the ZipFile level; per-entry pin is belt-and-braces |
| `force_zip64` | `True` | zipfile otherwise picks ZIP64 per-entry based on size, which can flip for borderline-sized chunks |

What this does not change

  • BFS entry order — still emitted by `_bfs_order`, which satisfies the RFC-9 SHOULD.
  • Archive comment — already deterministic (sorted JSON, fixed version literal).
  • Round-trip behaviour — every existing test in this PR still passes.
  • Source-zarr non-determinism — content drift in the source legitimately drifts the archive. That's outside the scope of `pack_ozx`.

Tests

Two new tests in `tests/ngff/test_ozx.py`:

  • `test_pack_ozx_is_byte_reproducible`: pack the same source twice, assert `a.read_bytes() == b.read_bytes()`. Fails on the pre-fix branch.
  • `test_pack_ozx_pinned_zip_metadata`: walk every entry's `ZipInfo` and assert `date_time`, `create_system`, and `external_attr` carry the pinned values. Catches a future refactor that drops one of the pins — that case would still pass byte-equality for two same-machine same-second runs but silently break cross-machine reproducibility.
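The property the first test asserts can be demonstrated end to end without `pack_ozx` itself. This self-contained sketch packs the same entries twice with pinned metadata and checks byte equality; it mirrors the test's shape, not its actual code:

```python
# Self-contained demonstration that pinned ZipInfo metadata makes packing
# byte-reproducible. Not the real pack_ozx — same technique, toy entries.
import io
from stat import S_IFREG
from zipfile import ZIP_STORED, ZipFile, ZipInfo


def pack(entries: dict[str, bytes]) -> bytes:
    buf = io.BytesIO()
    with ZipFile(buf, "w", compression=ZIP_STORED) as zout:
        for name in entries:  # caller supplies a deterministic order
            zinfo = ZipInfo(filename=name, date_time=(1980, 1, 1, 0, 0, 0))
            zinfo.create_system = 3
            zinfo.external_attr = (S_IFREG | 0o644) << 16
            zinfo.compress_type = ZIP_STORED
            with zout.open(zinfo, mode="w", force_zip64=True) as f:
                f.write(entries[name])
    return buf.getvalue()


data = {"zarr.json": b"{}", "0/0": b"\x00" * 8}
assert pack(data) == pack(data)  # byte-identical across runs
```

The second test guards the individual pins precisely because two same-machine, same-second runs could pass the byte-equality check even with a pin missing.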

```
======================= 10 passed, 8 warnings in 22.49s ========================
```

🤖 Generated with Claude Code

@srivarra
Collaborator

srivarra commented Apr 28, 2026

@alxndrkalinin

Isn't this okay though? You'd only need to create the ozx once, right? Does it cause some other issue downstream?

@alxndrkalinin
Author

alxndrkalinin commented Apr 28, 2026

@srivarra actually, you're right. I was thinking about end-to-end reproducibility. Like, if we accidentally rebuild or overwrite and then downstream does not match anymore. Would be nice to know that packing the same thing twice yields the same bytes, but it's a nice-to-have, not a must. Feel free to close if not needed.
