feat(ozx): pin zip metadata so pack_ozx is byte-reproducible #411
Open
alxndrkalinin wants to merge 1 commit into ngff/rfc-9/zip from
Stacked on top of feat: Added RFC-9 Zipped OME-Zarr (PR #408).

Two ``pack_ozx`` runs over the same OME-Zarr source previously produced different ``.ozx`` bytes, and therefore different sha256s, because Python's ``zipfile`` module stamps every Local File Header with ``time.localtime()[:6]`` when called as ``zout.open(arcname, mode="w")``. Two more variable fields piggyback on that path: ``create_system`` (3 on Unix, 0 on Windows) and ``external_attr`` (file-mode bits sniffed from the OS).

Non-determinism here is a problem for any artifact-integrity workflow downstream: the ML Commons Croissant 1.1 spec embeds ``cr:FileObject.sha256`` per archive, and per-dataset MANIFEST.json files (e.g. dynacell-paper's distribution layer) record the sha256 that gets published to AWS Open Data. Without byte reproducibility, the verification chain breaks the moment someone re-packs a source zarr that has not actually changed.

Fix: build a ``ZipInfo`` per entry with every variable field pinned, then pass it to ``zout.open(zinfo, mode="w", force_zip64=True)`` instead of the bare arcname. New helper ``_reproducible_zip_info`` sets:
- ``date_time = (1980, 1, 1, 0, 0, 0)``: the zip epoch, the earliest representable timestamp.
- ``create_system = 3``: Unix, regardless of the host OS that ran the pack.
- ``external_attr = (S_IFREG | 0o644) << 16``: regular-file 0o644 in the upper 16 bits, where Unix-creator entries store mode bits.
- ``compress_type = ZIP_STORED``: already enforced at the ZipFile level; the per-entry pin is belt-and-braces.
- ``force_zip64=True`` on every ``zout.open`` call so the ZIP64 decision is uniform; otherwise zipfile picks per-entry based on size, which can flip for borderline-sized chunks.

Existing functional behaviour is unchanged: the round-trip, HCS-layout, and version-sniffing tests all still pass. RFC-9 compliance is also unaffected: BFS entry order, archive comment, and ZIP_STORED compression are all preserved.

Tests:
- ``test_pack_ozx_is_byte_reproducible``: pack the same source twice, assert ``a.read_bytes() == b.read_bytes()``. Fails on the pre-fix branch.
- ``test_pack_ozx_pinned_zip_metadata``: walk every entry's ``ZipInfo`` and assert ``date_time``, ``create_system``, and ``external_attr`` carry the pinned values. Catches a future refactor that drops one of the pins (which would still pass the byte-equality test for two same-machine same-second runs but silently break cross-machine reproducibility).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Collaborator
Isn't this okay though? You'd only need to create the
Author
@srivarra actually, you're right. I was thinking about end-to-end reproducibility. Like, if we accidentally rebuild or overwrite and then downstream does not match anymore. Would be nice to know that packing the same thing twice yields the same bytes, but it's a nice-to-have, not a must. Feel free to close if not needed.
Stacked on top of #408 (`feat: Added RFC-9 Zipped OME-Zarr`).
Problem
Two `pack_ozx` runs over the same OME-Zarr source produce different `.ozx` bytes — and therefore different sha256s — because Python's `zipfile` module stamps every Local File Header with `time.localtime()[:6]` when called as `zout.open(arcname, mode="w")`. Two more variable fields piggyback on that path: `create_system` (3 on Unix, 0 on Windows) and `external_attr` (file-mode bits sniffed from the OS).
This breaks any artifact-integrity workflow downstream — the ML Commons Croissant 1.1 spec embeds `cr:FileObject.sha256` per archive, and per-dataset MANIFEST.json files (e.g. dynacell-paper's distribution layer) record the sha256 that gets published to AWS Open Data. Without byte reproducibility, the verification chain breaks the moment someone re-packs a source zarr that has not actually changed.
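To make the verification step concrete, here is a minimal sketch of how a downstream consumer might check an archive against a recorded sha256. The function names `sha256_of_file` and `matches_recorded_hash` are hypothetical and not part of this PR or of any manifest tooling mentioned above; only `hashlib` and `pathlib` from the standard library are used.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path: Path, recorded_sha256: str) -> bool:
    """True when the archive on disk still hashes to the recorded value.

    `recorded_sha256` stands in for whatever a manifest stores (assumption).
    """
    return sha256_of_file(path) == recorded_sha256.lower()
```

Any single differing byte in the archive, including a header timestamp, changes the digest, which is why a re-pack of unchanged data can fail this kind of check.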
Fix
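Build a `ZipInfo` per entry with every variable field pinned, then pass it to `zout.open(zinfo, mode="w", force_zip64=True)` instead of the bare arcname. New helper `_reproducible_zip_info` sets:
- `date_time = (1980, 1, 1, 0, 0, 0)`: the zip epoch, the earliest representable timestamp.
- `create_system = 3`: Unix, regardless of the host OS that ran the pack.
- `external_attr = (S_IFREG | 0o644) << 16`: regular-file 0o644 in the upper 16 bits, where Unix-creator entries store mode bits.
- `compress_type = ZIP_STORED`: already enforced at the `ZipFile` level; the per-entry pin is belt-and-braces.
- `force_zip64=True` on every `zout.open` call so the ZIP64 decision is uniform; otherwise `zipfile` picks per-entry based on size, which can flip for borderline-sized chunks.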
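As a rough sketch of what such a helper could look like (not the PR's actual code): `reproducible_zip_info` below is an illustrative stand-in for `_reproducible_zip_info`, and `pack_entries` is a hypothetical in-memory packer that takes `(arcname, payload)` pairs rather than walking a zarr store.

```python
import io
import stat
import zipfile


def reproducible_zip_info(arcname: str) -> zipfile.ZipInfo:
    """Illustrative stand-in for the helper described above (assumption)."""
    # Pin the timestamp to the zip epoch instead of the current time.
    zinfo = zipfile.ZipInfo(arcname, date_time=(1980, 1, 1, 0, 0, 0))
    zinfo.create_system = 3  # Unix, whatever OS runs the pack
    # Regular file with mode 0o644, stored in the upper 16 bits.
    zinfo.external_attr = (stat.S_IFREG | 0o644) << 16
    zinfo.compress_type = zipfile.ZIP_STORED
    return zinfo


def pack_entries(entries: list[tuple[str, bytes]]) -> bytes:
    """Hypothetical packer: write entries in the given order, return zip bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_STORED) as zout:
        for arcname, payload in entries:
            zinfo = reproducible_zip_info(arcname)
            # force_zip64=True keeps the ZIP64 decision uniform across entries.
            with zout.open(zinfo, mode="w", force_zip64=True) as dst:
                dst.write(payload)
    return buf.getvalue()
```

A fresh `ZipInfo` is built per entry because `zipfile` mutates the object it is given (sizes, CRC, header offset) while writing.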
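What this does not change
Existing functional behaviour is unchanged: the round-trip, HCS-layout, and version-sniffing tests all still pass. RFC-9 compliance is also unaffected: BFS entry order, archive comment, and `ZIP_STORED` compression are all preserved.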
Tests
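Two new tests in `tests/ngff/test_ozx.py`:
- `test_pack_ozx_is_byte_reproducible`: pack the same source twice, assert `a.read_bytes() == b.read_bytes()`. Fails on the pre-fix branch.
- `test_pack_ozx_pinned_zip_metadata`: walk every entry's `ZipInfo` and assert `date_time`, `create_system`, and `external_attr` carry the pinned values. Catches a future refactor that drops one of the pins (which would still pass the byte-equality test for two same-machine same-second runs but silently break cross-machine reproducibility).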
```
======================= 10 passed, 8 warnings in 22.49s ========================
```
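For readers who want to run the same kind of metadata check outside the test suite, a minimal sketch follows. The constants and the `unpinned_entries` function are hypothetical names, not taken from `tests/ngff/test_ozx.py`; the check reads the three fields from each entry's `ZipInfo` and reports any entry that deviates.

```python
import stat
import zipfile
from pathlib import Path

# Pinned values as described in this PR; the constant names are illustrative.
PINNED_DATE_TIME = (1980, 1, 1, 0, 0, 0)
PINNED_CREATE_SYSTEM = 3
PINNED_EXTERNAL_ATTR = (stat.S_IFREG | 0o644) << 16


def unpinned_entries(archive: Path) -> list[str]:
    """Return names of entries whose metadata deviates from the pinned values."""
    pinned = (PINNED_DATE_TIME, PINNED_CREATE_SYSTEM, PINNED_EXTERNAL_ATTR)
    with zipfile.ZipFile(archive) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if (info.date_time, info.create_system, info.external_attr) != pinned
        ]
```

An empty list means every entry carries the pinned metadata; a test could simply assert `unpinned_entries(path) == []`.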
🤖 Generated with Claude Code