feat: Initial Zip file and OME-Zarr Archive (RFC-9) support #306
Purpose
Support OME-Zarr zip archives (RFC-9). I am an author of OME-Zarr RFC-9. Zip-backed Zarr stores are widespread and supported by a number of Zarr implementations.
Notably, Neuroglancer implements a zip file key-value store.
Background
The zip format is documented in PKWARE's APPNOTE. A zip file contains local file entries followed by a central directory at the end of the file. While the local file entries at the beginning of the file could be read in a streaming fashion, it is often necessary to read the central directory at the end of the file first. For example, duplicate local file entries may exist earlier in the archive, and only the central directory indicates which one is the current version.
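To make the layout concrete, here is a minimal sketch of locating the end of central directory record from the tail of a file. The function name `read_eocd` is hypothetical and this is not the code in this pull request. It assumes a single-disk archive without ZIP64 extensions, and it assumes the archive comment does not itself contain the record signature.

```python
import io
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22        # size of the record without the trailing comment
MAX_COMMENT_SIZE = 0xFFFF   # the comment length is a 16-bit field


def read_eocd(f):
    """Return (cd_offset, cd_size, entry_count, comment) for a seekable binary file.

    Hypothetical helper: no ZIP64 support, single-disk archives only.
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # The record lives in the last 22 + 65535 bytes, so only the tail is read.
    tail_size = min(file_size, EOCD_FIXED_SIZE + MAX_COMMENT_SIZE)
    f.seek(file_size - tail_size)
    tail = f.read(tail_size)
    # Search backwards for the signature (assumes it does not occur in the comment).
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_FIXED_SIZE > len(tail):
        raise ValueError("end of central directory record not found")
    (_sig, _disk, _cd_disk, _disk_entries, entry_count,
     cd_size, cd_offset, comment_len) = struct.unpack_from("<4s4H2LH", tail, pos)
    start = pos + EOCD_FIXED_SIZE
    comment = tail[start:start + comment_len]
    return cd_offset, cd_size, entry_count, comment
```

The returned `cd_offset` and `entry_count` are what a reader needs in order to start walking the central directory, and `comment` holds the raw bytes of the archive comment.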
OME-Zarr RFC-9 proposes a standard way to store OME-Zarr datasets in zip files with an OZX extension. A particular challenge for Zarr in zip files is the large number of files, which can make the central directory more expensive to parse. The RFC makes a number of recommendations to minimize the number of files, such as using sharding.
Another recommendation by the RFC is to list zarr.json metadata first in the central directory. This allows an application like fileglancer, which is mainly concerned with parsing metadata, to quickly access the metadata without having to parse the entire central directory. It also allows the entire tree of the hierarchy to be elucidated. To facilitate detection of this optimization, JSON is stored in the zip archive comment, which follows the central directory. This contains a `jsonFirst` flag to indicate that the writer of the file placed the JSON files first in the central directory. If this flag is true, a reader may assume that no further zarr.json files exist within the archive once the first non-zarr.json entry is read.
Design
While Python does provide a zipfile module in its standard library, a key performance optimization here is the ability to read a partial central directory. While browsing, only the metadata may be needed. As such, this pull request contains new code to read the central directory partially, which the standard library does not support.
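As an illustration of the idea, and not the implementation in this pull request, the sketch below walks central directory entries lazily and stops at the first non-zarr.json entry when the `jsonFirst` flag is set. The names `iter_central_directory` and `find_metadata_entries` are hypothetical. It assumes the central directory offset and entry count were already read from the end of central directory record, that the flag was already extracted from the archive comment, and that ZIP64 is not in use.

```python
import struct

CD_SIGNATURE = b"PK\x01\x02"
CD_HEADER = struct.Struct("<4s4B4HL2L5H2L")  # 46-byte fixed part of each entry


def iter_central_directory(f, cd_offset, entry_count):
    """Lazily yield (name, local_header_offset) for each central directory entry."""
    pos = cd_offset
    for _ in range(entry_count):
        # Re-seek every time so the caller may move the file pointer between yields.
        f.seek(pos)
        raw = f.read(CD_HEADER.size)
        if len(raw) < CD_HEADER.size:
            raise ValueError("truncated central directory")
        fields = CD_HEADER.unpack(raw)
        if fields[0] != CD_SIGNATURE:
            raise ValueError("bad central directory signature")
        flags = fields[5]
        name_len, extra_len, comment_len = fields[12], fields[13], fields[14]
        name_bytes = f.read(name_len)
        # The extra field and per-entry comment are skipped, not parsed.
        pos += CD_HEADER.size + name_len + extra_len + comment_len
        # Bit 11 of the general purpose flags marks UTF-8 names.
        encoding = "utf-8" if flags & 0x800 else "cp437"
        yield name_bytes.decode(encoding), fields[18]


def find_metadata_entries(f, cd_offset, entry_count, json_first):
    """Collect zarr.json entries, stopping early when json_first is true."""
    found = []
    for name, offset in iter_central_directory(f, cd_offset, entry_count):
        if name == "zarr.json" or name.endswith("/zarr.json"):
            found.append((name, offset))
        elif json_first:
            # The writer promised all zarr.json entries come first,
            # so the rest of the central directory need not be read.
            break
    return found
```

Because the iterator is a generator, breaking out of the loop means the remaining entries are never read from the file, which is where the saving comes from on archives with many chunk files.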
If a need arises to stream the contents of the zip file for external applications, then some additional reading and caching of the central directory may be needed.
While I was initially working on this feature with OZX in mind, implementing generic zip file support also seemed useful. Thus the core implementation separates out generic zip support and then builds OME-Zarr support on top of it.