Introduction

note

🚧 This book is under construction!

zarrs - A Rust Library for the Zarr Storage Format

zarrs is a Rust library for the Zarr V2 and Zarr V3 array storage formats. If you don't know what Zarr is, check out the Zarr website (https://zarr.dev) and the Zarr V3 specification.

zarrs was originally designed exclusively as a Rust library for Zarr V3. However, it now supports a V3 compatible subset of Zarr V2, and has Python and C/C++ bindings.

This book focuses mostly on the Rust implementation.

Using zarrs with zarr-python

zarr-python is the reference Python Zarr implementation.

The zarrs Python bindings (zarrs-python) expose a high-performance codec pipeline to zarr-python that uses zarrs under the hood. There is no need to learn a new API, and it is supported by downstream libraries like dask.

tip

Skip to the Python Bindings Chapter if you are not interested in the Rust library.

πŸš€ zarrs is Fast

The zarr_benchmarks repository includes benchmarks of zarrs against other Zarr V3 implementations. Check out the benchmarks below that measure the time to round trip a \(1024 \times 2048 \times 2048\) uint16 array encoded in various ways:

[Benchmark plot: standalone round trip]

[Benchmark plot: dask round trip]

More information on these benchmarks can be found in the zarr_benchmarks repository.

Installation

Prerequisites

The most recent zarrs requires a Rust version at or above its minimum supported Rust version (MSRV).

You can check your current Rust version by running:

rustc --version

If you don’t have Rust installed, follow the official Rust installation guide.

Some optional zarrs codecs depend on external system libraries.

These are typically available through package managers on Linux, Homebrew on macOS, etc.

Adding zarrs to Your Rust Library/Application

zarrs is a Rust library. To use it as a dependency in your Rust project, add it to your Cargo.toml file:

[dependencies]
zarrs = "18.0" # Replace with the latest version

See crates.io for the latest version and a full list of versions.

To use the latest development release:

[dependencies]
zarrs = { git = "https://github.com/LDeakin/zarrs.git" }

The Cargo reference has more information on git repository dependencies.

Crate Features

zarrs has a number of features for stores, codecs, or APIs, many of which are enabled by default. The below example demonstrates how to disable default features and explicitly enable required features:

[dependencies.zarrs]
version = "18.0"
default-features = false
features = ["filesystem", "blosc"]

See zarrs (docs.rs) - Crate Features for an up-to-date list of all available features.

zarrs Crates

Some zarrs functionality (e.g. additional stores, bindings, etc.) is in separate crates.

graph LR
    subgraph Core
        zarrs_metadata[zarrs_metadata <br> zarrs::metadata] --> zarrs
        zarrs_storage[zarrs_storage <br> zarrs::storage] --> zarrs
    end
    subgraph Stores
        direction LR
        zarrs_filesystem[zarrs_filesystem <br> zarrs::filesystem]
        zarrs_object_store
        zarrs_opendal
        zarrs_http
        zarrs_icechunk
        zarrs_zip
    end
    Stores --> zarrs_storage
    subgraph Bindings
        direction LR
        zarrs_ffi
        zarrs-python
    end
    zarrs --> Bindings
    subgraph CLI Tools
        zarrs --> zarrs_tools
    end
    subgraph metadata_conventions[Zarr Metadata Conventions]
        ome_zarr_metadata --> zarrs_tools
    end

Core Crates

zarrs

The core library for manipulating Zarr hierarchies.

zarrs_metadata

Provides Zarr V2 and V3 metadata serialisation and deserialisation.

If you are just interested in manipulating Zarr metadata, this crate may be relevant.

note

This crate is re-exported in zarrs as zarrs::metadata.

zarrs_storage

The storage API for zarrs.

Custom store implementations only need to depend on zarrs_storage.

note

This crate is re-exported in zarrs as zarrs::storage.

Store Crates

The Stores chapter details the various types of stores and their associated crates.

Bindings

zarrs_ffi

A subset of zarrs exposed as a C/C++ API.

This crate is detailed in the C/C++ Bindings chapter.

zarrs-python

A CodecPipeline for the zarr Python reference implementation that uses zarrs.

This crate is detailed in the Python Bindings chapter.

CLI Tools

zarrs_tools

Various tools for creating and manipulating Zarr V3 data with the zarrs Rust crate.

This crate is detailed in the zarrs_tools chapter.

Zarr Metadata Conventions

ome_zarr_metadata

A Rust library for OME-Zarr (previously OME-NGFF) metadata.

OME-Zarr, formerly known as OME-NGFF (Open Microscopy Environment Next Generation File Format), is a specification designed to support modern scientific imaging needs. It is widely used in microscopy, bioimaging, and other scientific fields requiring high-dimensional data management, visualisation, and analysis.

Zarr Stores

A Zarr store is a system that can be used to store and retrieve data from a Zarr hierarchy. For example: a filesystem, HTTP server, FTP server, Amazon S3 bucket, etc. A store implements a key/value store interface for storing, retrieving, listing, and erasing keys.

The Zarr V3 storage API is detailed in the Zarr V3 specification.

The Sync and Async API

Zarr Groups and Arrays are the core components of a Zarr hierarchy. In zarrs, each has both a synchronous and an asynchronous API. The applicable API depends on the storage that the group or array is created with.

Async API methods typically have an async_ prefix. In subsequent chapters, async API method calls are shown commented out below their sync equivalent.

warning

The async API is still considered experimental, and it requires the async feature.

Synchronous Stores

Memory

MemoryStore is a synchronous in-memory store available in the zarrs_storage crate (re-exported as zarrs::storage).

#![allow(unused)]
fn main() {
use zarrs::storage::ReadableWritableListableStorage;
use zarrs::storage::store::MemoryStore;

let store: ReadableWritableListableStorage = Arc::new(MemoryStore::new());
}

Note that in-memory stores do not persist data, and they are not suited to distributed (i.e. multi-process) usage.

Filesystem

FilesystemStore is a synchronous filesystem store available in the zarrs_filesystem crate (re-exported as zarrs::filesystem with the filesystem feature).

#![allow(unused)]
fn main() {
use zarrs::storage::ReadableWritableListableStorage;
use zarrs::filesystem::FilesystemStore;

let base_path = "/";
let store: ReadableWritableListableStorage =
    Arc::new(FilesystemStore::new(base_path)?);
}

The base path is the root of the filesystem store. Node paths are relative to the base path.

The filesystem store also has a new_with_options constructor. Currently the only option available for filesystem stores is whether or not to enable direct I/O on Linux.
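
For example, a sketch enabling direct I/O, assuming FilesystemStoreOptions exposes a direct_io option and that new_with_options takes the options by value:

#![allow(unused)]
fn main() {
use zarrs::filesystem::{FilesystemStore, FilesystemStoreOptions};

let mut options = FilesystemStoreOptions::default();
options.direct_io(true); // assumption: has no effect on non-Linux platforms

let store: ReadableWritableListableStorage =
    Arc::new(FilesystemStore::new_with_options("/", options)?);
}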

HTTP

HTTPStore is a read-only synchronous HTTP store available in the zarrs_http crate.

#![allow(unused)]
fn main() {
use zarrs::storage::ReadableStorage;
use zarrs_http::HTTPStore;

let http_store: ReadableStorage = Arc::new(HTTPStore::new("http://...")?);
}

note

The HTTP stores provided by object_store and opendal (see below) provide a more comprehensive feature set.

Asynchronous Stores

object_store

The object_store crate is an async object store library for interacting with object stores. Supported object stores include AWS S3, Google Cloud Storage, Azure Blob Storage, local filesystems, and HTTP stores.

zarrs_object_store::AsyncObjectStore wraps object_store::ObjectStore stores.

#![allow(unused)]
fn main() {
use zarrs::storage::AsyncReadableStorage;
use zarrs_object_store::AsyncObjectStore;

let options = object_store::ClientOptions::new().with_allow_http(true);
let store = object_store::http::HttpBuilder::new()
    .with_url("http://...")
    .with_client_options(options)
    .build()?;
let store: AsyncReadableStorage = Arc::new(AsyncObjectStore::new(store));
}

OpenDAL

The opendal crate offers a unified data access layer, empowering users to seamlessly and efficiently retrieve data from diverse storage services. It supports a huge range of services and layers to extend their behaviour.

zarrs_opendal::AsyncOpendalStore wraps opendal::Operator.

#![allow(unused)]
fn main() {
use zarrs::storage::AsyncReadableStorage;
use zarrs_opendal::AsyncOpendalStore;

let builder = opendal::services::Http::default().endpoint("http://...");
let operator = opendal::Operator::new(builder)?.finish();
let store: AsyncReadableStorage =
    Arc::new(AsyncOpendalStore::new(operator));
}

note

Some opendal stores can also be used in a synchronous context with zarrs_opendal::OpendalStore, which wraps opendal::BlockingOperator.
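
For example, a minimal sketch of a synchronous filesystem-backed store, assuming the service supports opendal's blocking API via Operator::blocking():

#![allow(unused)]
fn main() {
use zarrs::storage::ReadableWritableListableStorage;
use zarrs_opendal::OpendalStore;

let builder = opendal::services::Fs::default().root("/");
let operator = opendal::Operator::new(builder)?.finish().blocking();
let store: ReadableWritableListableStorage =
    Arc::new(OpendalStore::new(operator));
}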

Icechunk

icechunk is a transactional storage engine for Zarr designed for use on cloud object storage.

#![allow(unused)]
fn main() {
// Create an icechunk store
let storage = Arc::new(icechunk::ObjectStorage::new_in_memory_store(None));
let icechunk_store = icechunk::Store::new_from_storage(storage).await?;
let store =
    Arc::new(zarrs_icechunk::AsyncIcechunkStore::new(icechunk_store));

// Do some array/metadata manipulation with zarrs, then commit a snapshot
let snapshot0 = store.commit("Initial commit").await?;

// Do some more array/metadata manipulation, then commit another snapshot
let snapshot1 = store.commit("Update data").await?;

// Checkout the first snapshot
store.checkout(icechunk::zarr::VersionInfo::SnapshotId(snapshot0)).await?;
}

Storage Adapters

Storage adapters can be layered on top of stores to change their functionality.

The below storage adapters are all available in the zarrs::storage submodule (via the zarrs_storage crate).

Async to Sync

Asynchronous stores can be used in a synchronous context with the AsyncToSyncStorageAdapter.

The AsyncToSyncBlockOn trait must be implemented for a runtime or runtime handle in order to block on futures. See the below tokio example:

#![allow(unused)]
fn main() {
use zarrs::storage::storage_adapter::async_to_sync::AsyncToSyncBlockOn;

struct TokioBlockOn(tokio::runtime::Runtime); // or handle

impl AsyncToSyncBlockOn for TokioBlockOn {
    fn block_on<F: core::future::Future>(&self, future: F) -> F::Output {
        self.0.block_on(future)
    }
}
}
#![allow(unused)]
fn main() {
use zarrs::storage::{AsyncReadableStorage, ReadableStorage};

// Create an async store as normal
let builder = opendal::services::Http::default().endpoint("http://...");
let operator = opendal::Operator::new(builder)?.finish();
let storage: AsyncReadableStorage =
    Arc::new(AsyncOpendalStore::new(operator));

// Create a tokio runtime and adapt the store to sync
let block_on = TokioBlockOn(tokio::runtime::Runtime::new()?);
let store: ReadableStorage =
    Arc::new(AsyncToSyncStorageAdapter::new(storage, block_on));
}

warning

Many async stores are not runtime-agnostic (i.e. only support tokio).

Usage Log

The UsageLogStorageAdapter logs storage method calls.

It is intended to aid in debugging and optimising performance by revealing storage access patterns.

#![allow(unused)]
fn main() {
let store = Arc::new(MemoryStore::new());
let log_writer = Arc::new(Mutex::new(
    // std::io::BufWriter::new(
    std::io::stdout(),
    //    )
));
let store = Arc::new(UsageLogStorageAdapter::new(store, log_writer, || {
    chrono::Utc::now().format("[%T%.3f] ").to_string()
}));
}

Performance Metrics

The PerformanceMetricsStorageAdapter accumulates metrics, such as bytes read and written.

It is intended to aid in testing by allowing the application to validate that metrics (e.g., bytes read/written, total read/write operations) match expected values for specific operations.

#![allow(unused)]
fn main() {
let store = Arc::new(MemoryStore::new());
let store = Arc::new(PerformanceMetricsStorageAdapter::new(store));

assert_eq!(store.bytes_read(), ...);
}

Zarr Groups

A group is a node in a Zarr hierarchy that may have child nodes (arrays or groups).

[Diagram: overview of a Zarr hierarchy]

Each array or group in a hierarchy is represented by a metadata document, which is a machine-readable document containing essential processing information about the node. For a group, the metadata document contains the Zarr Version and optional user attributes.

Opening an Existing Group

An existing group can be opened with Group::open (or async_open):

let group = Group::open(store.clone(), "/group")?;
// let group = Group::async_open(store.clone(), "/group").await?;

note

These methods will open a Zarr V2 or Zarr V3 group. If you only want to open a specific Zarr version, see open_opt and MetadataRetrieveVersion.
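
For example, a sketch that only permits Zarr V3 metadata (assuming the MetadataRetrieveVersion::V3 variant):

let group = Group::open_opt(
    store.clone(),
    "/group",
    &MetadataRetrieveVersion::V3, // fails if only V2 metadata is present
)?;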

Creating Attributes

Attributes are encoded in a JSON object (a serde_json::Map<String, serde_json::Value>).

Here are a few different approaches for constructing a JSON object:

// Create a serde_json::Value and convert it to an object
let value = serde_json::json!({
    "spam": "ham",
    "eggs": 42
});
let attributes: serde_json::Map<String, serde_json::Value> =
    value.as_object().unwrap().clone();

// Alternatively, destructure the value
let serde_json::Value::Object(attributes) = value else { unreachable!() };

// Or build the map directly
let mut attributes = serde_json::Map::default();
attributes.insert("spam".to_string(), serde_json::Value::String("ham".to_string()));
attributes.insert("eggs".to_string(), serde_json::Value::Number(42.into()));

Alternatively, you can encode your attributes in a struct deriving Serialize, and serialize it to a JSON object.
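
For example, a minimal sketch using a hypothetical CustomAttributes struct:

#[derive(serde::Serialize)]
struct CustomAttributes {
    spam: String,
    eggs: u64,
}

let custom = CustomAttributes {
    spam: "ham".to_string(),
    eggs: 42,
};
let serde_json::Value::Object(attributes) = serde_json::to_value(&custom)? else {
    unreachable!()
};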

Creating a Group with the GroupBuilder

note

The GroupBuilder only supports Zarr V3 groups.

let group = zarrs::group::GroupBuilder::new()
    .attributes(attributes)
    .build(store.clone(), "/group")?;
group.store_metadata()?;
// group.async_store_metadata().await?;

Note that the /group path is relative to the root of the store.

Remember to Store Metadata!

Group metadata must always be stored explicitly, even if the attributes are empty. Support for implicit groups without metadata was removed long after provisional acceptance of the Zarr V3 specification.

tip

Consider deferring storage of group metadata until child group/array operations are complete. The presence of valid metadata can act as a signal that the data is ready.

Creating a Group from GroupMetadata

Zarr V3

// Specify the group metadata
let metadata: GroupMetadata =
    GroupMetadataV3::new().with_attributes(attributes).into();

// Create the group and write the metadata
let group =
    Group::new_with_metadata(store.clone(), "/group", metadata)?;
group.store_metadata()?;
// group.async_store_metadata().await?;

// Alternatively, specify the group metadata as JSON
let metadata: GroupMetadataV3 = serde_json::from_str(
    r#"{
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {
        "spam": "ham",
        "eggs": 42
    },
    "unknown": {
        "must_understand": false
    }
}"#,
)?;

// Create the group and write the metadata
let group =
    Group::new_with_metadata(store.clone(), "/group", metadata.into())?;
group.store_metadata()?;
// group.async_store_metadata().await?;

Zarr V2

// Specify the group metadata
let metadata: GroupMetadata =
    GroupMetadataV2::new().with_attributes(attributes).into();

// Create the group and write the metadata
let group = Group::new_with_metadata(store.clone(), "/group", metadata)?;
group.store_metadata()?;
// group.async_store_metadata().await?;

Mutating Group Metadata

Group attributes can be changed after initialisation with Group::attributes_mut:

#![allow(unused)]
fn main() {
group
    .attributes_mut()
    .insert("foo".into(), serde_json::Value::String("bar".into()));
group.store_metadata()?;
}

Don't forget to store the updated metadata after attributes have been mutated.

Zarr Arrays

An array is a node in a hierarchy that may not have any child nodes.

An array is a data structure with zero or more dimensions whose lengths define the shape of the array. An array contains zero or more data elements all of the same data type.

[Diagram: overview of a Zarr array]

The following sections will detail the initialisation, reading, and writing of arrays.

Array Initialisation

Opening an Existing Array

An existing array can be opened with Array::open (or async_open):

let array_path = "/group/array";
let array = Array::open(store.clone(), array_path)?;
// let array = Array::async_open(store.clone(), array_path).await?;

note

These methods will open a Zarr V2 or Zarr V3 array. If you only want to open a specific Zarr version, see open_opt and MetadataRetrieveVersion.

Creating a Zarr V3 Array with the ArrayBuilder

note

The ArrayBuilder only supports Zarr V3 arrays.

let array_path = "/group/array";
let array = ArrayBuilder::new(
    vec![8, 8], // array shape
    DataType::Float32,
    vec![4, 4].try_into()?, // regular chunk shape
    FillValue::from(ZARR_NAN_F32),
)
// .bytes_to_bytes_codecs(vec![]) // uncompressed
.bytes_to_bytes_codecs(vec![
    Arc::new(GzipCodec::new(5)?),
])
.dimension_names(["y", "x"].into())
// .attributes(...)
// .storage_transformers(vec![].into())
.build(store.clone(), array_path)?;
array.store_metadata()?;
// array.async_store_metadata().await?;

tip

The Group Initialisation Chapter has tips for creating attributes.

Remember to Store Metadata!

Array metadata must always be stored explicitly, otherwise an array cannot be opened.

tip

Consider deferring storage of array metadata until after chunk operations are complete. The presence of valid metadata can act as a signal that the data is ready.
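
For example, a sketch of this pattern (the chunk indices and elements are illustrative):

// Store chunk data first...
array.store_chunk_elements::<f32>(&[0, 0], &[1.0; 16])?;
// ...and store the metadata last, once the data is ready
array.store_metadata()?;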

Creating a Zarr V3 Sharded Array

The ShardingCodecBuilder is useful for creating an array that uses the sharding_indexed codec.

let mut sharding_codec_builder = ShardingCodecBuilder::new(
    vec![4, 4].try_into()? // inner chunk shape
);
sharding_codec_builder.bytes_to_bytes_codecs(vec![
    Arc::new(GzipCodec::new(5)?),
]);

let array = ArrayBuilder::new(
    ...
)
.array_to_bytes_codec(sharding_codec_builder.build_arc())
.build(store.clone(), array_path)?;
array.store_metadata()?;
// array.async_store_metadata().await?;

Creating a Zarr V3 Array from Metadata

An array can be created from ArrayMetadata instead of an ArrayBuilder if needed.

let json: &str = r#"{
    "zarr_format": 3,
    "node_type": "array",
    ...
}#";
Full Zarr V3 array JSON example
let json: &str = r#"{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [
        10000,
        1000
    ],
    "data_type": "float64",
    "chunk_grid": {
        "name": "regular",
        "configuration": {
        "chunk_shape": [
            1000,
            100
        ]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
        "separator": "/"
        }
    },
    "fill_value": "NaN",
    "codecs": [
        {
        "name": "bytes",
        "configuration": {
            "endian": "little"
        }
        },
        {
        "name": "gzip",
        "configuration": {
            "level": 1
        }
        }
    ],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [
        1,
        2,
        3,
        4
        ]
    },
    "dimension_names": [
        "rows",
        "columns"
    ]
}"#;
// Parse the JSON metadata
let array_metadata: ArrayMetadata = serde_json::from_str(json)?;

// Create the array
let array = Array::new_with_metadata(
    store.clone(),
    "/group/array",
    array_metadata.into(),
)?;
array.store_metadata()?;
// array.async_store_metadata().await?;

Alternatively, ArrayMetadataV3 can be constructed with ArrayMetadataV3::new() and subsequent with_ methods:

// Specify the array metadata
let array_metadata: ArrayMetadata = ArrayMetadataV3::new(
    serde_json::from_str("[10, 10]")?,
    serde_json::from_str(r#"{"name": "regular", "configuration":{"chunk_shape": [5, 5]}}"#)?,
    serde_json::from_str(r#""float32""#)?,
    serde_json::from_str("0.0")?,
    serde_json::from_str(r#"[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]"#)?,
).with_chunk_key_encoding(
    serde_json::from_str(r#"{"name": "default", "configuration": {"separator": "/"}}"#)?,
).with_attributes(
    serde_json::from_str(r#"{"foo": 42, "bar": "apples", "baz": [1, 2, 3, 4]}"#)?,
).with_dimension_names(
    Some(serde_json::from_str(r#"["y", "x"]"#)?),
)
.into();

// Create the array
let array = Array::new_with_metadata(
    store.clone(),
    "/group/array",
    array_metadata,
)?;
array.store_metadata()?;
// array.async_store_metadata().await?;

Creating a Zarr V2 Array

The ArrayBuilder does not support Zarr V2 arrays. Instead, they must be built from ArrayMetadataV2.

// Specify the array metadata
let array_metadata: ArrayMetadata = ArrayMetadataV2::new(
    vec![10, 10], // array shape
    vec![5, 5].try_into()?, // regular chunk shape
    ">f4".into(), // big endian float32
    FillValueMetadataV2::NaN, // fill value
    None, // compressor
    None, // filters
)
.with_dimension_separator(ChunkKeySeparator::Slash)
.with_order(ArrayMetadataV2Order::F)
.with_attributes(attributes.clone())
.into();

// Create the array
let array = Array::new_with_metadata(
    store.clone(),
    "/group/array",
    array_metadata,
)?;
array.store_metadata()?;
// array.async_store_metadata().await?;

warning

Array::new_with_metadata can fail if Zarr V2 metadata is unsupported by zarrs.

Mutating Array Metadata

The shape, dimension names, attributes, and additional fields of an array are mutable.

Don't forget to write the metadata after mutating array metadata!
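
For example, a sketch that grows the array and adds an attribute (assuming an Array::set_shape method):

array.set_shape(vec![16, 16]);
array
    .attributes_mut()
    .insert("foo".into(), serde_json::Value::String("bar".into()));
array.store_metadata()?;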


The next chapters detail the reading and writing of array data.

Reading Arrays

Overview

Array operations are divided into several categories based on the traits implemented for the backing storage. This section focuses on the [Async]ReadableStorageTraits methods.

Additional methods are offered by extension traits, which are covered later in this chapter.

Method Variants

Many retrieve and store methods have multiple variants:

  • Standard variants store or retrieve data represented as ArrayBytes (representing fixed or variable length bytes).
  • _elements suffix variants can store or retrieve chunks with a known type.
  • _ndarray suffix variants can store or retrieve an ndarray::Array (requires ndarray feature).
  • _opt suffix variants have a CodecOptions parameter for fine-grained concurrency control and more.
  • Variants without the _opt suffix use default CodecOptions.
  • async_ prefix variants can be used with async stores (requires async feature).

Reading a Chunk

Reading and Decoding a Chunk

let chunk_indices: Vec<u64> = vec![1, 2];
let chunk_bytes: ArrayBytes = array.retrieve_chunk(&chunk_indices)?;
let chunk_elements: Vec<f32> =
    array.retrieve_chunk_elements(&chunk_indices)?;
let chunk_array: ndarray::ArrayD<f32> =
    array.retrieve_chunk_ndarray(&chunk_indices)?;

warning

_elements and _ndarray variants will fail if the element type does not match the array data type. They do not perform any conversion.

Skipping Empty Chunks

Use retrieve_chunk_if_exists to only retrieve a chunk if it exists (i.e. it has been written to the store and is not composed entirely of the fill value):

let chunk_bytes: Option<ArrayBytes> =
    array.retrieve_chunk_if_exists(&chunk_indices)?;
let chunk_elements: Option<Vec<f32>> =
    array.retrieve_chunk_elements_if_exists(&chunk_indices)?;
let chunk_array: Option<ndarray::ArrayD<f32>> =
    array.retrieve_chunk_ndarray_if_exists(&chunk_indices)?;

Retrieving an Encoded Chunk

An encoded chunk can be retrieved without decoding with retrieve_encoded_chunk:

let chunk_bytes_encoded: Option<Vec<u8>> =
    array.retrieve_encoded_chunk(&chunk_indices)?;

This returns None if a chunk does not exist.

Parallelism and Concurrency

Codec and Chunk Parallelism

Codecs run in parallel on a threadpool. Array store and retrieve methods will also run in parallel when they involve multiple chunks. zarrs will automatically choose where to prioritise parallelism between codecs/chunks based on the codecs and number of chunks.

By default, all available CPU cores will be used (where possible/efficient). Concurrency can be limited globally with Config::set_codec_concurrent_target or as required using _opt methods with CodecOptions populated with CodecOptions::set_concurrent_target.
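
For example, a sketch of both approaches (the limits are illustrative):

// Globally limit codec concurrency
zarrs::config::global_config_mut().set_codec_concurrent_target(4);

// Limit concurrency for a single operation with an _opt method variant
let mut options = CodecOptions::default();
options.set_concurrent_target(4);
let chunk_bytes: ArrayBytes = array.retrieve_chunk_opt(&chunk_indices, &options)?;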

Async API Concurrency

This crate is async runtime-agnostic. Async methods do not spawn tasks internally, so asynchronous storage calls are concurrent but not parallel. Codec encoding and decoding operations still execute in parallel (where supported) in an asynchronous context.

Due to the lack of parallelism, methods like async_retrieve_array_subset or async_retrieve_chunks do not parallelise over chunks and can be slow compared to the sync API. Parallelism over chunks can be achieved by spawning tasks outside of zarrs. If executing many tasks concurrently, consider reducing the codec concurrent_target.

Reading Chunks in Parallel

The retrieve_chunks methods perform chunk retrieval with chunk parallelism.

Rather than taking a &[u64] parameter of the indices of a single chunk, these methods take an ArraySubset representing the chunks. Rather than returning a Vec for each chunk, the chunks are assembled into a single output for the entire region they cover:

let chunks = ArraySubset::new_with_ranges(&[0..2, 0..4]);
let chunks_bytes: ArrayBytes = array.retrieve_chunks(&chunks)?;
let chunks_elements: Vec<f32> = array.retrieve_chunks_elements(&chunks)?;
let chunks_array: ndarray::ArrayD<f32> =
    array.retrieve_chunks_ndarray(&chunks)?;

retrieve_encoded_chunks differs in that it does not assemble the output. Chunks returned are in order of the chunk indices returned by chunks.indices().into_iter():

let chunk_bytes_encoded: Vec<Option<Vec<u8>>> =
    array.retrieve_encoded_chunks(&chunks, &CodecOptions::default())?;

Reading a Chunk Subset

An ArraySubset represents a subset (region) of an array or chunk. It encodes a starting coordinate and a shape, and is foundational for many array operations.

The below array subsets are all identical:

let subset = ArraySubset::new_with_ranges(&[2..6, 3..5]);
let subset = ArraySubset::new_with_start_shape(vec![2, 3], vec![4, 2])?;
let subset = ArraySubset::new_with_start_end_exc(vec![2, 3], vec![6, 5])?;
let subset = ArraySubset::new_with_start_end_inc(vec![2, 3], vec![5, 4])?;

The retrieve_chunk_subset methods can be used to retrieve a subset of a chunk:

let chunk_subset: ArraySubset = ...;
let chunk_subset_bytes: ArrayBytes =
    array.retrieve_chunk_subset(&chunk_indices, &chunk_subset)?;
let chunk_subset_elements: Vec<f32> =
    array.retrieve_chunk_subset_elements(&chunk_indices, &chunk_subset)?;
let chunk_subset_array: ndarray::ArrayD<f32> =
    array.retrieve_chunk_subset_ndarray(&chunk_indices, &chunk_subset)?;

It is important to understand what is going on behind the scenes in these methods. A partial decoder is created that decodes the requested subset.

warning

Many codecs do not support partial decoding, so partial decoding may result in reading and decoding entire chunks!

Reading Multiple Chunk Subsets

If multiple chunk subsets are needed from a chunk, prefer to create a partial decoder and reuse it for each chunk subset.

let partial_decoder = array.partial_decoder(&chunk_indices)?;
let chunk_subsets_bytes_a_b: Vec<ArrayBytes> =
    partial_decoder.partial_decode(&[chunk_subset_a, chunk_subset_b, ...])?;
let chunk_subsets_bytes_c: Vec<ArrayBytes> =
    partial_decoder.partial_decode(&[chunk_subset_c])?;

On initialisation, partial decoders may insert a cache (depending on the codecs). For example, if a codec does not support partial decoding, its output (or an output of one of its predecessors in the codec chain) will be cached, and subsequent partial decoding operations will not access the store.

Reading an Array Subset

An arbitrary subset of an array can be read with the retrieve_array_subset methods:

let array_subset: ArraySubset = ...;
let subset_bytes: ArrayBytes =
    array.retrieve_array_subset(&array_subset)?;
let subset_elements: Vec<f32> =
    array.retrieve_array_subset_elements(&array_subset)?;
let subset_array: ndarray::ArrayD<f32> =
    array.retrieve_array_subset_ndarray(&array_subset)?;

Internally, these methods identify the overlapping chunks, call retrieve_chunk / retrieve_chunk_subset with chunk parallelism, and assemble the output.

Reading Inner Chunks (Sharded Arrays)

The sharding_indexed codec enables multiple sub-chunks ("inner chunks") to be stored in a single chunk ("shard"). With a sharded array, the chunk_grid and chunk indices in store/retrieve methods reference the chunks ("shards") of an array.

The ArrayShardedExt trait provides additional methods to Array to query if an array is sharded and retrieve the inner chunk shape. Additionally, the inner chunk grid can be queried, which is a ChunkGrid where chunk indices refer to inner chunks rather than shards.
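
For example, a sketch of querying the inner chunk grid (assuming the method names is_sharded and inner_chunk_grid):

use zarrs::array::ArrayShardedExt;

if array.is_sharded() {
    // Chunk indices in this grid refer to inner chunks, not shards
    let inner_chunk_grid = array.inner_chunk_grid();
}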

The ArrayShardedReadableExt trait adds Array methods to conveniently and efficiently access the data in a sharded array (with _elements and _ndarray variants):

For unsharded arrays, these methods gracefully fallback to referencing standard chunks. Each method has a cache parameter (ArrayShardedReadableExtCache) that stores shard indexes so that they do not have to be repeatedly retrieved and decoded.

Querying Chunk Bounds

Several convenience methods are available for querying the underlying chunk grid, such as the origin, shape, and subset of a chunk.

An ArraySubset spanning the entire array can be retrieved with subset_all.

Iterating Over Chunks / Regions

Iterating over chunks or regions is a common pattern. There are several approaches.

Serial Chunk Iteration

let indices = chunks.indices();
for chunk_indices in indices {
    ...
}

Parallel Chunk Iteration

let indices = chunks.indices();
indices.into_par_iter().try_for_each(|chunk_indices| {
    ...
})?;

warning

Reading chunks in parallel (as above) can use a lot of memory if chunks are large.

The zarrs crate internally uses a macro from the rayon_iter_concurrent_limit crate to limit chunk parallelism where reasonable. This macro is a simple wrapper over .into_par_iter().chunks(...).<func>. For example:

let chunk_concurrent_limit: usize = 4;
rayon_iter_concurrent_limit::iter_concurrent_limit!(
    chunk_concurrent_limit,
    indices,
    try_for_each,
    |chunk_indices| { 
        ...
    }
)?;

Chunk Caching

The standard Array retrieve methods do not perform any chunk caching. This means that requesting the same chunk again will result in another read from the store.

The ArrayChunkCacheExt trait adds Array retrieve methods that support chunk caching. Various types of chunk caches are supported (e.g. encoded cache, decoded cache, chunk limited, size limited, thread local, etc.). See the Chunk Caching section of the Array docs for more information on these methods.

Chunk caching is likely to be effective for remote stores where redundant retrievals are costly. However, chunk caching may not outperform disk caching with a filesystem store. The caches use internal locking to support multithreading, which has a performance overhead.

warning

Prefer not to use a chunk cache if chunks are not accessed repeatedly. Cached retrieve methods do not use partial decoders, and any intersected chunk is fully decoded if not present in the cache.

For many access patterns, chunk caching may reduce performance. Benchmark your algorithm/data.

Reading a String Array

A string array can be read as normal with any of the array retrieve methods.

let chunks_elements: Vec<String> = array.retrieve_chunks_elements(&chunks)?;
let chunks_array: ndarray::ArrayD<String> =
    array.retrieve_chunks_ndarray(&chunks)?;

However, this results in a string allocation per element. This can be avoided by retrieving the bytes directly and converting them to a Vec of string references. For example:

use itertools::Itertools; // for tuple_windows

let chunks_bytes: ArrayBytes = array.retrieve_chunks(&chunks)?;
let (bytes, offsets) = chunks_bytes.into_variable()?;
let string = String::from_utf8(bytes.into_owned())?;
let chunks_elements: Vec<&str> = offsets
    .iter()
    .tuple_windows()
    .map(|(&curr, &next)| &string[curr..next])
    .collect();
let chunks_array =
    ArrayD::<&str>::from_shape_vec(subset_all.shape_usize(), chunks_elements)?;

Writing Arrays

Array write methods are separated based on two storage traits:

  • [Async]WritableStorageTraits methods perform write operations exclusively, and
  • [Async]ReadableWritableStorageTraits methods perform write operations and may perform read operations.

warning

Misuse of [Async]ReadableWritableStorageTraits Array methods can result in data loss due to partial writes being lost. zarrs does not currently offer a β€œsynchronisation” API for locking chunks or array subsets.

Write-Only Methods

The [Async]WritableStorageTraits grouped methods exclusively perform write operations.

Store a Chunk

let chunk_indices: Vec<u64> = vec![1, 2];
let chunk_bytes: Vec<u8> = vec![...];
array.store_chunk(&chunk_indices, chunk_bytes.into())?;
let chunk_elements: Vec<f32> = vec![...];
array.store_chunk_elements(&chunk_indices, &chunk_elements)?;
let chunk_array = ArrayD::<f32>::from_shape_vec(
    vec![2, 2], // chunk shape
    chunk_elements
)?;
array.store_chunk_ndarray(&chunk_indices, chunk_array)?;

tip

If a chunk is written more than once, its element values depend on whichever operation wrote to the chunk last.

Store Chunks

store_chunks (and variants) will disassemble the input into chunks, then encode and store them in parallel.

let chunks = ArraySubset::new_with_ranges(&[0..2, 0..4]);
let chunks_bytes: Vec<u8> = vec![...];
array.store_chunks(&chunks, chunks_bytes.into())?;
// store_chunks_elements, store_chunks_ndarray...

Store an Encoded Chunk

An encoded chunk can be stored directly with store_encoded_chunk, bypassing the zarrs codec pipeline.

let encoded_chunk_bytes: Vec<u8> = ...;
array.store_encoded_chunk(&chunk_indices, encoded_chunk_bytes.into())?;

tip

Currently, the most performant path for uncompressed writing on Linux is to reuse page-aligned buffers via store_encoded_chunk with direct I/O enabled for the FilesystemStore. See zarrs GitHub issue #58 for a discussion of this method.

Read-Write Methods

The [Async]ReadableWritableStorageTraits grouped methods perform write operations and may perform read operations.

These methods perform partial encoding. Codecs that do not support true partial encoding will retrieve chunks in their entirety, then decode, update, and store them.

It is the responsibility of zarrs consumers to ensure:

  • store_chunk_subset is not called concurrently on the same chunk, and
  • store_array_subset is not called concurrently on array subsets sharing chunks.

Partial writes to a chunk may be lost if these rules are not respected.

Store a Chunk Subset

array.store_chunk_subset_elements::<f32>(
    // chunk indices
    &[3, 1],
    // subset within chunk
    &ArraySubset::new_with_ranges(&[1..2, 0..4]),
    // subset elements
    &[-4.0; 4],
)?;

Store an Array Subset

array.store_array_subset_elements::<f32>(
    &ArraySubset::new_with_ranges(&[0..8, 6..7]),
    &[123.0; 8],
)?;

Partial Encoding with the Sharding Codec

In zarrs, the sharding_indexed codec is the only codec that supports real partial encoding if the Experimental Partial Encoding option is enabled. If disabled (default), chunks are always fully decoded and updated before being stored.

To enable partial encoding:

// Set experimental_partial_encoding to true by default
zarrs::config::global_config_mut().set_experimental_partial_encoding(true);

// Manually set experimental_partial_encoding to true for an operation
let mut options = CodecOptions::default();
options.set_experimental_partial_encoding(true);

warning

The asynchronous API does not yet support partial encoding.

This enables Array::store_array_subset, Array::store_chunk_subset, Array::partial_encoder, and variants to use partial encoding for sharded arrays. Inner chunks can be written in an append-only fashion without reading previously written inner chunks (if their elements do not require updating).

warning

Since partial encoding is append-only for sharded arrays, updating a chunk does not remove the originally encoded data. Make sure to align writes to the inner chunks, otherwise your shards will be much larger than they should be.

Converting Zarr V2 to V3

Changing the Internal Representation

When an array or group is initialised, it internally holds metadata in the Zarr version it was created with.

To change the internal representation to Zarr V3, call to_v3() on an Array or Group, then call store_metadata() to update the stored metadata. V2 metadata must be explicitly erased if needed (see below).

note

While zarrs fully supports manipulation of Zarr V2 and V3 hierarchies (with supported codecs, data types, etc.), it only supports forward conversion of metadata from Zarr V2 to V3.

Convert a Group to V3

let group: Group = group.to_v3();
group.store_metadata()?;
// group.async_store_metadata().await?;
group.erase_metadata_opt(MetadataEraseVersion::V2)?;
// group.async_erase_metadata_opt(MetadataEraseVersion::V2).await?;

Convert an Array to V3

let array: Array = array.to_v3()?;
array.store_metadata()?;
// array.async_store_metadata().await?;
array.erase_metadata_opt(MetadataEraseVersion::V2)?;
// array.async_erase_metadata_opt(MetadataEraseVersion::V2).await?;

Note that Array::to_v3() is fallible because some V2 metadata is not V3 compatible.

Writing Versioned Metadata Explicitly

Rather than changing the internal representation, an alternative is to just write metadata with a specified version.

For groups, store_metadata_opt accepts a GroupMetadataOptions argument. GroupMetadataOptions currently has only one option, which impacts the Zarr version of the metadata. By default, GroupMetadataOptions keeps the current Zarr version.

To write Zarr V3 metadata:

group.store_metadata_opt(&
    GroupMetadataOptions::default()
    .with_metadata_convert_version(MetadataConvertVersion::V3)
)?;
// group.async_store_metadata_opt(...).await?;

warning

zarrs does not support converting Zarr V3 metadata to Zarr V2.

Note that the original metadata is not automatically deleted. If you want to delete it:

group.erase_metadata()?;
// group.async_erase_metadata().await?;

tip

The store_metadata methods of Array and Group internally call store_metadata_opt. Global defaults can be changed; see zarrs::config::Config.

ArrayMetadataOptions has similar options for changing the Zarr version of the metadata. It also has various other configuration options, see its documentation.
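
For example, a sketch mirroring the group example above, assuming ArrayMetadataOptions offers the same with_metadata_convert_version builder:

array.store_metadata_opt(&
    ArrayMetadataOptions::default()
    .with_metadata_convert_version(MetadataConvertVersion::V3)
)?;
// array.async_store_metadata_opt(...).await?;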

Python Bindings (zarrs-python)

The zarrs Python package (https://github.com/ilan-gold/zarrs-python) exposes a high-performance codec pipeline to the zarr reference implementation that uses zarrs under the hood. There is no need to learn a new API, and it is supported by downstream libraries like dask.

zarrs-python implements the ZarrsCodecPipeline. It can be used by the reference zarr Python implementation (v3.0.0+) for improved performance over the default BatchedCodecPipeline.

warning

zarrs-python is highly experimental and has some limitations compared to the reference implementation.

Enabling zarrs-python

The ZarrsCodecPipeline is enabled as follows:

from zarr import config
import zarrs # noqa: F401

config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

tip

The zarrs-python bindings are located in a repository called zarrs-python, but the Python package is called zarrs.

Downstream libraries that use zarr internally (dask, xarray, etc.) will also use the ZarrsCodecPipeline.

C/C++ Bindings (zarrs_ffi)

zarrs_ffi is a single header library: zarrs.h (API docs).

Currently zarrs_ffi only supports a subset of the zarrs API. However, it is sufficient for typical reading and writing of Zarr hierarchies.

These bindings are used in production at the Department of Materials Physics, Australian National University, Canberra, Australia.

CMake Quickstart

  1. Install the Rust compiler (and cargo).
  2. Put Findzarrs.cmake in your CMAKE_MODULE_PATH.
  3. find_package(zarrs <version> REQUIRED COMPONENTS zarrs/bz2)
    • Replace <version> with the latest release (e.g., 0.8 or 0.8.4)
    • zarrs is retrieved from GitHub using FetchContent and built using corrosion
    • Components are optional zarrs codecs
  4. The zarrs_ffi library is available as the zarrs::zarrs or zarrs::zarrs-static target.

A complete CMake example can be found in zarrs_ffi/examples/cmake_project.

For more comprehensive build instructions, see the zarrs_ffi/README.md.

Example

#include "zarrs.h"

void main() {
  // Open a filesystem store pointing to a zarr hierarchy
  ZarrsStorage storage = nullptr;
  zarrs_assert(zarrsCreateStorageFilesystem("/path/to/hierarchy.zarr", &storage));

  // Open an array in the hierarchy
  ZarrsArray array = nullptr;
  zarrsOpenArrayRW(storage, "/array", metadata, &array);

  // Get the array dimensionality
  size_t dimensionality;
  zarrs_assert(zarrsArrayGetDimensionality(array, &dimensionality));
  assert(dimensionality == 2);

  // Retrieve the decoded bytes of the chunk at [0, 0]
  size_t indices[] = {0, 0};
  size_t chunk_size;
  zarrs_assert(zarrsArrayGetChunkSize(array, 2, indices, &chunk_size));
  std::unique_ptr<uint8_t[]> chunk_bytes(new uint8_t[chunk_size]);
  zarrs_assert(zarrsArrayRetrieveChunk(array, 2, indices, chunk_size, chunk_bytes.get()));
}

Complete examples can be found in zarrs_ffi/examples.

zarrs_tools

zarrs_reencode

Reencode/rechunk a Zarr V2/V3 array to a Zarr V3 array.

Installation

zarrs_reencode is packaged by default with zarrs_tools and requires no extra features.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install zarrs_tools

Usage

zarrs_reencode --help
Reencode a Zarr array

Usage: zarrs_reencode [OPTIONS] <PATH_IN> <PATH_OUT>

Arguments:
  <PATH_IN>
          The zarr array input path or URL

  <PATH_OUT>
          The zarr array output directory

Options:
  -d, --data-type <DATA_TYPE>
          The data type as a string
          
          Valid data types:
            - bool
            - int8, int16, int32, int64
            - uint8, uint16, uint32, uint64
            - float16, float32, float64, bfloat16
            - complex64, complex128
            - r* (raw bits, where * is a multiple of 8)

  -f, --fill-value <FILL_VALUE>
          Fill value. See <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value>
          
          The fill value must be compatible with the data type.
          
          Examples:
            int/uint: 0 100 -100
            float: 0.0 "NaN" "Infinity" "-Infinity"
            r*: "[0, 255]"

      --separator <SEPARATOR>
          The chunk key encoding separator. Either . or /

  -c, --chunk-shape <CHUNK_SHAPE>
          Chunk shape. A comma separated list of the chunk size along each array dimension.
          
          If any dimension has size zero, it will be set to match the array shape.

  -s, --shard-shape <SHARD_SHAPE>
          Shard shape. A comma separated list of the shard size along each array dimension.
          
          If specified, the array is encoded using the sharding codec.
          If any dimension has size zero, it will be set to match the array shape.

      --array-to-array-codecs <ARRAY_TO_ARRAY_CODECS>
          Array to array codecs.
          
          JSON holding an array of array to array codec metadata.
          
          Examples:
            '[ { "name": "transpose", "configuration": { "order": [0, 2, 1] } } ]'
            '[ { "name": "bitround", "configuration": { "keepbits": 9 } } ]'

      --array-to-bytes-codec <ARRAY_TO_BYTES_CODEC>
          Array to bytes codec.
          
          JSON holding array to bytes codec metadata.
          
          Examples:
            '{ "name": "bytes", "configuration": { "endian": "little" } }'
            '{ "name": "pcodec", "configuration": { "level": 12 } }'
            '{ "name": "zfp", "configuration": { "mode": "fixedprecision", "precision": 19 } }'

      --bytes-to-bytes-codecs <BYTES_TO_BYTES_CODECS>
          Bytes to bytes codecs.
          
          JSON holding an array of bytes to bytes codec configurations.
          
          Examples:
            '[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]'
            '[ { "name": "bz2", "configuration": { "level": 9 } } ]'
            '[ { "name": "crc32c" } ]'
            '[ { "name": "gzip", "configuration": { "level": 9 } } ]'
            '[ { "name": "zstd", "configuration": { "level": 22, "checksum": false } } ]'

      --dimension-names <DIMENSION_NAMES>
          Dimension names (optional). Comma separated.

      --attributes <ATTRIBUTES>
          Attributes (optional).
          
          JSON holding array attributes.

      --attributes-append <ATTRIBUTES_APPEND>
          Attributes to append (optional).
          
          JSON holding array attributes.

      --concurrent-chunks <CONCURRENT_CHUNKS>
          Number of concurrent chunks

      --ignore-checksums
          Ignore checksums.
          
          If set, checksum validation in codecs (e.g. crc32c) is skipped.

      --validate
          Validate written data

  -v, --verbose
          Print verbose information, such as the array header

      --cache-size <CACHE_SIZE>
          An optional chunk cache size (in bytes)

      --cache-chunks <CACHE_CHUNKS>
          An optional chunk cache size (in chunks)

      --cache-size-thread <CACHE_SIZE_THREAD>
          An optional per-thread chunk cache size (in bytes)

      --cache-chunks-thread <CACHE_CHUNKS_THREAD>
          An optional per-thread chunk cache size (in chunks)

      --write-shape <WRITE_SHAPE>
          Write shape (optional). A comma separated list of the write size along each array dimension.
          
          Use this parameter to incrementally write shards in batches of chunks of the specified write shape.
          The write shape defaults to the shard shape for sharded arrays.
          This parameter is ignored for unsharded arrays (the write shape is the chunk shape).
          
          Prefer to set the write shape to an integer multiple of the chunk shape to avoid unnecessary reads.

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Example

Reencode array.zarr (uint16) with:

  • a chunk shape of [32, 32, 32],
  • a shard shape of [128, 128, 0]
    • the last dimension of the shard shape will match the array shape to the nearest multiple of the chunk shape
  • level 9 blosclz compression with bitshuffling
  • an input chunk cache with a size of 1GB

zarrs_reencode \
--cache-size 1000000000 \
--chunk-shape 32,32,32 \
--shard-shape 128,128,0 \
--bytes-to-bytes-codecs '[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]' \
array.zarr array_reencode.zarr

zarrs_ome

Convert a Zarr array to an OME-Zarr multiscales hierarchy.

warning

zarrs_ome is highly experimental and has had limited production testing.

Conformance with the OME-Zarr 0.5-dev specification is not guaranteed and input validation is currently limited.

zarrs_ome creates a multi-resolution Zarr V3 array through various methods:

  • Gaussian image pyramid
  • Mean downsampling
  • Mode downsampling (for discrete data)

The downsample factor defaults to 2 on all axes (careful if data includes channels!). The physical size and units of the array elements can be set explicitly. The array can be reencoded when output to OME-Zarr.

Installation

zarrs_ome is installed with the ome feature of zarrs_tools.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install --features=ome zarrs_tools

Usage

zarrs_ome --help
Convert a Zarr array to an OME-Zarr multiscales hierarchy

Usage: zarrs_ome [OPTIONS] <INPUT> <OUTPUT> [DOWNSAMPLE_FACTOR]...

Arguments:
  <INPUT>
          The input array path

  <OUTPUT>
          The output group path

  [DOWNSAMPLE_FACTOR]...
          The downsample factor per axis, comma separated.
          
          Defaults to 2 on each axis.

Options:
      --ome-zarr-version <OME_ZARR_VERSION>
          [default: 0.5]

          Possible values:
          - 0.5: https://ngff.openmicroscopy.org/0.5/

      --max-levels <MAX_LEVELS>
          Maximum number of downsample levels
          
          [default: 10]

      --physical-size <PHYSICAL_SIZE>
          Physical size per axis, comma separated

      --physical-units <PHYSICAL_UNITS>
          Physical units per axis, comma separated.
          
          Set to "channel" for a channel axis.

      --name <NAME>
          OME Zarr dataset name

      --discrete
          Set to true for discrete data.
          
          Performs majority downsampling instead of creating a Gaussian image pyramid or mean downsampling.

      --gaussian-sigma <GAUSSIAN_SIGMA>
          The Gaussian "sigma" to apply when creating a Gaussian image pyramid per axis, comma separated.
          
          This is typically set to 0.5 times the downsample factor for each axis. If omitted, then mean downsampling is applied.
          
          Ignored for discrete data.

      --gaussian-kernel-half-size <GAUSSIAN_KERNEL_HALF_SIZE>
          The Gaussian kernel half size per axis, comma separated.
          
          If omitted, defaults to ceil(3 * sigma).
          
          Ignored for discrete data or if --gaussian-sigma is not set.

      --exists <EXISTS>
          Behaviour if the output exists
          
          [default: erase]

          Possible values:
          - erase:     Erase the output
          - overwrite: Overwrite existing files. Useful if the output includes additional non-zarr files to be preserved. May fail if changing the encoding
          - exit:      Exit if the output already exists

      --group-attributes <GROUP_ATTRIBUTES>
          Attributes (optional).
          
          JSON holding group attributes.

  -d, --data-type <DATA_TYPE>
          The data type as a string
          
          Valid data types:
            - bool
            - int8, int16, int32, int64
            - uint8, uint16, uint32, uint64
            - float16, float32, float64, bfloat16
            - complex64, complex128
            - r* (raw bits, where * is a multiple of 8)

  -f, --fill-value <FILL_VALUE>
          Fill value. See <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value>
          
          The fill value must be compatible with the data type.
          
          Examples:
            int/uint: 0 100 -100
            float: 0.0 "NaN" "Infinity" "-Infinity"
            r*: "[0, 255]"

      --separator <SEPARATOR>
          The chunk key encoding separator. Either . or /

  -c, --chunk-shape <CHUNK_SHAPE>
          Chunk shape. A comma separated list of the chunk size along each array dimension.
          
          If any dimension has size zero, it will be set to match the array shape.

  -s, --shard-shape <SHARD_SHAPE>
          Shard shape. A comma separated list of the shard size along each array dimension.
          
          If specified, the array is encoded using the sharding codec.
          If any dimension has size zero, it will be set to match the array shape.

      --array-to-array-codecs <ARRAY_TO_ARRAY_CODECS>
          Array to array codecs.
          
          JSON holding an array of array to array codec metadata.
          
          Examples:
            '[ { "name": "transpose", "configuration": { "order": [0, 2, 1] } } ]'
            '[ { "name": "bitround", "configuration": { "keepbits": 9 } } ]'

      --array-to-bytes-codec <ARRAY_TO_BYTES_CODEC>
          Array to bytes codec.
          
          JSON holding array to bytes codec metadata.
          
          Examples:
            '{ "name": "bytes", "configuration": { "endian": "little" } }'
            '{ "name": "pcodec", "configuration": { "level": 12 } }'
            '{ "name": "zfp", "configuration": { "mode": "fixedprecision", "precision": 19 } }'

      --bytes-to-bytes-codecs <BYTES_TO_BYTES_CODECS>
          Bytes to bytes codecs.
          
          JSON holding an array of bytes to bytes codec configurations.
          
          Examples:
            '[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]'
            '[ { "name": "bz2", "configuration": { "level": 9 } } ]'
            '[ { "name": "crc32c" } ]'
            '[ { "name": "gzip", "configuration": { "level": 9 } } ]'
            '[ { "name": "zstd", "configuration": { "level": 22, "checksum": false } } ]'

      --dimension-names <DIMENSION_NAMES>
          Dimension names (optional). Comma separated.

      --attributes <ATTRIBUTES>
          Attributes (optional).
          
          JSON holding array attributes.

      --attributes-append <ATTRIBUTES_APPEND>
          Attributes to append (optional).
          
          JSON holding array attributes.

      --chunk-limit <CHUNK_LIMIT>
          The maximum number of chunks concurrently processed.
          
          By default, this is set to the number of CPUs. Consider reducing this for images with large chunk sizes or on systems with low memory availability.

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

Match Input Encoding

zarrs_ome \
    --name "ABC-123" \
    --physical-size 2.0,2.0,2.0 \
    --physical-units micrometer,micrometer,micrometer \
    array.zarr array.ome.zarr
[00:00:00/00:00:00] 0 [1243, 1403, 1510] array.ome.zarr/0 rw:0.00/0.76 p:0.00
[00:00:14/00:00:14] 1 [621, 701, 755] array.ome.zarr/1 rw:1.95/0.51 p:12.24
[00:00:01/00:00:01] 2 [310, 350, 377] array.ome.zarr/2 rw:0.62/0.13 p:3.58
[00:00:00/00:00:00] 3 [155, 175, 188] array.ome.zarr/3 rw:0.06/0.01 p:0.26
[00:00:00/00:00:00] 4 [77, 87, 94] array.ome.zarr/4 rw:0.00/0.00 p:0.03
[00:00:00/00:00:00] 5 [38, 43, 47] array.ome.zarr/5 rw:0.00/0.00 p:0.01
[00:00:00/00:00:00] 6 [19, 21, 23] array.ome.zarr/6 rw:0.00/0.00 p:0.01
[00:00:00/00:00:00] 7 [9, 10, 11] array.ome.zarr/7 rw:0.00/0.00 p:0.00
[00:00:00/00:00:00] 8 [4, 5, 5] array.ome.zarr/8 rw:0.00/0.00 p:0.00
[00:00:00/00:00:00] 9 [2, 2, 2] array.ome.zarr/9 rw:0.00/0.00 p:0.00
[00:00:00/00:00:00] 10 [1, 1, 1] array.ome.zarr/10 rw:0.00/0.00 p:0.00

Change Encoding and Downsampling Factor

zarrs_ome \
    --name "ABC-123" \
    --physical-size 2.0,2.0,2.0 \
    --physical-units micrometer,micrometer,micrometer \
    --shard-shape 256,256,256 \
    --chunk-shape 32,32,32 \
    array.zarr array.ome.zarr 1,4,4
[00:00:01/00:00:01] 0 [1243, 1403, 1510] array.ome.zarr/0 rw:25.09/24.50 p:0.00
[00:00:12/00:00:12] 1 [1243, 350, 377] array.ome.zarr/1 rw:5.51/1.21 p:26.79
[00:00:00/00:00:00] 2 [1243, 87, 94] array.ome.zarr/2 rw:0.47/0.13 p:2.97
[00:00:00/00:00:00] 3 [1243, 21, 23] array.ome.zarr/3 rw:0.07/0.00 p:0.16
[00:00:00/00:00:00] 4 [1243, 5, 5] array.ome.zarr/4 rw:0.01/0.00 p:0.02
[00:00:00/00:00:00] 5 [1243, 1, 1] array.ome.zarr/5 rw:0.01/0.00 p:0.00

zarrs_filter

Apply simple image filters (transformations) to an array.

warning

zarrs_filter is highly experimental, has had limited production testing, and is sparsely documented.

The filters currently supported are:

  • reencode: Reencode (change encoding, data type, etc.).
  • crop: Crop given an offset and shape.
  • rescale: Rescale values given a multiplier and offset.
  • clamp: Clamp values between a minimum and maximum.
  • equal: Return a binary image where the input is equal to some value.
  • downsample: Downsample given a stride.
  • gradient-magnitude: Compute the gradient magnitude (sobel).
  • gaussian: Apply a Gaussian kernel.
  • summed area table: Compute the summed area table.
  • guided filter: Apply a guided filter (edge-preserving noise filter).

Installation

zarrs_filter is installed with the filter feature of zarrs_tools.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install --features=filter zarrs_tools

Usage

zarrs_filter --help
Apply simple image filters (transformations) to a Zarr array

Usage: zarrs_filter [OPTIONS] [RUN_CONFIG] [COMMAND]

Commands:
  reencode            Reencode an array
  crop                Crop an array given an offset and shape
  rescale             Rescale array values given a multiplier and offset
  clamp               Clamp values between a minimum and maximum
  equal               Return a binary image where the input is equal to some value
  downsample          Downsample an image given a stride
  gradient-magnitude  Compute the gradient magnitude (sobel)
  gaussian            Apply a Gaussian kernel
  summed-area-table   Compute a summed area table (integral image)
  guided-filter       Apply a guided filter (edge-preserving noise filter)
  replace-value       Replace a value with another value
  help                Print this message or the help of the given subcommand(s)

Arguments:
  [RUN_CONFIG]
          Path to a JSON run configuration

Options:
      --exists <EXISTS>
          Behaviour if the output exists
          
          [default: erase]

          Possible values:
          - erase: Erase the output
          - exit:  Exit if the output already exists

      --tmp <TMP>
          Directory for temporary arrays.
          
          If omitted, defaults to the platform-specific temporary directory (e.g. ${TMPDIR}, /tmp, etc.)

      --chunk-limit <CHUNK_LIMIT>
          The maximum number of chunks concurrently processed.
          
          By default, this is set to the number of CPUs. Consider reducing this for images with large chunk sizes or on systems with low memory availability.

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Run zarrs_filter <COMMAND> --help for more information on a specific command.

Examples (CLI)

export ENCODE_ARGS="--shard-shape 256,256,256 --chunk-shape 32,32,32"
zarrs_filter reencode           array.zarr       array_reenc.zarr               ${ENCODE_ARGS}
zarrs_filter reencode           array_reenc.zarr array_reenc_int32.zarr         ${ENCODE_ARGS} --data-type int32
zarrs_filter reencode           array_reenc.zarr array_reenc_float32.zarr       ${ENCODE_ARGS} --data-type float32
zarrs_filter crop               array_reenc.zarr array_crop.zarr                ${ENCODE_ARGS} --data-type float32 256,256,256 768,768,768
zarrs_filter rescale            array_reenc.zarr array_rescale.zarr             ${ENCODE_ARGS} --data-type float32 2.0 1.0 --fill-value 1.0
zarrs_filter clamp              array_reenc.zarr array_clamp.zarr               ${ENCODE_ARGS} --data-type float32 5 255 --fill-value 5.0
# zarrs_filter equal              array_reenc.zarr array_eq_bool.zarr             ${ENCODE_ARGS} --data-type bool 1 --fill-value true
zarrs_filter equal              array_reenc.zarr array_eq_u8.zarr               ${ENCODE_ARGS} --data-type uint8 1 --fill-value 1
zarrs_filter downsample         array_reenc.zarr array_downsample.zarr          ${ENCODE_ARGS} --data-type float32 2,2,2
zarrs_filter downsample         array_eq_u8.zarr array_downsample_discrete.zarr ${ENCODE_ARGS} --data-type uint8 2,2,2 --discrete
zarrs_filter gradient-magnitude array_reenc.zarr array_gradient_magnitude.zarr  ${ENCODE_ARGS} --data-type float32
zarrs_filter gaussian           array_reenc.zarr array_gaussian.zarr            ${ENCODE_ARGS} --data-type float32 1.0,1.0,1.0 3,3,3
zarrs_filter summed-area-table  array_reenc.zarr array_sat.zarr                 ${ENCODE_ARGS} --data-type int64
zarrs_filter guided-filter      array_reenc.zarr array_guided_filter.zarr       ${ENCODE_ARGS} --data-type float32 40000 3
zarrs_filter replace-value      array_reenc.zarr array_replace.zarr             ${ENCODE_ARGS} 65535 0 --fill-value 0
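
The top-level --exists and --tmp options documented above are placed before the subcommand. A minimal sketch reusing the encode arguments above; /mnt/scratch and array_reenc2.zarr are placeholder paths:

# abort instead of erasing if the output exists; keep temporary arrays on a scratch disk
zarrs_filter --exists exit --tmp /mnt/scratch reencode array.zarr array_reenc2.zarr ${ENCODE_ARGS}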

Examples (Config)

zarrs_filter <RUN.json>

A run configuration is a JSON array of filters that are executed in order. A filter that omits input consumes the output of the previous filter. Outputs prefixed with $ (e.g. $reencode0) are intermediate arrays; they are written to the temporary directory (see --tmp) and can be referenced as inputs by later filters.

run.json
[
    {
        "_comment": "Rechunk the input",
        "filter": "reencode",
        "input": "array.zarr",
        "output": "$reencode0",
        "shard_shape": [256, 256, 256],
        "chunk_shape": [32, 32, 32]
    },
    {
        "_comment": "Reencode the previous output as float32, automatically cast the fill value",
        "filter": "reencode",
        "output": "array_float32.zarr",
        "data_type": "float32"
    },
    {
        "filter": "crop",
        "input": "$reencode0",
        "output": "array_crop.zarr",
        "offset": [256, 256, 256],
        "shape": [768, 768, 768]
    },
    {
        "filter": "replace_value",
        "input": "$reencode0",
        "output": "array_replace.zarr",
        "value": 65535,
        "replace": 0
    },
    {
        "_comment": "Multiply by 7.0/20000.0, casting most values in the image between 0 and 7, store in 8-bit (saturate cast)",
        "filter": "rescale",
        "input": "$reencode0",
        "output": "array_3bit.zarr",
        "multiply": 0.00035,
        "add": 0.0,
        "data_type": "uint8",
        "fill_value": 0
    },
    {
        "_comment": "Multiply by 255.0/20000.0, casting most values in the image between 0 and 7, store in 8-bit (saturate cast)",
        "filter": "rescale",
        "input": "$reencode0",
        "output": "array_8bit.zarr",
        "multiply": 0.01275,
        "add": 0.0,
        "data_type": "uint8",
        "fill_value": 0
    },
    {
        "_comment": "Clamp the 3-bit output between 2 and 5 and set the fill value to 2",
        "filter": "clamp",
        "output": "array_3bit_clamp.zarr",
        "min": 2,
        "max": 5,
        "fill_value": 2
    },
    {
        "_comment": "Calculate a binary image where the input is equal to 5 (the max from the clamp). Store as bool",
        "filter": "equal",
        "input": "array_3bit_clamp.zarr", 
        "output": "array_clamp_equal_bool.zarr",
        "value": 5
    },
    {
        "_comment": "Calculate a binary image where the input is equal to 5 (the max from the clamp). Store as uint8",
        "filter": "equal",
        "input": "array_3bit_clamp.zarr",
        "output": "array_3bit_max.zarr",
        "value": 5,
        "data_type": "uint8",
        "fill_value": 0
    },
    {
        "_comment": "Downsample clamped image by a factor of 2 with mean operator.",
        "filter": "downsample",
        "input": "array_3bit_clamp.zarr",
        "output": "array_3bit_clamp_by2_continuous.zarr",
        "stride": [2, 2, 2],
        "discrete": false,
        "data_type": "float32",
        "shard_shape": [128, 128, 128],
        "chunk_shape": [32, 32, 32]
    },
    {
        "_comment": "Downsample clamped image by a factor of 2 with mode operator.",
        "filter": "downsample",
        "input": "array_3bit_clamp.zarr",
        "output": "array_3bit_clamp_by2_discrete.zarr",
        "stride": [2, 2, 2],
        "discrete": true,
        "shard_shape": [128, 128, 128],
        "chunk_shape": [32, 32, 32]
    },
    {
        "filter": "gradient_magnitude",
        "input": "$reencode0",
        "output": "array_gradient.zarr"
    },
    {
        "filter": "gaussian",
        "input": "$reencode0",
        "output": "array_gaussian.zarr",
        "sigma": [1.0, 1.0, 1.0],
        "kernel_half_size": [3, 3, 3]
    },
    {
        "filter": "summed_area_table",
        "input": "$reencode0",
        "output": "array_sat.zarr",
        "data_type": "float32"
    },
    {
        "filter": "guided_filter",
        "input": "$reencode0",
        "output": "array_guided_filter.zarr",
        "epsilon": 40000.0,
        "radius": 3,
        "data_type": "float32"
    }
]
output
0 reencode
        args:   {}
        encode: {"chunk_shape":[32,32,32],"shard_shape":[256,256,256]}
        input:  uint16 [1243, 1403, 1510] "array.zarr"
        output: uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
1 reencode
        args:   {}
        encode: {"data_type":"float32"}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: float32 [1243, 1403, 1510] "array_float32.zarr" (overwrite)
2 crop
        args:   {"offset":[256,256,256],"shape":[768,768,768]}
        encode: {}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: uint16 [768, 768, 768] "array_crop.zarr" (overwrite)
3 replace_value
        args:   {"value":65535,"replace":0}
        encode: {}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: uint16 [1243, 1403, 1510] "array_replace.zarr" (overwrite)
4 rescale
        args:   {"multiply":0.00035,"add":0.0,"add_first":false}
        encode: {"data_type":"uint8","fill_value":0}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: uint8 [1243, 1403, 1510] "array_3bit.zarr" (overwrite)
5 rescale
        args:   {"multiply":0.01275,"add":0.0,"add_first":false}
        encode: {"data_type":"uint8","fill_value":0}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: uint8 [1243, 1403, 1510] "array_8bit.zarr" (overwrite)
6 clamp
        args:   {"min":2.0,"max":5.0}
        encode: {"fill_value":2}
        input:  uint8 [1243, 1403, 1510] "array_8bit.zarr"
        output: uint8 [1243, 1403, 1510] "array_3bit_clamp.zarr" (overwrite)
7 equal
        args:   {"value":5}
        encode: {}
        input:  uint8 [1243, 1403, 1510] "array_3bit_clamp.zarr"
        output: bool [1243, 1403, 1510] "array_clamp_equal_bool.zarr" (overwrite)
8 equal
        args:   {"value":5}
        encode: {"data_type":"uint8","fill_value":0}
        input:  uint8 [1243, 1403, 1510] "array_3bit_clamp.zarr"
        output: uint8 [1243, 1403, 1510] "array_3bit_max.zarr" (overwrite)
9 downsample
        args:   {"stride":[2,2,2],"discrete":false}
        encode: {"data_type":"float32","chunk_shape":[32,32,32],"shard_shape":[128,128,128]}
        input:  uint8 [1243, 1403, 1510] "array_3bit_clamp.zarr"
        output: float32 [621, 701, 755] "array_3bit_clamp_by2_continuous.zarr" (overwrite)
10 downsample
        args:   {"stride":[2,2,2],"discrete":true}
        encode: {"chunk_shape":[32,32,32],"shard_shape":[128,128,128]}
        input:  uint8 [1243, 1403, 1510] "array_3bit_clamp.zarr"
        output: uint8 [621, 701, 755] "array_3bit_clamp_by2_discrete.zarr" (overwrite)
11 gradient_magnitude
        args:   {}
        encode: {}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: uint16 [1243, 1403, 1510] "array_gradient.zarr" (overwrite)
12 gaussian
        args:   {"sigma":[1.0,1.0,1.0],"kernel_half_size":[3,3,3]}
        encode: {}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: uint16 [1243, 1403, 1510] "array_gaussian.zarr" (overwrite)
13 summed area table
        args:   {}
        encode: {"data_type":"float32"}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: float32 [1243, 1403, 1510] "array_sat.zarr" (overwrite)
14 guided_filter
        args:   {"epsilon":40000.0,"radius":3}
        encode: {"data_type":"float32"}
        input:  uint16 [1243, 1403, 1510] "/tmp/.tmpCbeEcJ/$reencode0bxiFEM"
        output: float32 [1243, 1403, 1510] "array_guided_filter.zarr" (overwrite)
[00:00:02/00:00:02] reencode /tmp/.tmpCbeEcJ/$reencode0bxiFEM rw:34.78/28.90 p:0.00
[00:00:04/00:00:04] reencode array_float32.zarr rw:30.06/76.57 p:14.16
[00:00:00/00:00:00] crop array_crop.zarr rw:3.46/3.34 p:0.00
[00:00:02/00:00:02] replace_value array_replace.zarr rw:26.73/47.32 p:7.25
[00:00:01/00:00:01] rescale array_3bit.zarr rw:18.11/14.43 p:11.55
[00:00:01/00:00:01] rescale array_8bit.zarr rw:23.54/21.99 p:11.08
[00:00:00/00:00:00] clamp array_3bit_clamp.zarr rw:9.70/10.34 p:0.96
[00:00:00/00:00:00] equal array_clamp_equal_bool.zarr rw:10.61/9.32 p:4.56
[00:00:00/00:00:00] equal array_3bit_max.zarr rw:10.29/9.49 p:3.61
[00:00:02/00:00:02] downsample array_3bit_clamp_by2_continuous.zarr rw:7.01/1.95 p:71.76
[00:00:06/00:00:06] downsample array_3bit_clamp_by2_discrete.zarr rw:16.08/1.01 p:168.86
[00:00:20/00:00:20] gradient_magnitude array_gradient.zarr rw:147.16/14.38 p:289.05
[00:00:10/00:00:10] gaussian array_gaussian.zarr rw:36.19/22.01 p:181.06
[00:00:23/00:00:23] summed area table array_sat.zarr rw:190.51/215.68 p:54.39
[00:01:51/00:01:51] guided_filter array_guided_filter.zarr rw:29.57/59.96 p:2427.96

zarrs_validate

Compare the data in two Zarr arrays.

Installation

zarrs_validate is installed with the validate feature of zarrs_tools.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install --features=validate zarrs_tools

Usage

zarrs_validate --help
Compare the data in two Zarr arrays.

Equality of the arrays is determined by comparing the shape, data type, and data.

Differences in encoding (e.g. codecs, chunk key encoding) and attributes are ignored.

Usage: zarrs_validate [OPTIONS] <FIRST> <SECOND>

Arguments:
  <FIRST>
          The path to the first zarr array

  <SECOND>
          The path to the second zarr array

Options:
      --concurrent-chunks <CONCURRENT_CHUNKS>
          Number of concurrent chunks to compare

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
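
Example

A minimal sketch comparing two hypothetical arrays (e.g. an original array and the output of a zarrs_filter reencode), using only the options documented above:

# array.zarr and array_reenc.zarr are hypothetical paths
zarrs_validate --concurrent-chunks 4 array.zarr array_reenc.zarr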

zarrs_info

Get information about a Zarr array or group.

Installation

zarrs_info is installed with the info feature of zarrs_tools.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install --features=info zarrs_tools

Usage

zarrs_info --help
Get information about a Zarr array or group.

Outputs are JSON encoded.

Usage: zarrs_info [OPTIONS] <PATH> <COMMAND>

Commands:
  metadata         Get the array/group metadata
  metadata-v3      Get the array/group metadata (interpreted as V3)
  attributes       Get the array/group attributes
  shape            Get the array shape
  data-type        Get the array data type
  fill-value       Get the array fill value
  dimension-names  Get the array dimension names
  range            Get the array data range
  histogram        Get the array data histogram
  help             Print this message or the help of the given subcommand(s)

Arguments:
  <PATH>
          Path to the Zarr input array or group

Options:
      --chunk-limit <CHUNK_LIMIT>
          The maximum number of chunks concurrently processed.
          
          Defaults to the RAYON_NUM_THREADS environment variable or the number of logical CPUs. Consider reducing this for images with large chunk sizes or on systems with low memory availability.
          
          [default: 24]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

Data Type

zarrs_info array.zarr data-type
{
  "data_type": "uint16"
}

Array Shape

zarrs_info array.zarr shape
{
  "shape": [
    1243,
    1403,
    1510
  ]
}

Data Range

zarrs_info array.zarr range
{
  "min": 0,
  "max": 65535
}
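
Metadata

The remaining subcommands follow the same pattern. For example, the full array/group metadata can be printed with the metadata subcommand (output omitted here):

# prints the metadata of the array used in the examples above
zarrs_info array.zarr metadata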

zarrs_binary2zarr

Create a Zarr V3 array from piped binary data.

Installation

zarrs_binary2zarr is installed with the binary2zarr feature of zarrs_tools.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install --features=binary2zarr zarrs_tools

Example

chameleon_1024x1024x1080.uint16 is an uncompressed binary 3D image split into multiple files, with:

  • (depth, height, width) = (1080, 1024, 1024)
  • data type = uint16
tree --du -h chameleon_1024x1024x1080.uint16
[2.1G]  chameleon_1024x1024x1080.uint16
β”œβ”€β”€ [512M]  xaa
β”œβ”€β”€ [512M]  xab
β”œβ”€β”€ [512M]  xac
β”œβ”€β”€ [512M]  xad
└── [112M]  xae

With the following command, the image is encoded as a Zarr array using the sharding codec with a shard shape of (128, 1024, 1024), where:

  • inner chunks in each shard have a chunk shape of (32, 32, 32)
  • inner chunks are compressed using the blosc codec
cat chameleon_1024x1024x1080.uint16/* | \
zarrs_binary2zarr \
--data-type uint16 \
--fill-value 0 \
--separator '.' \
--array-shape 1080,1024,1024 \
--chunk-shape 32,32,32 \
--shard-shape 128,1024,1024 \
--bytes-to-bytes-codecs '[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]' \
chameleon_1024x1024x1080.zarr
tree --du -h chameleon_1024x1024x1080.zarr
[1.3G]  chameleon_1024x1024x1080.zarr
β”œβ”€β”€ [152M]  c.0.0.0
β”œβ”€β”€ [157M]  c.1.0.0
β”œβ”€β”€ [157M]  c.2.0.0
β”œβ”€β”€ [156M]  c.3.0.0
β”œβ”€β”€ [152M]  c.4.0.0
β”œβ”€β”€ [150M]  c.5.0.0
β”œβ”€β”€ [152M]  c.6.0.0
β”œβ”€β”€ [152M]  c.7.0.0
β”œβ”€β”€ [ 67M]  c.8.0.0
└── [1.2K]  zarr.json
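
If zarrs_info is installed (see the previous chapter), the shape of the new array can be spot-checked against the expected (1080, 1024, 1024):

# shape should report [1080, 1024, 1024]
zarrs_info chameleon_1024x1024x1080.zarr shape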

zarrs_ncvar2zarr

Convert a netCDF variable to a Zarr V3 array. Multi-file variables are supported.

Installation

zarrs_ncvar2zarr is installed with the ncvar2zarr feature of zarrs_tools.

Prebuilt Binaries

# Requires cargo-binstall https://github.com/cargo-bins/cargo-binstall
cargo binstall zarrs_tools

From Source

cargo install --features=ncvar2zarr zarrs_tools

Usage

zarrs_ncvar2zarr --help
Convert a netCDF variable to a Zarr V3 array

Usage: zarrs_ncvar2zarr [OPTIONS] --fill-value <FILL_VALUE> --chunk-shape <CHUNK_SHAPE> <INPUT> <VARIABLE> <OUT>

Arguments:
  <INPUT>
          The path to a netCDF file or a directory of netCDF files

  <VARIABLE>
          The name of the netCDF variable

  <OUT>
          The output directory for the zarr array

Options:
  -f, --fill-value <FILL_VALUE>
          Fill value. See https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value
          
          The fill value must be compatible with the data type.
          
          Examples:
            int/uint: 0 100 -100
            float: 0.0 "NaN" "Infinity" "-Infinity"
            r*: "[0, 255]"

      --separator <SEPARATOR>
          The chunk key encoding separator. Either . or /
          
          [default: /]

  -c, --chunk-shape <CHUNK_SHAPE>
          Chunk shape. A comma separated list of the chunk size along each array dimension.
          
          If any dimension has size zero, it will be set to match the array shape.

  -s, --shard-shape <SHARD_SHAPE>
          Shard shape (optional). A comma separated list of the shard size along each array dimension.
          
          If specified, the array is encoded using the sharding codec.
          If any dimension has size zero, it will be set to match the array shape.

      --array-to-array-codecs <ARRAY_TO_ARRAY_CODECS>
          Array to array codecs (optional).
          
          JSON holding an array of array to array codec metadata.
          
          Examples:
            '[ { "name": "transpose", "configuration": { "order": [0, 2, 1] } } ]'
            '[ { "name": "bitround", "configuration": { "keepbits": 9 } } ]'

      --array-to-bytes-codec <ARRAY_TO_BYTES_CODEC>
          Array to bytes codec (optional).
          
          JSON holding array to bytes codec metadata.
          If unspecified, this defaults to the `bytes` codec.
          
          The sharding codec can be used by setting `shard_shape`, but this can also be done explicitly here.
          
          Examples:
            '{ "name": "bytes", "configuration": { "endian": "little" } }'
            '{ "name": "pcodec", "configuration": { "level": 12 } }'
            '{ "name": "zfp", "configuration": { "mode": "fixedprecision", "precision": 19 } }'

      --bytes-to-bytes-codecs <BYTES_TO_BYTES_CODECS>
          Bytes to bytes codecs (optional).
          
          JSON holding an array of bytes to bytes codec configurations.
          
          Examples:
            '[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]'
            '[ { "name": "bz2", "configuration": { "level": 9 } } ]'
            '[ { "name": "crc32c" ]'
            '[ { "name": "gzip", "configuration": { "level": 9 } } ]'
            '[ { "name": "zstd", "configuration": { "level": 22, "checksum": false } } ]'

      --attributes <ATTRIBUTES>
          Attributes (optional).
          
          JSON holding array attributes.

      --concurrent-chunks <CONCURRENT_CHUNKS>
          Number of concurrent chunks

      --memory-test
          Write to memory

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Example

tomoLoRes_nc is a directory of netCDF files, each containing a 3D "tomo" variable that has been split along dimension 0, with:

  • (depth, height, width) = (1209, 480, 480)
  • data type = uint16
tree --du -h tomoLoRes_nc
[532M]  tomoLoRes_nc
β”œβ”€β”€ [528M]  block00000000.nc
└── [4.0M]  block00000001.nc

With the following command, the image is encoded as a Zarr array using the sharding codec with a shard shape of (128, 480, 480), where:

  • inner chunks in each shard have a chunk shape of (32, 32, 32)
  • inner chunks are compressed using the blosc codec
zarrs_ncvar2zarr \
--fill-value -32768 \
--separator '.' \
--chunk-shape 32,32,32 \
--shard-shape 128,0,0 \
--bytes-to-bytes-codecs '[ { "name": "blosc", "configuration": { "cname": "blosclz", "clevel": 9, "shuffle": "bitshuffle", "typesize": 2, "blocksize": 0 } } ]' \
tomoLoRes_nc \
tomo \
tomoLoRes_nc.zarr
tree --du -h tomoLoRes_nc.zarr
[329M]  tomoLoRes_nc.zarr
β”œβ”€β”€ [ 30M]  c.0.0.0
β”œβ”€β”€ [ 35M]  c.1.0.0
β”œβ”€β”€ [ 36M]  c.2.0.0
β”œβ”€β”€ [ 36M]  c.3.0.0
β”œβ”€β”€ [ 36M]  c.4.0.0
β”œβ”€β”€ [ 36M]  c.5.0.0
β”œβ”€β”€ [ 36M]  c.6.0.0
β”œβ”€β”€ [ 36M]  c.7.0.0
β”œβ”€β”€ [ 35M]  c.8.0.0
β”œβ”€β”€ [ 14M]  c.9.0.0
└── [1.5K]  zarr.json
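
As with zarrs_binary2zarr, the converted array can be spot-checked with zarrs_info if it is installed; the shape should match (1209, 480, 480):

# shape should report [1209, 480, 480]
zarrs_info tomoLoRes_nc.zarr shape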