Data Type Extensions

According to the Zarr V3 specification:

A data type defines the set of possible values that an array may contain. For example, the 32-bit signed integer data type defines binary representations for all integers in the range −2,147,483,648 to 2,147,483,647.

The specification defines a limited set of data types, but additional data types can be defined as extensions.

zarrs supports a number of extension data types, many of which are registered in the zarr-extensions repository. This chapter explains how to create custom data types with a guided walkthrough.

Example: The uint4 Data Type

The uint4 data type is registered at the zarr-extensions repository. The specification can be read here:

In summary, it defines a 4-bit unsigned integer in the range [0, 15] that is supported by the bytes and packbits codecs.

The DataTypeUint4 Struct

The uint4 data type has no configuration, so it can be represented by a unit struct:

#![allow(unused)]
fn main() {
/// The `uint4` data type.
#[derive(Debug)]
struct DataTypeUint4;
}

Implementing DataTypeExtension

To be used as a data type extension, DataTypeUint4 must implement the DataTypeExtension trait. This trait defines properties of the data type such as its metadata (name and configuration), its size, and the conversion between fill values and fill value metadata. It also has codec-related methods, detailed shortly. The DataTypeUint4Element used in these definitions is defined later on this page.

#![allow(unused)]
fn main() {
/// A unique identifier for `uint4` data type.
const UINT4: &'static str = "uint4";

impl DataTypeExtension for DataTypeUint4 {
    fn name(&self) -> String {
        UINT4.to_string()
    }

    fn configuration(&self) -> Configuration {
        Configuration::default()
    }

    fn fill_value(
        &self,
        fill_value_metadata: &FillValueMetadataV3,
    ) -> Result<FillValue, DataTypeFillValueMetadataError> {
        let err = || DataTypeFillValueMetadataError::new(self.name(), fill_value_metadata.clone());
        let element_metadata: u64 = fill_value_metadata.as_u64().ok_or_else(err)?;
        // Values outside [0, 15] are rejected with the same error.
        let element = DataTypeUint4Element::try_from(element_metadata).map_err(|_| err())?;
        Ok(FillValue::new(element.to_ne_bytes().to_vec()))
    }

    fn metadata_fill_value(
        &self,
        fill_value: &FillValue,
    ) -> Result<FillValueMetadataV3, DataTypeFillValueError> {
        let element = DataTypeUint4Element::from_ne_bytes(
            fill_value
                .as_ne_bytes()
                .try_into()
                .map_err(|_| DataTypeFillValueError::new(self.name(), fill_value.clone()))?,
        );
        Ok(FillValueMetadataV3::from(element.as_u8()))
    }

    fn size(&self) -> zarrs::array::DataTypeSize {
        DataTypeSize::Fixed(1)
    }
    
    ...
}
}

Implementing DataTypeExtensionBytesCodec

Supporting the bytes codec is straightforward for the uint4 data type. Each element is already held in memory as a single byte, so encoding and decoding pass the bytes through unmodified and the endianness is ignored.

#![allow(unused)]
fn main() {
impl DataTypeExtensionBytesCodec for DataTypeUint4 {
    fn encode<'a>(
        &self,
        bytes: std::borrow::Cow<'a, [u8]>,
        _endianness: Option<zarrs_metadata::Endianness>,
    ) -> Result<std::borrow::Cow<'a, [u8]>, DataTypeExtensionBytesCodecError> {
        Ok(bytes)
    }

    fn decode<'a>(
        &self,
        bytes: std::borrow::Cow<'a, [u8]>,
        _endianness: Option<zarrs_metadata::Endianness>,
    ) -> Result<std::borrow::Cow<'a, [u8]>, DataTypeExtensionBytesCodecError> {
        Ok(bytes)
    }
}
}

The default implementation of DataTypeExtension::codec_bytes must be overridden to return Ok(self):

#![allow(unused)]
fn main() {
impl DataTypeExtension for DataTypeUint4 {
    ...
    
    fn codec_bytes(&self) -> Result<&dyn DataTypeExtensionBytesCodec, DataTypeExtensionError> {
        Ok(self)
    }
}
}

Implementing DataTypeExtensionPackBitsCodec

The uint4 data type supports the packbits codec as a 4-bit value. This can be supported by implementing the DataTypeExtensionPackBitsCodec trait.

#![allow(unused)]
fn main() {
impl DataTypeExtensionPackBitsCodec for DataTypeUint4 {
    fn component_size_bits(&self) -> u64 {
        4
    }

    fn num_components(&self) -> u64 {
        1
    }

    fn sign_extension(&self) -> bool {
        false
    }
}
}

In this case, the trait methods signify that the data type:

  • has 1 component,
  • has a component size of 4 bits, and
  • is unsigned, so it does not need sign extension.

The default implementation of DataTypeExtension::codec_packbits must be overridden to return Ok(self):

#![allow(unused)]
fn main() {
impl DataTypeExtension for DataTypeUint4 {
    ...
    
    fn codec_packbits(
        &self,
    ) -> Result<&dyn DataTypeExtensionPackBitsCodec, DataTypeExtensionError> {
        Ok(self)
    }
}
}
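
To make "1 component of 4 bits" concrete, the following is a minimal standalone sketch of packing 4-bit values two per byte. It is not the zarrs packbits implementation: the function names pack_uint4 and unpack_uint4 are hypothetical, and the bit order (first element in the low nibble, zero padding for a trailing odd element) is an assumption that should be checked against the packbits codec specification.

```rust
/// Pack 4-bit values two per byte, with the first value in the low nibble.
/// Illustrative only: the bit order is an assumption, not taken from zarrs.
fn pack_uint4(values: &[u8]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let low = pair[0] & 0x0F;
            // A trailing odd element leaves the high nibble as zero padding.
            let high = pair.get(1).map_or(0, |v| v & 0x0F);
            low | (high << 4)
        })
        .collect()
}

/// Inverse of `pack_uint4`; `len` is the number of elements to recover.
fn unpack_uint4(packed: &[u8], len: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&byte| [byte & 0x0F, byte >> 4])
        .take(len)
        .collect()
}
```

Under these assumptions, three elements occupy two bytes instead of the three bytes used by the in-memory representation.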

Registering the uint4 Data Type

A data type must be registered as a DataTypePlugin to be used in an Array.

#![allow(unused)]
fn main() {
// Register the data type so that it can be recognised when opening arrays.
inventory::submit! {
    DataTypePlugin::new(UINT4, is_uint4_dtype, create_uint4_dtype)
}

fn is_uint4_dtype(name: &str) -> bool {
    name == UINT4
}

fn create_uint4_dtype(
    metadata: &MetadataV3,
) -> Result<Arc<dyn DataTypeExtension>, PluginCreateError> {
    if metadata.configuration_is_none_or_empty() {
        Ok(Arc::new(DataTypeUint4))
    } else {
        Err(PluginMetadataInvalidError::new(UINT4, "data_type", metadata.to_string()).into())
    }
}
}

The DataTypeUint4Element Struct

The most suitable in-memory representation of a uint4 data type element is a u8.

#![allow(unused)]
fn main() {
/// The in-memory representation of the `uint4` data type.
#[derive(Deserialize, Clone, Copy, Debug, PartialEq)]
struct DataTypeUint4Element(u8);
}

A data type element must implement the Element trait to be used in Array::store_*_as_elements methods.

#![allow(unused)]
fn main() {
/// This defines how an in-memory DataTypeUint4Element is converted into ArrayBytes before encoding via the codec pipeline.
impl Element for DataTypeUint4Element {
    fn validate_data_type(data_type: &DataType) -> Result<(), ArrayError> {
        (data_type == &DataType::Extension(Arc::new(DataTypeUint4)))
            .then_some(())
            .ok_or(ArrayError::IncompatibleElementType)
    }

    fn into_array_bytes<'a>(
        data_type: &DataType,
        elements: &'a [Self],
    ) -> Result<zarrs::array::ArrayBytes<'a>, ArrayError> {
        Self::validate_data_type(data_type)?;
        // Maybe this could be a transmute instead &[DataTypeUint4Element(u8)] -> Cow::Borrowed(&[u8])
        let mut bytes: Vec<u8> =
            Vec::with_capacity(elements.len() * size_of::<DataTypeUint4Element>());
        for element in elements {
            bytes.push(element.0);
        }
        Ok(ArrayBytes::Fixed(Cow::Owned(bytes)))
    }
}
}
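
The comment in into_array_bytes wonders whether the copy could be avoided. One possible approach is sketched below, assuming the element type is marked #[repr(transparent)], which the definition on this page does not do. Uint4Repr and elements_as_bytes are hypothetical names, redeclared here so the snippet compiles on its own.

```rust
use std::borrow::Cow;

/// Stand-in for `DataTypeUint4Element`. `#[repr(transparent)]` is an added
/// assumption: it guarantees the same layout as the wrapped `u8`.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Uint4Repr(u8);

/// Borrow a slice of elements as raw bytes without copying.
fn elements_as_bytes(elements: &[Uint4Repr]) -> Cow<'_, [u8]> {
    // SAFETY: `Uint4Repr` is `repr(transparent)` over `u8`, so the slice has
    // the same length, alignment, and validity as a `[u8]` of equal length.
    let bytes = unsafe {
        std::slice::from_raw_parts(elements.as_ptr().cast::<u8>(), elements.len())
    };
    Cow::Borrowed(bytes)
}
```

The trade-off is an unsafe block in exchange for skipping one allocation per call; the copying loop remains the simpler choice when that cost does not matter.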

A data type element must implement the ElementOwned trait to be used in Array::retrieve_*_as_elements methods.

#![allow(unused)]
fn main() {
/// This defines how ArrayBytes are converted into a DataTypeUint4Element after decoding via the codec pipeline.
impl ElementOwned for DataTypeUint4Element {
    fn from_array_bytes(
        data_type: &DataType,
        bytes: ArrayBytes<'_>,
    ) -> Result<Vec<Self>, ArrayError> {
        Self::validate_data_type(data_type)?;
        let bytes = bytes.into_fixed()?;
        let bytes_len = bytes.len();
        let mut elements = Vec::with_capacity(bytes_len / size_of::<DataTypeUint4Element>());
        for byte in bytes.iter() {
            // TODO: Should not construct DataTypeUint4Element this way as it could represent a
            // value outside of [0, 15]. Set upper bits in the byte to 0?
            elements.push(DataTypeUint4Element(*byte))
        }
        Ok(elements)
    }
}
}

Some non-essential utility methods were defined for DataTypeUint4Element and used in the snippets above:

#![allow(unused)]
fn main() {
impl DataTypeUint4Element {
    fn to_ne_bytes(&self) -> [u8; 1] {
        [self.0]
    }

    fn from_ne_bytes(bytes: &[u8; 1]) -> Self {
        Self(bytes[0])
    }

    fn as_u8(&self) -> u8 {
        self.0
    }
}
}
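
The fill_value implementation also calls DataTypeUint4Element::try_from on a u64, which is not shown on this page. A minimal sketch of such a conversion follows. The struct is redeclared without its serde derive so the snippet compiles alone, and the error type Uint4OutOfRange and the helper from_low_bits are hypothetical names. The helper illustrates the masking idea raised in the TODO of from_array_bytes; it is one option, not a confirmed design.

```rust
/// Redeclared here (without the serde derive) so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq)]
struct DataTypeUint4Element(u8);

/// Hypothetical error type carrying the rejected value.
#[derive(Debug, PartialEq)]
struct Uint4OutOfRange(u64);

impl TryFrom<u64> for DataTypeUint4Element {
    type Error = Uint4OutOfRange;

    /// Accept only values in the range [0, 15].
    fn try_from(value: u64) -> Result<Self, Self::Error> {
        if value <= 0x0F {
            Ok(Self(value as u8))
        } else {
            Err(Uint4OutOfRange(value))
        }
    }
}

impl DataTypeUint4Element {
    /// Hypothetical helper: keep only the low 4 bits of a decoded byte,
    /// so the stored value can never fall outside [0, 15].
    fn from_low_bits(byte: u8) -> Self {
        Self(byte & 0x0F)
    }
}
```

Masking silently discards the upper bits, whereas a fallible conversion reports them; which behaviour is preferable on decode depends on whether out-of-range bytes should be treated as corrupt data.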

More Examples

The zarrs repository includes multiple custom data type examples:

Contributing New Data Types to zarrs

The zarr-extensions repository is always growing with new Zarr extensions. The conformance of zarrs to zarr-extensions is tracked in this issue:

Contributions that add support for additional data types are welcome. With a little polish, the uint4 example above could be included in zarrs itself (if it isn't already)!