Biomedical Imaging Data Model
This section describes the data model that TileDB-BioImaging uses to represent and store whole slide histopathology images efficiently in TileDB. The Storage Format Spec section covers TileDB-BioImaging’s format, whereas the Key Concepts section provides important background on the internal mechanics of TileDB-BioImaging. By building on TileDB’s core features and a flexible multidimensional array structure, this model enables high-performance storage, retrieval, and analysis of large image datasets.
Data array
TileDB stores whole slide images as dense N-dimensional (ND) arrays. This structure can represent both image and volumetric data, incorporating spatial dimensions, time, and channels:
- X (Space): Represents the width of the image in pixels.
- Y (Space): Represents the height of the image in pixels.
- Z (Space): Represents depth (for volumetric data like 3D scans).
- T (Time): Represents a time dimension (for time-series data).
- C (Channels): Represents color channels (for example, Red, Green, Blue).
Each resolution layer of a whole slide image is represented as a separate ND array. For instance, a standard 2D image with three color channels would be stored as a 3D array (YXC). A time series of 3D volumetric images would be a 5D array (ZYXCT).
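As a rough illustration of how an axes string such as YXC or ZYXCT maps to an array shape, the sketch below derives the shape from per-axis sizes. The helper `shape_for_axes` and the example sizes are hypothetical and are not part of the TileDB-BioImaging API.

```python
# Hypothetical helper (not a TileDB-BioImaging function): derive an ND array
# shape from an axes string made of the letters X, Y, Z, T, and C.
VALID_AXES = set("XYZTC")


def shape_for_axes(axes, sizes):
    """Return the array shape for `axes`, e.g. "YXC" -> (height, width, channels)."""
    unknown = set(axes) - VALID_AXES
    if unknown:
        raise ValueError(f"unsupported axes: {sorted(unknown)}")
    if len(set(axes)) != len(axes):
        raise ValueError("each axis may appear only once")
    return tuple(sizes[axis] for axis in axes)


# Assumed example sizes: width, height, depth planes, time points, channels.
sizes = {"X": 4096, "Y": 2048, "Z": 16, "T": 10, "C": 3}

rgb_2d = shape_for_axes("YXC", sizes)           # 2D image with 3 channels: 3D array
volume_series = shape_for_axes("ZYXCT", sizes)  # time series of 3D volumes: 5D array
print(rgb_2d, volume_series)
```

The number of letters in the axes string equals the number of array dimensions, which is why the YXC image is a 3D array and the ZYXCT series is a 5D array.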
Data representation
Each pixel is represented as an int16 value for each color channel. TileDB stores this data within the cells of the dense array. For grayscale images, the C (channels) dimension reduces to 1.
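To make the per-cell representation concrete, the following sketch estimates the uncompressed size of a single resolution level, assuming one int16 value (2 bytes) per pixel per channel as described above. The function name and the image dimensions are illustrative assumptions, and the figures ignore compression and any storage overhead.

```python
import struct

# Size of one int16 value in bytes ("<h" is a standard-size signed 16-bit integer).
BYTES_PER_VALUE = struct.calcsize("<h")


def uncompressed_bytes(height, width, channels=1):
    """Estimate the raw size of a YXC level: one int16 value per pixel per channel."""
    return height * width * channels * BYTES_PER_VALUE


# Assumed example dimensions: 2048 x 4096 pixels.
rgb_bytes = uncompressed_bytes(2048, 4096, channels=3)  # three color channels
gray_bytes = uncompressed_bytes(2048, 4096)             # grayscale: C has size 1
print(rgb_bytes, gray_bytes)
```

With these assumed dimensions, the grayscale image needs one third of the raw space of the three-channel image, because its C dimension has size 1.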
Advantages of this model
- Multidimensionality: TileDB’s inherent support for multidimensional arrays aligns perfectly with the complex structure of whole slide images, enabling efficient storage and querying of data across all dimensions.
- Flexibility: The model accommodates different image types, including 2D images, 3D volumes, and time series data.
- Scalability: TileDB’s architecture allows for efficient storage and processing of extremely large datasets, essential for handling high-resolution, whole slide images.
- Interoperability: This data model, together with TileDB’s multidimensional nature, allows the definition of extra dimensions like Space, Time, and Channels, which makes it possible to adopt standard formats like the Open Microscopy Environment Next Generation File Format (OME-NGFF). OME-NGFF is a modern, cloud-optimized file format designed for storing and sharing large, complex bioimaging datasets. Using this format provides interoperability with a growing ecosystem of tools and enables collaboration within the scientific community.
Key features and considerations
- Axes Order Flexibility: Using the API, you can preserve the original axes order from the source image or store the data in a canonical form. This offers flexibility and compatibility with different workflows.
- Resolution Layers: Whole slide images are often stored in a pyramidal format with multiple resolution levels. Each level is represented as a separate ND array, allowing for efficient access to different levels of detail.
- Pyramidal Image Grouping: Each pyramidal image with its multiple resolution levels is represented as a group of arrays within TileDB. This allows for organized storage and management of related data.
- Group Metadata: Metadata about the entire image, such as patient information, staining technique, and overall image dimensions, are stored at the group level.
- Level Metadata: Each resolution level within the group can have its own associated metadata, such as the downsampling factor and specific image dimensions for that level.
- Metadata: TileDB lets you store rich metadata alongside the image data, including patient information, staining techniques, and acquisition parameters. This metadata is crucial for interpreting and analyzing the data correctly.
- Compression: TileDB offers a variety of compression filters to reduce storage size and optimize I/O performance.
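The grouping of resolution levels and the split between group-level and level-level metadata can be sketched with plain Python data structures. This is only a conceptual model: the function `build_pyramid`, the array names, the metadata keys, and the rounding rule for level dimensions are all assumptions made for illustration, not the on-disk layout or API of TileDB-BioImaging.

```python
import math


def build_pyramid(name, base_height, base_width, channels,
                  downsample_factors, group_metadata):
    """Model a pyramidal image as a group of per-level ND array descriptions."""
    levels = []
    for index, factor in enumerate(downsample_factors):
        levels.append({
            "array_name": f"level_{index}",  # hypothetical naming scheme
            "axes": "YXC",
            # Assumption: level dimensions are the base dimensions divided by
            # the downsampling factor, rounded up.
            "shape": (
                math.ceil(base_height / factor),
                math.ceil(base_width / factor),
                channels,
            ),
            # Level metadata: values specific to this resolution level.
            "metadata": {"level": index, "downsample": factor},
        })
    # Group metadata: values that describe the image as a whole.
    return {"group": name, "metadata": dict(group_metadata), "levels": levels}


pyramid = build_pyramid(
    name="slide_001",                      # hypothetical group name
    base_height=2048,
    base_width=4096,
    channels=3,
    downsample_factors=[1, 4, 16],
    group_metadata={"stain": "H&E"},       # hypothetical metadata key and value
)

for level in pyramid["levels"]:
    print(level["array_name"], level["shape"], level["metadata"])
```

Keeping each level as its own array description mirrors the idea that a reader can open only the resolution it needs, while metadata shared by all levels lives once at the group level.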
Benefits for digital pathology
This data model, combined with TileDB’s capabilities, provides a powerful solution for managing and analyzing whole slide histopathology images. It enables:
- Efficient Storage: Handle massive image datasets with optimal storage use.
- Fast Access: Quickly retrieve specific regions or resolutions of images for analysis.
- Advanced Analysis: Perform complex queries and computations on the data, enabling new insights and discoveries.
- Collaboration: Share data with colleagues and researchers, enabling collaboration and accelerating research progress.
This data model empowers researchers in digital pathology to efficiently manage, analyze, and share whole slide images, ultimately contributing to advancements in disease diagnosis and treatment.