Data Layout

Learn about the TileDB data layout in this section, including the cell global order, the columnar format, data tiles, and fill values.
Note

It is strongly recommended to read the following sections before you learn about the data layout.

  • Key Concepts: Dimensions
  • Key Concepts: Attributes
  • Key Concepts: Domain
  • Key Concepts: Cells
  • Key Concepts: Tiles

The data layout dictates how the coordinates and values of multi-dimensional cells are serialized into the inherently one-dimensional storage medium, across the various files that TileDB creates.

This section covers the following topics:

  • Cell global order: This is the way cells are mapped into a unique, 1-dimensional order.
  • Columnar format: The values of each attribute, the coordinates in sparse arrays and the offsets of variable-length attributes are all stored in separate files.
  • Data tiles: The groups of non-empty cells, which serve as the atomic unit of I/O and compression.
  • Fill values: How TileDB stores fill values for partially populated tiles in dense arrays.

Cell global order

TileDB handles the global cell order differently for dense and sparse arrays.

Dense arrays

In dense arrays, the global cell order is determined by three parameters, all specified by the user upon array creation:

  1. Tile extent per dimension (which partitions each dimension into equal segments)
  2. Tile order (can be either row-major or column-major)
  3. Cell order (can be either row-major or column-major)
Note

In arrays with more than two dimensions:

  • row-major means that the fastest-running index in the order is the last dimension (e.g., [0,0,0], [0,0,1], [0,0,2], etc.).
  • column-major means that the fastest-running index in the order is the first dimension (e.g., [0,0,0], [1,0,0], [2,0,0], etc.).

The figure below shows examples of how different choices of the above parameters lead to different global orders in dense arrays.

Example global cell orders in dense arrays
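The three parameters above can be sketched in plain Python to see how they combine into a single one-dimensional order. This is an illustrative sketch only, not TileDB's implementation: the function names `dense_global_order` and `_iterate` and the order strings are assumptions made for this example, and tiles are simply clipped at the domain boundary for brevity.

```python
from itertools import product

def _iterate(ranges, order):
    """Yield index tuples over per-dimension ranges in row- or column-major order."""
    ndim = len(ranges)
    if order == "row-major":
        axes = list(range(ndim))            # last dimension runs fastest
    elif order == "col-major":
        axes = list(reversed(range(ndim)))  # first dimension runs fastest
    else:
        raise ValueError(f"unknown order: {order!r}")
    # product() varies its last argument fastest, so axes are listed slowest-first.
    for combo in product(*(ranges[a] for a in axes)):
        index = [0] * ndim
        for axis, value in zip(axes, combo):
            index[axis] = value
        yield tuple(index)

def dense_global_order(domain, extents, tile_order, cell_order):
    """Return all cell coordinates of a dense array in global order.

    domain  : list of inclusive (lo, hi) bounds, one per dimension
    extents : tile extent per dimension
    """
    # Number of tiles along each dimension (rounded up).
    tile_ranges = [range(-(-(hi - lo + 1) // e)) for (lo, hi), e in zip(domain, extents)]
    cells = []
    for tile in _iterate(tile_ranges, tile_order):      # walk tiles in tile order
        cell_ranges = [
            range(lo + t * e, min(lo + (t + 1) * e, hi + 1))
            for (lo, hi), e, t in zip(domain, extents, tile)
        ]
        cells.extend(_iterate(cell_ranges, cell_order))  # walk cells inside the tile
    return cells

# A 4x4 array with 2x2 tiles, row-major tile order and row-major cell order.
order = dense_global_order([(1, 4), (1, 4)], [2, 2], "row-major", "row-major")
print(order[:8])
```

Changing only `tile_order` or only `cell_order` in the call reorders the same 16 cells, which is the effect the figure illustrates.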

Sparse arrays

In sparse arrays, the global cell order is determined in one of two ways, chosen by the user upon array creation:

  1. With the same three parameters as in dense arrays (tile extents, tile order, and cell order), or
  2. With a Hilbert space-filling curve. In this case, the user only needs to specify the domain of each dimension and set the cell order to hilbert; the tile extents and tile order have no effect. Note that the Hilbert curve is computed by quantizing each dimension domain and is therefore strongly affected by it (e.g., if the domain is too small, all cells may map to the same Hilbert value and, hence, cell locality will be destroyed).

The figure below shows examples of how different choices of the above parameters lead to different global orders in sparse arrays.

Example global cell orders in sparse arrays
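To make the Hilbert option more concrete, the sketch below uses the textbook two-dimensional Hilbert index on an n x n grid (n a power of two) and a simple linear quantization step. Both are assumptions for illustration: `hilbert_index` and `quantize` are hypothetical helper names, and TileDB's internal quantization and bit width may differ.

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) on an n x n grid (n a power of two) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def quantize(value, lo, hi, n):
    """Map a coordinate in the domain [lo, hi] to one of n buckets (hypothetical scheme)."""
    bucket = int((value - lo) / (hi - lo) * n)
    return min(n - 1, bucket)

# Sort the 16 cells of a 4x4 grid along the curve.
curve = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: hilbert_index(4, *p))
print(curve)

# Two distinct coordinates that land in the same bucket share one Hilbert value,
# so the curve can no longer tell them apart.
same_bucket = quantize(1.0, 0.0, 1000.0, 4) == quantize(2.0, 0.0, 1000.0, 4)
print(same_bucket)
```

Consecutive cells along `curve` are always grid neighbors, which is the locality property that makes the Hilbert order attractive. The last two lines show why the quantization matters: coordinates that fall into the same bucket receive the same Hilbert value.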

Columnar format

TileDB implements the so-called columnar format (adopted from analytical databases). This means that the cell values of each attribute are stored in a separate file. The same is true for the cell coordinates in sparse arrays, as well as the offsets of variable-length attributes. It is important to stress that the values in all of these files follow the same global cell order. The figures below show examples of various columnar layouts in dense and sparse arrays.

Example columnar layout in dense arrays

Example columnar layout in sparse arrays

Example columnar layout of variable-length attributes (dense case, but sparse is similar)
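The columnar split described above can be sketched as follows. This is a minimal illustration, not TileDB's on-disk format: `to_columnar` and the field names `d1`, `d2`, `a1`, and `a2` are hypothetical, Python lists and bytes stand in for files, and offsets are assumed to be byte positions of each cell's first value.

```python
def to_columnar(cells, fixed_fields, var_fields):
    """Split cells (already sorted in global order) into one buffer per field.

    Fixed-size fields (dimensions, fixed-length attributes) get one value list each.
    Variable-length string fields get a byte buffer plus a list of start offsets.
    """
    columns = {name: [cell[name] for cell in cells] for name in fixed_fields}
    for name in var_fields:
        offsets, data = [], bytearray()
        for cell in cells:
            offsets.append(len(data))              # where this cell's value starts
            data.extend(cell[name].encode("utf-8"))
        columns[name + "_offsets"] = offsets
        columns[name + "_data"] = bytes(data)
    return columns

# Three sparse cells, listed in global order.
cells = [
    {"d1": 1, "d2": 1, "a1": 10, "a2": "a"},
    {"d1": 1, "d2": 4, "a1": 20, "a2": "bb"},
    {"d1": 3, "d2": 2, "a1": 30, "a2": "ccc"},
]
columns = to_columnar(cells, fixed_fields=["d1", "d2", "a1"], var_fields=["a2"])
for name, buffer in columns.items():
    print(name, buffer)
```

Position i in every buffer refers to the same cell, which is why all buffers must share one global order.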

Data tiles

A data tile is the atomic unit of compression and I/O. Data tiles in dense and sparse arrays are different:

  • In dense arrays, the data tile is the same as the space tile (i.e., the tile defined by the tile extents on each dimension): every cell of the space tile is materialized, so the data tile contains all of its cells.
  • In sparse arrays, the space tile is not the same as the data tile, because a space tile may contain empty cells that TileDB does not materialize. If the space and data tiles were the same, then there could be scenarios where one tile contains one cell and another contains millions, leading to I/O and compression imbalance. To mitigate this, sparse arrays receive an extra parameter upon creation, called capacity, which is the fixed number of non-empty cells that a data tile should contain. To determine which cells belong to which data tile, TileDB follows the global cell order and packs the non-empty cells into groups of size equal to the specified capacity.

In both the above cases, a logical data tile encompasses only logical non-empty cells, whereas a physical data tile corresponds to the non-empty cell values on a specific attribute, dimension (for sparse coordinates) or offset file (for variable-length attributes).

The following figure shows examples of space and data tiles in dense and sparse arrays.

An illustration of a dense array fragment and a sparse array fragment. In the dense array, the space tiles are the same as the data tiles, whereas in the sparse array, the space tiles do not always overlap with the data tiles.
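The capacity-based packing for sparse arrays can be sketched in a few lines. This is an illustrative sketch under assumptions: `pack_data_tiles` and `space_tile_of` are hypothetical helpers, the example uses a 4x4 domain starting at 1 with 2x2 tile extents, and the last data tile is simply allowed to hold fewer cells than the capacity.

```python
from collections import Counter

def pack_data_tiles(sorted_cells, capacity):
    """Group non-empty cells (already in global order) into data tiles of `capacity` cells."""
    if capacity < 1:
        raise ValueError("capacity must be a positive integer")
    return [sorted_cells[i:i + capacity] for i in range(0, len(sorted_cells), capacity)]

def space_tile_of(coord, domain_lo=1, extent=2):
    """Return the index of the space tile that contains a 2D coordinate."""
    return tuple((c - domain_lo) // extent for c in coord)

# Non-empty cells of a sparse 4x4 array, listed in row-major tile and cell order.
cells = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 4), (4, 3)]

per_space_tile = Counter(space_tile_of(c) for c in cells)
data_tiles = pack_data_tiles(cells, capacity=2)

print(per_space_tile)  # cells per space tile are unbalanced
print(data_tiles)      # data tiles hold the same number of cells
```

In this example one space tile holds four cells while two others hold one each, yet every data tile holds exactly two cells, and the last data tile spans two different space tiles.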

Fill values

Fill values are applicable only to dense arrays, and occur when data tiles are partially written. This happens in three cases:

  1. The user incrementally populates the array and writes a subarray such that it partially intersects certain tiles.
  2. The array domain does not contain a whole number of tiles (i.e., the domain size along some dimension is not a multiple of the tile extent).
  3. The query subarray intersects empty tiles (or tiles with fill values).

In these cases, TileDB fills the partial tiles or the query result with special fill values. You can read about the default fill values in the section Key Concepts: Tiles. Practically, a fill value represents a non-empty cell in a dense array.

The following figure shows examples of the above three scenarios, where TileDB uses fill values.

An illustration of the three scenarios where fill values occur in dense arrays.
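The three scenarios can be mimicked with a small in-memory model of a two-dimensional dense array. This is a sketch under assumptions, not TileDB's storage logic: `materialize_tiles` and `read_cell` are hypothetical helpers, tiles are nested Python lists, and `FILL = -1` is an arbitrary stand-in for the datatype-dependent default fill value.

```python
FILL = -1  # hypothetical fill value, chosen only for this illustration

def materialize_tiles(domain, extents, written, fill=FILL):
    """Build full tiles for every tile touched by a write, padding with `fill`.

    domain  : ((row_lo, row_hi), (col_lo, col_hi)), inclusive bounds
    extents : (rows_per_tile, cols_per_tile)
    written : dict mapping (row, col) to the written value
    """
    (rlo, _), (clo, _) = domain
    er, ec = extents
    tiles = {}
    for (r, c), value in written.items():
        tile_index = ((r - rlo) // er, (c - clo) // ec)
        tile = tiles.setdefault(tile_index, [[fill] * ec for _ in range(er)])
        tile[(r - rlo) % er][(c - clo) % ec] = value
    return tiles

def read_cell(tiles, domain, extents, r, c, fill=FILL):
    """Read one cell; cells in tiles that were never written come back as `fill`."""
    (rlo, _), (clo, _) = domain
    er, ec = extents
    tile = tiles.get(((r - rlo) // er, (c - clo) // ec))
    return fill if tile is None else tile[(r - rlo) % er][(c - clo) % ec]

# Scenario 1: a write to rows 2-3, columns 2-3 partially intersects four 2x2 tiles.
domain, extents = ((1, 4), (1, 4)), (2, 2)
partial = materialize_tiles(domain, extents, {(2, 2): 1, (2, 3): 2, (3, 2): 3, (3, 3): 4})
print(partial[(0, 0)])

# Scenario 2: a 3x3 domain with 2x2 tiles; the last tile extends past the domain.
small_domain = ((1, 3), (1, 3))
full = {(r, c): 10 * r + c for r in range(1, 4) for c in range(1, 4)}
padded = materialize_tiles(small_domain, extents, full)
print(padded[(1, 1)])

# Scenario 3: reading a cell from a tile that was never written.
sparse_write = materialize_tiles(domain, extents, {(1, 1): 7})
print(read_cell(sparse_write, domain, extents, 4, 4))
```

In each scenario the cells that were never written are represented by the fill value, either inside a partially written tile or directly in the query result.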
