On this page

  • Array storage
    • Dense vs. sparse arrays
    • Space tile extent
    • Tile capacity
    • Dimensions vs. attributes
    • Tile filtering
    • Duplicate cells for sparse arrays
  • Queries
    • Number of array fragments
    • Attribute subsetting
    • Read layout
    • Write layout
    • Open and close arrays
  • Configuration parameters
    • Cache size
    • Coordinate deduplication
    • Coordinate out-of-bounds check
    • Coordinate global order check
    • Consolidation parameters
    • Memory budget
    • Concurrency
    • VFS read batching
    • Cloud object store parallelism
    • Cloud object store write size
    • Python multiprocessing
  • System parameters
    • Hardware parallelism
    • Storage backends

Summary of Factors

A condensed list of the most important performance factors of TileDB.

This page summarizes the most important performance factors of TileDB, organized into the following broad topics:

  • Array storage
  • Queries
  • Configuration parameters
  • System parameters

Array storage

You must consider the following factors when designing the array schema. In general, designing an effective, high-performance schema depends on the type of data the array will store and the types of queries users will run on it.

Dense vs. sparse arrays

Choosing a dense or sparse array representation depends on the particular data and use case. Sparse arrays in TileDB incur extra overhead relative to dense arrays, because TileDB must store the explicit coordinates of the non-empty cells. Depending on your use case, the storage savings from not materializing fill values for empty cells (as a dense array would) may outweigh that overhead, in which case a sparse array is the better choice. Sparse reads also carry extra runtime overhead, because TileDB may need to sort the cells by their coordinates.

For more information, visit Dense vs. Sparse.

Space tile extent

In dense arrays, the space tile shape and size determine the atomic unit of I/O and compression. TileDB reads from disk, in their entirety, all space tiles that intersect a read query’s subarray, even those that only partially intersect it. Matching the space tile shape and size to the typical read query can greatly improve performance. In sparse arrays, space tiles affect the order of the non-empty cells and, thus, the shape and size of each minimum bounding rectangle (MBR).

In sparse arrays, TileDB reads from disk all tiles whose MBRs intersect a read query’s subarray. Choosing space tiles so that the resulting MBRs resemble the shape of the typical read query can improve read performance, since fewer MBRs will then overlap with any given query.
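
Below is a minimal TileDB-Py sketch of matching space tile extents to a typical read shape; the array name, domain, and extents are hypothetical.

```python
import numpy as np
import tiledb

# Hypothetical 2D dense array where typical reads fetch 100x100 regions,
# so each space tile extent is set to 100 to match that shape.
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 9999), tile=100, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(0, 9999), tile=100, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
)
tiledb.Array.create("dense_example", schema)
```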

For more information, visit Tiling and Data Layout.

Tile capacity

The tile capacity determines the number of cells in a data tile of a sparse fragment. A data tile, as you may recall, is the atomic unit of I/O and compression. If the capacity is too small, the data tiles may also be too small, requiring many small, independent read operations to fulfill a particular read query. Moreover, very small tiles compress poorly. If the capacity is too large, I/O operations become fewer and larger (generally a good thing), but the amount of wasted work may increase, since TileDB reads more data than necessary to fulfill the typical read query.
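
As a rough sketch (names and values are illustrative), the capacity is set at schema creation time in TileDB-Py:

```python
import numpy as np
import tiledb

# Hypothetical sparse array; capacity caps the number of cells per data
# tile, the atomic unit of I/O and compression for sparse fragments.
dom = tiledb.Domain(
    tiledb.Dim(name="x", domain=(0, 10**8), tile=10**6, dtype=np.int64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    capacity=100_000,  # tune to the typical read query's result size
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
)
```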

For more information, visit Tiling and Data Layout.

Dimensions vs. attributes

If read queries typically access the entire domain of a particular dimension, consider making that dimension an attribute instead. Use dimensions for fields on which read queries select ranges of cells, so that TileDB can internally prune cell ranges or whole tiles from the data it must read.
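
For instance, in the hypothetical schema below, queries slice ranges of "timestamp" (so it is a dimension TileDB can prune on), while "reading" is always returned in full (so it is an attribute):

```python
import numpy as np
import tiledb

# "timestamp" is queried by range, so it is a dimension; "reading" is
# fetched in full for each query, so it is an attribute.
dom = tiledb.Domain(
    tiledb.Dim(name="timestamp", domain=(0, 2**62), tile=86_400, dtype=np.int64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="reading", dtype=np.float64)],
)
```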

For more information, visit Dimensions vs. Attributes.

Tile filtering

Tile filtering (such as compression) reduces persistent storage consumption, at the expense of extra time to compress tiles on writes and decompress them on reads. On the other hand, smaller compressed tiles take less time to read from disk, which can improve performance. Additionally, many compressors offer different compression levels, which you can use to fine-tune this tradeoff.
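
A short TileDB-Py sketch of attaching a compression filter to an attribute (the attribute name and compression level are illustrative):

```python
import numpy as np
import tiledb

# Zstd-compress the attribute's tiles; the level argument trades
# compression ratio against (de)compression speed.
filters = tiledb.FilterList([tiledb.ZstdFilter(level=7)])
attr = tiledb.Attr(name="a", dtype=np.float64, filters=filters)
```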

Visit Compression for more recommendations related to compression.

Duplicate cells for sparse arrays

For sparse arrays, if you know the data contains no entries with identical coordinates, or if you don’t mind getting more than one result for the same coordinates, allowing duplicates can yield significant performance improvements. For queries that request results in the unordered layout, TileDB can process fragments in any order (and fewer of them at once) and skip deduplication of the results, so such queries run much faster and require less memory.
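
In TileDB-Py, you opt into duplicates at schema creation time; a minimal sketch with hypothetical names:

```python
import numpy as np
import tiledb

# Permit duplicate coordinates so reads can skip deduplication work.
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="x", domain=(0, 10**6), tile=10_000, dtype=np.int64)
    ),
    sparse=True,
    allows_duplicates=True,
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
)
```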

Queries

The following factors impact query performance (that is, when reading data from or writing data to an array).

Number of array fragments

Large numbers of fragments can slow down read queries, as TileDB must fetch a large amount of fragment metadata, then internally sort the cell data and resolve potential overlap across all fragments. Thus, consolidating arrays with many fragments improves performance, since consolidation collapses a set of fragments into a single one.
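
In TileDB-Py, consolidation and the follow-up cleanup are one call each ("my_array" is a placeholder URI):

```python
import tiledb

# Merge small fragments into one, then remove the now-redundant originals.
tiledb.consolidate("my_array")
tiledb.vacuum("my_array")
```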

Attribute subsetting

A benefit of the column-oriented nature of attribute storage in TileDB is that, if a read query does not require data for some attributes in the array, TileDB won’t touch or read those attributes from disk at all. Thus, specifying only the attributes you need for each read query improves performance. In write queries, you must still specify values for all attributes.
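
A minimal TileDB-Py sketch, assuming a hypothetical array with an attribute named "a":

```python
import tiledb

# Only attribute "a" is read from storage; all other attributes are skipped.
with tiledb.open("my_array") as A:
    data = A.query(attrs=["a"])[0:100]
```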

Read layout

When you use an ordered layout (row-major or column-major) for a read query, and that layout differs from the array’s physical layout, TileDB must internally sort and reorganize the cell data it reads from the array into the requested order. If you don’t care about the particular ordering of cells in the result buffers, you can specify a global-order query layout, allowing TileDB to skip the sorting and reorganization steps.
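
In TileDB-Py, the read layout is set on the query object; a sketch with a placeholder URI:

```python
import tiledb

# order="G" requests global order, letting TileDB skip the sorting and
# reorganization it performs for row-major ("C") or column-major ("F") reads.
with tiledb.open("my_array") as A:
    data = A.query(order="G")[:]
```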

Write layout

Similarly, if your specified write layout differs from the physical layout of your array (that is, the global cell order), TileDB must internally sort and reorganize the cell data into global order before writing to the array. If your cells are already arranged in global order, issuing the write query with a global-order layout lets TileDB skip that sorting and reorganization step. Additionally, many consecutive global-order writes to an open array append to a single fragment rather than creating a new fragment per write, which can improve later read performance. Take care with global-order writes, however: it is an error to write to a subarray that is not aligned with the space tiling of the array.

Open and close arrays

Before issuing queries, you must first “open” an array. When you open an array, TileDB reads and decodes the array and fragment data, which may require disk operations, decompression, or both. You can open an array once and issue many queries to the array before closing. Minimizing the number of array open and close operations can improve performance.

That said, when reading from an array, TileDB fetches the metadata of relevant fragments from the backend and caches it in memory. The more queries you perform, the more metadata TileDB fetches and retains over time. To reclaim that memory, you will need to occasionally close and reopen the array.

Some TileDB APIs support closing and reopening the array in a single command, which can be faster than closing and reopening the array in two different steps. For example, TileDB-Py has a reopen() method.
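
A sketch of this pattern in TileDB-Py (the URI and slices are placeholders):

```python
import tiledb

# Open once, issue many queries, and reopen occasionally to release
# fragment metadata accumulated across queries.
A = tiledb.open("my_array")
r1 = A[0:100]
r2 = A[100:200]
A.reopen()  # faster than a separate close() followed by open()
A.close()
```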

Configuration parameters

The following runtime configuration parameters can tune the performance of many internal TileDB subsystems.

Cache size

When TileDB reads a tile from disk during a read query, it places the uncompressed tile in an in-memory LRU cache associated with the query’s context. When future read queries in the same context request that tile, TileDB can copy it directly from the cache instead of re-fetching it from disk. The "sm.tile_cache_size" configuration parameter determines the total size in bytes of this in-memory tile cache. Increasing it can lead to a higher cache hit ratio and better performance.
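
A sketch of raising the tile cache size for a context (the 512 MiB value is illustrative):

```python
import tiledb

cfg = tiledb.Config({"sm.tile_cache_size": str(512 * 1024**2)})  # 512 MiB
ctx = tiledb.Ctx(cfg)
with tiledb.open("my_array", ctx=ctx) as A:  # placeholder URI
    data = A[0:100]
```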

Coordinate deduplication

During sparse writes, if you set "sm.dedup_coords" to true, TileDB internally deduplicates the cells being written, so that it never writes multiple cells with the same coordinate tuple (which is an error). If you leave "sm.dedup_coords" as false, TileDB runs a lighter-weight check, controlled with "sm.check_coord_dups" (true by default), which searches for duplicates and returns an error if it finds any. Disabling these checks can improve write performance.

Coordinate out-of-bounds check

During sparse writes, TileDB internally checks whether the given coordinates fall outside the domain. You can control this check with "sm.check_coord_oob", which is true by default. If you are certain your application never produces out-of-bounds coordinates, you can set this parameter to false to skip the check and boost performance.

Coordinate global order check

During sparse writes in global order, TileDB internally checks whether the given coordinates obey the global order. You can control this check with "sm.check_global_order", which is true by default. If you are certain your writes always arrive in global order, you can set this parameter to false to skip the check and boost performance.
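
A sketch that disables all three write-time coordinate checks described above; only do this if your application guarantees valid, ordered, duplicate-free coordinates:

```python
import tiledb

cfg = tiledb.Config({
    "sm.dedup_coords": "false",
    "sm.check_coord_dups": "false",
    "sm.check_coord_oob": "false",
    "sm.check_global_order": "false",
})
ctx = tiledb.Ctx(cfg)  # use this context for the write queries
```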

Consolidation parameters

You can learn about the effect of the different "sm.consolidation.*" parameters in Consolidation. For a full list of configuration parameters related to consolidation, refer to Configuration: Consolidation and Vacuuming.

Memory budget

The "sm.memory_budget" and "sm.memory_budget_var" parameters control the memory budget, which caps the total number of bytes that TileDB for each fixed-length or variable-length attribute during reads. This can prevent out-of-memory issues when a read query overlaps with many tiles that TileDB must fetch and decompress in memory. For large subarrays, this may lead to incomplete queries.

Concurrency

TileDB controls concurrency with the "sm.compute_concurrency_level" and "sm.io_concurrency_level" parameters. "sm.compute_concurrency_level" is the upper limit on the number of OS threads TileDB will use for compute-bound tasks, whereas "sm.io_concurrency_level" is the upper limit on the number of OS threads TileDB will use for I/O-bound tasks. Both default to the number of logical cores on the system.
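
A sketch combining the memory budget and concurrency parameters from this and the previous subsection (all values are illustrative and should be tuned to the machine and workload):

```python
import tiledb

cfg = tiledb.Config({
    "sm.memory_budget": str(2 * 1024**3),      # fixed-length attributes
    "sm.memory_budget_var": str(4 * 1024**3),  # variable-length attributes
    "sm.compute_concurrency_level": "8",
    "sm.io_concurrency_level": "8",
})
ctx = tiledb.Ctx(cfg)
```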

VFS read batching

During read queries, the VFS batches reads for distinct tiles that are physically close together (but not necessarily adjacent) in the same file. The "vfs.min_batch_size" parameter sets the minimum size in bytes of a single batched read operation. The VFS groups “close by” tiles into the same batch while the batch size remains smaller than or equal to "vfs.min_batch_size". This helps minimize the I/O latency that comes with many very small VFS read operations. Similarly, "vfs.min_batch_gap" defines the minimum number of bytes between the end of one batch and the start of the next. If two batches are fewer bytes apart than "vfs.min_batch_gap", TileDB stitches them together into a single batched read operation. This helps group and parallelize over “adjacent” batch read operations.
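
A sketch with illustrative thresholds; larger values favor fewer, bigger read operations:

```python
import tiledb

cfg = tiledb.Config({
    "vfs.min_batch_size": str(20 * 1024**2),  # 20 MiB minimum batch size
    "vfs.min_batch_gap": str(1024**2),        # stitch batches < 1 MiB apart
})
```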

Cloud object store parallelism

"vfs.<backend>.max_parallel_ops" controls the level of parallelism, with <backend> being any of s3, azure, or gcs, depending on your cloud storage backend. This decides the maximum number of parallel operations for s3://, azure://, or gcs:// URIs independently of the VFS thread pool size, so you can oversubscribe or undersubscribe VFS threads. Oversubscription can sometimes be useful with cloud object stores, to help hide I/O latency.

Cloud object store write size

Replacing "vfs.min_parallel_size" for cloud storage objects, the "vfs.<backend>.multipart_part_size" parameter controls the minimum part size of cloud object storage multi-part writes, with <backend> being any of s3, azure, or gcs, depending on your cloud storage backend.

Note that TileDB buffers \(S \times O\) bytes in memory before submitting a write request, where \(S\) is the value of "vfs.<backend>.multipart_part_size" and \(O\) is the value of "vfs.<backend>.max_parallel_ops". Once this buffer fills, TileDB issues the multipart write in parallel.
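
For example, with illustrative S3 settings of \(S\) = 50 MiB and \(O\) = 16, TileDB buffers \(S \times O\) = 800 MiB before issuing the multipart write:

```python
import tiledb

cfg = tiledb.Config({
    "vfs.s3.multipart_part_size": str(50 * 1024**2),  # S = 50 MiB
    "vfs.s3.max_parallel_ops": "16",                  # O = 16
})
```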

Python multiprocessing

TileDB’s Python integration works well with Python’s multiprocessing package and the ProcessPoolExecutor and ThreadPoolExecutor classes of the concurrent.futures package. You can find a large usage example demonstrating parallel CSV ingestion in the TileDB-Inc/TileDB-Py repo. You can run this example in either ThreadPool or ProcessPool mode.

Caution

The default multiprocessing execution method for ProcessPoolExecutor on Linux is not compatible with TileDB (nor with most other multi-threaded applications) due to complications of the global process state with the fork start method. You must use ProcessPoolExecutor with multiprocessing.set_start_method("spawn") to avoid unexpected behavior, such as hangs and crashes.

For more information about the different start methods, refer to Python’s documentation on contexts and start methods.
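
A self-contained sketch of the safe pattern (URI, slices, and worker count are placeholders):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import tiledb

def read_slice(uri, start, stop):
    # Each worker process opens the array independently.
    with tiledb.open(uri) as A:
        return A[start:stop]

if __name__ == "__main__":
    # "spawn" avoids the fork-related hangs and crashes described above.
    multiprocessing.set_start_method("spawn")
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(read_slice, "my_array", i, i + 100)
            for i in range(0, 1_000, 100)
        ]
        results = [f.result() for f in futures]
```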

System parameters

Hardware parallelism

The number of cores and hardware threads of the machine affects the amount of parallelism TileDB can use internally to speed up reads, writes, compression, and decompression.

Storage backends

The different types of storage backend (cloud object stores and local disk) have different throughput and latency characteristics, which can impact query time.
