1. Structure
  2. Arrays
  3. Foundation
  4. Key Concepts
  5. Compute
  6. Parallelism
  • Home
  • What is TileDB?
  • Get Started
  • Explore Content
  • Accounts
    • Individual Accounts
      • Apply for the Free Tier
      • Profile
        • Overview
        • Cloud Credentials
        • Storage Paths
        • REST API Tokens
        • Credits
    • Organization Admins
      • Create an Organization
      • Profile
        • Overview
        • Members
        • Cloud Credentials
        • Storage Paths
        • Billing
      • API Tokens
    • Organization Members
      • Organization Invitations
      • Profile
        • Overview
        • Members
        • Cloud Credentials
        • Storage Paths
        • Billing
      • API Tokens
  • Catalog
    • Introduction
    • Data
      • Arrays
      • Tables
      • Single-Cell (SOMA)
      • Genomics (VCF)
      • Biomedical Imaging
      • Vector Search
      • Files
    • Code
      • Notebooks
      • Dashboards
      • User-Defined Functions
      • Task Graphs
      • ML Models
    • Groups
    • Marketplace
    • Search
  • Collaborate
    • Introduction
    • Organizations
    • Access Control
      • Introduction
      • Share Assets
      • Asset Permissions
      • Public Assets
    • Logging
    • Marketplace
  • Analyze
    • Introduction
    • Slice Data
    • Multi-Region Redirection
    • Notebooks
      • Launch a Notebook
      • Usage
      • Widgets
      • Notebook Image Dependencies
    • Dashboards
      • Dashboards
      • Streamlit
    • Preview
    • User-Defined Functions
    • Task Graphs
    • Serverless SQL
    • Monitor
      • Task Log
      • Task Graph Log
  • Scale
    • Introduction
    • Task Graphs
    • API Usage
  • Structure
    • Why Structure Is Important
    • Arrays
      • Introduction
      • Quickstart
      • Foundation
        • Array Data Model
        • Key Concepts
          • Storage
            • Arrays
            • Dimensions
            • Attributes
            • Cells
            • Domain
            • Tiles
            • Data Layout
            • Compression
            • Encryption
            • Tile Filters
            • Array Schema
            • Schema Evolution
            • Fragments
            • Fragment Metadata
            • Commits
            • Indexing
            • Array Metadata
            • Datetimes
            • Groups
            • Object Stores
          • Compute
            • Writes
            • Deletions
            • Consolidation
            • Vacuuming
            • Time Traveling
            • Reads
            • Query Conditions
            • Aggregates
            • User-Defined Functions
            • Distributed Compute
            • Concurrency
            • Parallelism
        • Storage Format Spec
      • Tutorials
        • Basics
          • Basic Dense Array
          • Basic Sparse Array
          • Array Metadata
          • Compression
          • Encryption
          • Data Layout
          • Tile Filters
          • Datetimes
          • Multiple Attributes
          • Variable-Length Attributes
          • String Dimensions
          • Nullable Attributes
          • Multi-Range Reads
          • Query Conditions
          • Aggregates
          • Deletions
          • Catching Errors
          • Configuration
          • Basic S3 Example
          • Basic TileDB Cloud
          • fromDataFrame
          • Palmer Penguins
        • Advanced
          • Schema Evolution
          • Advanced Writes
            • Write at a Timestamp
            • Get Fragment Info
            • Consolidation
              • Fragments
              • Fragment List
              • Consolidation Plan
              • Commits
              • Fragment Metadata
              • Array Metadata
            • Vacuuming
              • Fragments
              • Commits
              • Fragment Metadata
              • Array Metadata
          • Advanced Reads
            • Get Fragment Info
            • Time Traveling
              • Introduction
              • Fragments
              • Array Metadata
              • Schema Evolution
          • Array Upgrade
          • Backends
            • Amazon S3
            • Azure Blob Storage
            • Google Cloud Storage
            • MinIO
            • Lustre
          • Virtual Filesystem
          • User-Defined Functions
          • Distributed Compute
          • Result Estimation
          • Incomplete Queries
        • Management
          • Array Schema
          • Groups
          • Object Management
        • Performance
          • Summary of Factors
          • Dense vs. Sparse
          • Dimensions vs. Attributes
          • Compression
          • Tiling and Data Layout
          • Tuning Writes
          • Tuning Reads
      • API Reference
    • Tables
      • Introduction
      • Quickstart
      • Foundation
        • Data Model
        • Key Concepts
          • Indexes
          • Columnar Storage
          • Compression
          • Data Manipulation
          • Optimize Tables
          • ACID
          • Serverless SQL
          • SQL Connectors
          • Dataframes
          • CSV Ingestion
      • Tutorials
        • Basics
          • Ingestion with SQL
          • CSV Ingestion
          • Basic S3 Example
          • Running Locally
        • Advanced
          • Scalable Ingestion
          • Scalable Queries
      • API Reference
    • AI & ML
      • Vector Search
        • Introduction
        • Quickstart
        • Foundation
          • Data Model
          • Key Concepts
            • Vector Search
            • Vector Databases
            • Algorithms
            • Distance Metrics
            • Updates
            • Deployment Methods
            • Architecture
            • Distributed Compute
          • Storage Format Spec
        • Tutorials
          • Basics
            • Ingestion & Querying
            • Updates
            • Deletions
            • Basic S3 Example
            • Running Locally
          • Advanced
            • Versioning
            • Time Traveling
            • Consolidation
            • Distributed Compute
            • RAG LLM
            • LLM Memory
            • File Search
            • Image Search
            • Protein Search
          • Performance
        • API Reference
      • ML Models
        • Introduction
        • Quickstart
        • Foundation
          • Basics
          • Storage
          • Cloud Execution
          • Why TileDB for Machine Learning
        • Tutorials
          • Ingestion
            • Data Ingestion
              • Dense Datasets
              • Sparse Datasets
            • ML Model Ingestion
          • Management
            • Array Schema
            • Machine Learning: Groups
            • Time Traveling
    • Life Sciences
      • Single-cell
        • Introduction
        • Quickstart
        • Foundation
          • Data Model
          • Key Concepts
            • Data Structures
            • Use of Apache Arrow
            • Join IDs
            • State Management
            • TileDB Cloud URIs
          • SOMA API Specification
        • Tutorials
          • Data Ingestion
          • Bulk Ingestion Tutorial
          • Data Access
          • Distributed Compute
          • Basic S3 Example
          • Multi-Experiment Queries
          • Appending Data to a SOMA Experiment
          • Add New Measurements
          • SQL Queries
          • Running Locally
          • Shapes in TileDB-SOMA
          • Drug Discovery App
        • Spatial
          • Introduction
          • Foundation
            • Spatial Data Model
            • Data Structures
          • Tutorials
            • Spatial Data Ingestion
            • Access Spatial Data
            • Manage Coordinate Spaces
        • API Reference
      • Population Genomics
        • Introduction
        • Quickstart
        • Foundation
          • Data Model
          • Key Concepts
            • The N+1 Problem
            • Architecture
            • Arrays
            • Ingestion
            • Reads
            • Variant Statistics
            • Annotations
            • User-Defined Functions
            • Tables and SQL
            • Distributed Compute
          • Storage Format Spec
        • Tutorials
          • Basics
            • Basic Ingestion
            • Basic Queries
            • Export to VCF
            • Add New Samples
            • Deleting Samples
            • Basic S3 Example
            • Basic TileDB Cloud
          • Advanced
            • Scalable Ingestion
            • Scalable Queries
            • Query Transforms
            • Handling Large Queries
            • Annotations
              • Finding Annotations
              • Embedded Annotations
              • External Annotations
              • Annotation VCFs
              • Ingesting Annotations
            • Variant Statistics
            • Tables and SQL
            • User-Defined Functions
            • Sample Metadata
            • Split VCF
          • Performance
        • API Reference
          • Command Line Interface
          • Python API
          • Cloud API
      • Biomedical Imaging
        • Introduction
        • Foundation
          • Data Model
          • Key Concepts
            • Arrays
            • Ingestion
            • Reads
            • User Defined Functions
          • Storage Format Spec
        • Quickstart
        • Tutorials
          • Basics
            • Ingestion
            • Read
              • OpenSlide
              • TileDB-Py
          • Advanced
            • Batched Ingestion
            • Chunked Ingestion
            • Machine Learning
              • PyTorch
            • Napari
    • Files
  • API Reference
  • Self-Hosting
    • Installation
    • Upgrades
    • Administrative Tasks
    • Image Customization
      • Customize User-Defined Function Images
      • AWS ECR Container Registry
      • Customize Jupyter Notebook Images
    • Single Sign-On
      • Configure Single Sign-On
      • OpenID Connect
      • Okta SCIM
      • Microsoft Entra
  • Glossary

On this page

  • Reads
    • Parallel I/O
    • Parallel tile unfiltering
  • Writes
    • Parallel tile filtering
    • Parallel I/O
  1. Structure
  2. Arrays
  3. Foundation
  4. Key Concepts
  5. Compute
  6. Parallelism

Parallelism

TileDB is fully paralyzed internally, meaning it uses multiple threads to process in parallel the most heavyweight tasks.

TileDB’s design supports parallelized read and write queries. This document outlines parallelized reds and writes and the configuration parameters you can use to control the amount of parallelization for both. Visit Configuration for a full list of configuration parameters and more details about each of the following parameters mentioned here.

Reads

When you perform a read query, TileDB goes through the following steps in this order:

  1. Identify the physical attribute data tiles relevant to the query, and prune the rest.
  2. Perform parallel I/O to retrieve those tiles from the storage backend.
  3. Unfilter on the data tiles in parallel to get the raw cell values and coordinates.
  4. Perform a refining step to get the actual results and organize them in the query layout.

TileDB parallelizes all steps, but the following subsections focus on steps 2 and 3, which are the most heavyweight.

Parallel I/O

TileDB reads the relevant tiles from all attributes to be read in parallel based on the following pseudocode:

parallel_for_each attribute being read:
  parallel_for_each relevant tile of the attribute read:
    prepare a byte range to be read (I/O task)

TileDB computes the byte ranges required to be fetched from each attribute file. Those byte ranges might be disconnected, and there might be many of them, especially with multi-range subarrays. To reduce the latency of the I/O requests (especially on cloud object stores), TileDB tries to merge byte ranges that are close to each other and dispatch fewer, larger I/O requests instead of many smaller ones. More specifically, TileDB merges two byte ranges if their gap size is smaller than "vfs.min_batch_gap" and their resulting size is smaller than "vfs.min_batch_size". Then, each byte range (always corresponding to the same attribute file) becomes an I/O task. TileDB dispatches these I/O tasks for concurrent execution, where the "sm.io_concurrency_level parameter" configuration parameter controls the maximum level of concurrency.

TileDB may further partition each byte range to be fetched based on the value of "vfs.file.max_parallel_ops" (for POSIX and Windows systems), "vfs.<backend>.max_parallel_ops" (for cloud object stores, where <backend> is either s3, azure, or gcs) and "vfs.min_parallel_size". TileDB then reads those partitions in parallel.

Parallel tile unfiltering

Once the relevant data tiles are in main memory, TileDB unfilters them (that is, it runs the filters applied during writes in reverse) in parallel in a nested manner, based on the following pseudocode:

for_each attribute being read:
  parallel_for_each tile of the attribute read:
    parallel_for_each chunk of the tile:
      unfilter chunk

The TileDB filter list has a parameter that controls the size of each “chunk” of a tile. This parameter defaults to 64 KB.

The "sm.compute_concurrency_level" configuration parameter affects these for loops, although it is not recommended to modify this configuration parameter from its default setting. The nested parallelism in reads allows for maximum usage of the available cores for filtering (such as compression and decompression), in either the case where the query intersects fewer, larger tiles or more, smaller tiles.

Writes

When you perform a write query, TileDB goes through the following steps in this order:

  1. Reorganize the cells in the global cell order and into attribute data tiles.
  2. Filter the attribute data tiles to be written.
  3. Perform parallel I/O to write those tiles to the storage backend.

TileDB parallelizes all steps, but the following subsections focus on steps 2 and 3, which are the most heavyweight.

Parallel tile filtering

TileDB uses a similar strategy for writes as it does for reads:

parallel_for_each attribute being written:
  for_each tile of the attribute being written:
    parallel_for_each chunk of the tile:
      filter chunk

Similar to reads, the "sm.compute_concurrency_level" parameter affects these for loops, although it is not recommended to modify this configuration parameter from its default setting.

Parallel I/O

Similar to reads, TileDB creates I/O tasks for each tile of every attribute. TileDB then dispatches these I/O tasks for concurrent execution, where the "sm.io_concurrency_level" configuration parameter controls the maximum level of concurrency.

For POSIX and Windows, if a data tile is large enough, the VFS layer partitions the tile based on configuration parameters "vfs.file.max_parallel_ops" and "vfs.min_parallel_size". TileDB then writes those partitions in parallel by using the VFS thread pool. The "vfs.io_concurrency" configuration parameter controls the size of the VFS thread pool.

For cloud object stores, TileDB buffers potentially many tiles and issues parallel, multipart upload requests to cloud storage. The size of the buffer is equal to \(S × O\), with \(S\) being the value of "vfs.<backend>.multipart_part_size", \(O\) being the value of "vfs.<backend>.max_parallel_ops", and <backend> being s3, azure, or gcp, depending on your cloud storage backend. When the buffer is filled, TileDB issues "vfs.<backend>.max_parallel_ops" parallel, multipart upload requests to cloud storage.

Concurrency
Storage Format Spec