
Population Genomics Performance

Tips for boosting performance when using TileDB-VCF.
How to run this tutorial

You can run this tutorial in two ways:

  1. Locally on your machine.
  2. On TileDB Cloud.

However, since TileDB Cloud offers a free tier, we strongly recommend signing up and running everything there, as that requires no installation or deployment.

This tutorial provides a number of examples that are useful for tuning performance when working with TileDB-VCF.

Ingestion

This section contains tips for improving the performance of TileDB-VCF ingestion.

Direct ingestion

When ingesting a TileDB-VCF dataset to remote object storage, the data can be written to a tiledb:// URI or directly to your object store. Ingesting to a tiledb:// URI provides the authentication, access control, and logging features of TileDB, with only a small performance cost.

Direct ingestion to object stores (for example, an s3:// URI) provides slightly better performance for large, cost-sensitive datasets. This approach relies on the object store’s access controls alone (for example, Amazon S3 credentials) during ingestion. The writes to the dataset are not logged in TileDB Cloud. Scalable ingestion registers the dataset on TileDB Cloud with a tiledb:// URI, so all reads will take advantage of authentication, access control, and logging provided by TileDB Cloud.
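As an illustration, the choice comes down to the URI with which you create the dataset. This is a minimal sketch; the namespace, bucket, and dataset name are hypothetical placeholders:

import tiledbvcf

# Governed ingestion through TileDB Cloud (hypothetical namespace/bucket):
# authentication, access control, and logging apply to the writes
cloud_uri = "tiledb://my-namespace/s3://my-bucket/my-vcf-dataset"

# Direct ingestion to object storage: slightly faster, but access control
# during ingestion relies on S3 credentials alone, and writes are not logged
direct_uri = "s3://my-bucket/my-vcf-dataset"

ds = tiledbvcf.Dataset(cloud_uri, mode="w")
ds.create_dataset()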

Small datasets

The TileDB-VCF ingest_samples API has a number of parameters with default values optimized for ingesting large datasets on TileDB Cloud. This section describes how to set these parameters to improve performance when ingesting small datasets locally or in a TileDB notebook.

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = "./performance-ingestion"

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)
TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.3

Set the following tiledb_config options:

  1. "vfs.s3.no_sign_request" - Set to True to reduce the cost of accessing public data by removing the need for AWS access credentials.
  2. "vfs.s3.region" - Set to us-east-1 to match the location of the public TileDB demo data. This avoids file access issues caused by a different default region in a local environment.

Set the following ingest_samples options:

  1. threads - Reduce the number of threads; for these small demo datasets, the per-thread overhead outweighs the performance benefit of additional threads.
  2. total_memory_budget_mb - Reduce the total memory budget, since the default value assumes all memory on the system is available for ingestion (true for a TileDB Cloud UDF node, but usually not for a local machine).
# Set list of VCF files to ingest
sample_list = [
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00096_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00097_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00099_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00100_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00101_chr21.vcf.gz",
]

# Set config for reading public s3 data
tiledb_config = {
    "vfs.s3.no_sign_request": True,
    "vfs.s3.region": "us-east-1",
}

# Create a dataset
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset()

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

Clean up the created TileDB-VCF dataset.

# Clean up VCF dataset if it exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

Reads

This section contains tips for improving the performance of TileDB-VCF reads.

Limit the amount of data read

One way to improve read performance is to limit the data read to what is actually needed. Data transfer accounts for a major portion of read query time, so reducing the amount of data transferred improves read performance. This sounds like an obvious recommendation, but it is sometimes overlooked.

The amount of data read can be controlled with the following parameters of the read method (see the sketch after this list):

  1. attrs - Specify only the attributes needed in the downstream analysis. Avoid reading all attributes unless all attributes are required.
  2. regions - Specify only the regions needed in the downstream analysis. When querying regions in multiple chromosomes, use a scalable query on TileDB Cloud.
  3. samples - Specify only the samples needed in downstream analysis. When querying a large number of samples, use a scalable query on TileDB Cloud.
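As a minimal sketch, a read that restricts all three parameters might look as follows, assuming a dataset at vcf_uri ingested as in the previous section (before the cleanup step); the attribute names, region, and samples are illustrative:

import tiledbvcf

# Open the dataset for reading
ds = tiledbvcf.Dataset(vcf_uri, mode="r", tiledb_config=tiledb_config)

# Read only the needed attributes, regions, and samples
df = ds.read(
    attrs=["sample_name", "pos_start", "pos_end", "fmt_GT"],
    regions=["chr21:5030000-5050000"],
    samples=["HG00096", "HG00097"],
)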

Increase the memory budget

Avoid incomplete queries by increasing the memory budget when opening the dataset for reading, as shown in this example:

import tiledbvcf

# Set the read memory budget to 8 GiB (uri is a placeholder for your dataset)
cfg = tiledbvcf.ReadConfig(memory_budget_mb=8192)
ds = tiledbvcf.Dataset(uri, mode="r", cfg=cfg)

Use scalable queries

Scalable queries provide an automated way to partition and distribute a query on TileDB Cloud. Partitioning and distributing a query yields a large performance improvement, especially when querying large datasets.
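To make the idea concrete, the following sketch partitions a query by region manually and runs each partition as a serverless UDF with the tiledb-cloud package. The dataset URI and region boundaries are placeholders, and the Scalable Queries tutorial covers the automated approach:

import tiledb.cloud

def read_region(dataset_uri, region):
    # This function executes server-side on TileDB Cloud
    import tiledbvcf
    ds = tiledbvcf.Dataset(dataset_uri, mode="r")
    return ds.read(attrs=["sample_name", "pos_start", "fmt_GT"], regions=[region])

# Launch one serverless UDF per partition
# (assumes you are logged in to TileDB Cloud; boundaries are illustrative)
partitions = ["chr21:1-25000000", "chr21:25000001-46709983"]
results = [
    tiledb.cloud.udf.exec(read_region, "tiledb://my-namespace/my-vcf-dataset", r)
    for r in partitions
]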

Anchor gap

As described in Key Concepts: Ingestion, anchors are inserted into the data array to enable rapid retrieval of interval intersections. The number of anchors inserted depends on the anchor gap parameter and on the type of variant data stored in the dataset. In general:

  • VCF data contains few variants with long ranges.
  • gVCF data contains many reference blocks with long ranges.
  • CNV data contains fewer variants with very long ranges.

The anchor gap is set when the dataset is created and affects both ingestion and read performance.

Ingestion

The anchor gap controls how many anchors are inserted into the data array. Decreasing the size of the anchor gap increases the number of anchors inserted, which increases the size of the dataset and the ingestion time.

Read performance

Each read query is expanded by the anchor gap defined for the dataset. Increasing the anchor gap increases the likelihood of reading unneeded data that must then be filtered out. Read query time grows with the time required to transfer and filter this additional data.

Recommendations

The anchor gap default value of 1,000 works well for VCF and gVCF data, with very little impact on ingestion and read performance. For CNV data, which contains a small number of variants with very long ranges, an anchor gap value of 1,000,000 is recommended. This reduces the number of anchors inserted, which improves ingestion performance, and has little impact on read performance because of the small number of variants.
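For example, the anchor gap can be set at dataset creation time through the anchor_gap parameter of create_dataset. This is a minimal sketch for a hypothetical CNV dataset; cnv_uri is a placeholder:

import tiledbvcf

# Hypothetical URI for a CNV dataset
cnv_uri = "./cnv-dataset"

# Create the dataset with a larger anchor gap (the default is 1,000)
ds = tiledbvcf.Dataset(cnv_uri, mode="w")
ds.create_dataset(anchor_gap=1_000_000)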

Compression

TileDB’s columnar storage format compresses VCF data very well, often achieving a 50% reduction in the size of the already compressed *.vcf.gz files. By default, TileDB-VCF compresses VCF attributes with zstd level 4, which provides a good balance between compression ratio and ingestion cost. A compression_level argument is available for advanced users who want to explore the compression ratio vs. ingestion cost tradeoff at other compression levels.

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name. Also, create the list of VCF URIs used in the examples and an S3 config for reading the public S3 data.

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = "./compression"

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Set list of VCF files to ingest
sample_list = [
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00096_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00097_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00099_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00100_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00101_chr21.vcf.gz",
]

# Set config for reading public s3 data
tiledb_config = {
    "vfs.s3.no_sign_request": True,
    "vfs.s3.region": "us-east-1",
}
TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.3

Next, ingest the example VCF files using the default compression level and check the total size of the dataset. For this example, the dataset size is 30 MiB.

# Create a dataset with zstd level 4 (default)
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset()

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

# Report the total size of the dataset
total_size = !du -bch {vcf_uri} | tail -n 1
print(f"Total size = {total_size[0].split()[0]}iB")

# Remove the dataset
shutil.rmtree(vcf_uri)
Total size = 30MiB

Finally, ingest the example VCF files using compression level 17 and check the total size of the dataset. For this example, the dataset size is 24 MiB.

# Create a dataset with zstd level 17
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset(compression_level=17)

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

# Report the total size of the dataset
total_size = !du -bch {vcf_uri} | tail -n 1
print(f"Total size = {total_size[0].split()[0]}iB")

# Remove the dataset
shutil.rmtree(vcf_uri)
Total size = 24MiB

Since the compression ratio is highly data-dependent, evaluate any compression_level tradeoffs with the real VCF data that will be ingested.
