Tips for boosting performance when using TileDB-VCF.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as it requires no installation or deployment.
This tutorial provides a number of examples that are useful for tuning performance when working with TileDB-VCF.
Ingestion
This section contains tips for improving the performance of TileDB-VCF ingestion.
Direct ingestion
When ingesting a TileDB-VCF dataset to remote object storage, the data can be written to a tiledb:// URI or directly to your object store. Ingesting to a tiledb:// URI provides the authentication, access control, and logging features of TileDB Cloud, with only a small performance overhead.
Direct ingestion to object stores (for example, an s3:// URI) provides slightly better performance for large, cost-sensitive datasets. This approach relies on the object store’s access controls alone (for example, Amazon S3 credentials) during ingestion. The writes to the dataset are not logged in TileDB Cloud. Scalable ingestion registers the dataset on TileDB Cloud with a tiledb:// URI, so all reads will take advantage of authentication, access control, and logging provided by TileDB Cloud.
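As a rough sketch of the two approaches (the bucket, namespace, and dataset names below are placeholders, and the appropriate S3 or TileDB Cloud credentials are assumed to be configured separately):

import tiledbvcf

# Direct ingestion to object storage (placeholder bucket and sample paths)
ds = tiledbvcf.Dataset("s3://my-bucket/vcf/my-dataset", mode="w")
ds.create_dataset()
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])

# After the dataset is registered on TileDB Cloud, reads through a tiledb:// URI
# get the authentication, access control, and logging described above (placeholder namespace)
ds = tiledbvcf.Dataset("tiledb://my-namespace/my-dataset", mode="r")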
Small datasets
The TileDB-VCF ingest_samples API has a number of parameters with default values optimized for ingesting large datasets on TileDB Cloud. This section describes how to set these parameters to improve performance when ingesting small datasets locally or in a TileDB notebook.
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
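A setup sketch, mirroring the compression example later in this tutorial; the ./small_datasets path is illustrative:

import tiledbvcf
import shutil
import os.path

# Set VCF dataset URI (illustrative local path)
vcf_uri = "./small_datasets"

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)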
Set the following TileDB configuration options:
vfs.s3.no_sign_request - Set to True to reduce the overhead of accessing public data by removing the need for AWS access credentials.
vfs.s3.region - Set to us-east-1 to match the location of the public TileDB demo data. This avoids file access issues caused by a different default region in a local environment.
Set the following ingest_samples options:
threads - Reduce the number of threads, since the per-thread overhead outweighs the performance benefit when working with the small demo datasets.
total_memory_budget_mb - Reduce the total memory budget, since the default value assumes all memory on the system is available for ingestion (which is true for a TileDB Cloud UDF node).
# Set list of VCF files to ingest
sample_list = [
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00096_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00097_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00099_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00100_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00101_chr21.vcf.gz",
]

# Set config for reading public s3 data
tiledb_config = {
    "vfs.s3.no_sign_request": True,
    "vfs.s3.region": "us-east-1",
}

# Create a dataset
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset()

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)
# Clean up VCF dataset if it exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)
Reads
This section contains tips for improving the performance of TileDB-VCF reads.
Limit the amount of data read
One way to improve read performance is to limit the amount of data read to what is actually needed. Data transfer accounts for a major portion of read query time, so reducing the amount of data transferred improves read performance. This sounds like an obvious recommendation, but it is sometimes overlooked.
The amount of data read can be controlled with the following parameters of the read method (see the sketch after this list):
attrs - Specify only the attributes needed in the downstream analysis. Avoid reading all attributes unless all attributes are required.
regions - Specify only the regions needed in the downstream analysis. When querying regions in multiple chromosomes, use a scalable query on TileDB Cloud.
samples - Specify only the samples needed in downstream analysis. When querying a large number of samples, use a scalable query on TileDB Cloud.
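As a minimal sketch, assuming ds is a TileDB-VCF dataset opened for reading, with illustrative attribute, region, and sample values:

# Read only the attributes, region, and samples needed downstream (illustrative values)
df = ds.read(
    attrs=["sample_name", "pos_start", "pos_end", "fmt_GT"],
    regions=["chr21:5030000-5050000"],
    samples=["HG00096", "HG00097"],
)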
Increase the memory budget
Avoid incomplete queries by increasing the memory budget when opening the dataset for reading, as shown in this example:
# Set memory budget to 8 GiB
cfg = tiledbvcf.ReadConfig(memory_budget_mb=8192)
ds = tiledbvcf.Dataset(uri, cfg=cfg)
Use scalable queries
Scalable queries provide an automated solution to partitioning and distributing a query on TileDB Cloud. Partitioning and distributing queries provides a large performance improvement, especially when querying large datasets.
Anchor gap
As described in the Key Concepts: Ingestion section, anchors are inserted into the data array to enable rapid retrieval of interval intersections. The number of anchors inserted depends on the anchor gap parameter and the type of variant data stored in the dataset. In general:
VCF data contains few variants with long ranges.
gVCF data contains many reference blocks with long ranges.
CNV data contains fewer variants with very long ranges.
The anchor gap is defined during array creation and has an impact on ingestion and read performance.
Ingestion
The anchor gap controls how many anchors are inserted into the data array. Decreasing the size of the anchor gap increases the number of anchors inserted, which increases the size of the dataset and the ingestion time.
Read performance
Each read query is expanded by the anchor gap size defined in the dataset. Increasing the size of the anchor gap increases the potential of reading unneeded data that must be filtered out. The read query time increases due to the time required to transfer the additional data and the time required to filter the data.
Recommendations
The anchor gap default value of 1,000 works well for VCF and gVCF data, with very little impact on ingestion and read performance. For CNV data, which contains a small number of variants with very long ranges, an anchor gap of 1,000,000 is recommended. The larger gap reduces the number of anchors inserted, which improves ingestion performance and has little impact on read performance due to the smaller number of variants.
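For example, a minimal sketch of setting a larger anchor gap when creating a dataset intended for CNV data (the ./cnv_example path is a placeholder, and this assumes create_dataset accepts an anchor_gap argument):

# Create a dataset with a larger anchor gap, suited to CNV data (placeholder path)
ds = tiledbvcf.Dataset("./cnv_example", mode="w")
ds.create_dataset(anchor_gap=1_000_000)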
Compression
TileDB’s columnar storage format provides excellent compression of VCF data, often achieving a 50% reduction in the size of the already compressed *.vcf.gz VCF files. By default, TileDB-VCF uses zstd level 4 to compress VCF attributes, which provides a good balance between compression ratio and ingestion cost. A compression_level argument is provided for advanced users who want to explore the compression ratio vs. ingestion cost tradeoffs at other compression levels.
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name. Also, create the list of VCF URIs used in the examples and a config for reading the public S3 data.
import tiledb
import tiledbvcf
import shutil
import os.path

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = "./compression"

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Set list of VCF files to ingest
sample_list = [
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00096_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00097_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00099_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00100_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00101_chr21.vcf.gz",
]

# Set config for reading public s3 data
tiledb_config = {
    "vfs.s3.no_sign_request": True,
    "vfs.s3.region": "us-east-1",
}
Next, ingest the example VCF files using the default compression level and check the total size of the dataset. For this example, the dataset size is 30 MiB.
# Create a dataset with zstd level 4 (default)
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset()

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

# Report the total size of the dataset
total_size = !du -bch {vcf_uri} | tail -n 1
print(f"Total size = {total_size[0].split()[0]}iB")

# Remove the dataset
shutil.rmtree(vcf_uri)
Total size = 30MiB
Finally, ingest the example VCF files using compression level 17 and check the total size of the dataset. For this example, the dataset size is 24 MiB.
# Create a dataset with zstd level 17
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset(compression_level=17)

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

# Report the total size of the dataset
total_size = !du -bch {vcf_uri} | tail -n 1
print(f"Total size = {total_size[0].split()[0]}iB")

# Remove the dataset
shutil.rmtree(vcf_uri)
Total size = 24MiB
Since the compression ratio is highly data dependent, evaluate any compression_level tradeoffs with the real VCF data that will be ingested.