Tips for boosting performance when using TileDB-VCF.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as it requires no installation or deployment.
This tutorial provides a number of examples that are useful for tuning performance when working with TileDB-VCF.
Ingestion
This section contains tips for improving the performance of TileDB-VCF ingestion.
Direct ingestion
When ingesting a TileDB-VCF dataset to remote object storage, the data can be written to a tiledb:// URI or directly to your object store. Ingesting to a tiledb:// URI provides the authentication, access control, and logging features of TileDB Cloud, with only a small performance overhead.
Direct ingestion to object stores (for example, an s3:// URI) provides slightly better performance for large, cost-sensitive datasets. This approach relies on the object store’s access controls alone (for example, Amazon S3 credentials) during ingestion. The writes to the dataset are not logged in TileDB Cloud. Scalable ingestion registers the dataset on TileDB Cloud with a tiledb:// URI, so all reads will take advantage of authentication, access control, and logging provided by TileDB Cloud.
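As a rough sketch of the two approaches (the bucket, namespace, and dataset names below are placeholders, and the appropriate S3 or TileDB Cloud credentials are assumed to be configured separately):

import tiledbvcf

# Direct ingestion to object storage (placeholder bucket and sample paths)
ds = tiledbvcf.Dataset("s3://my-bucket/vcf/my-dataset", mode="w")
ds.create_dataset()
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])

# After the dataset is registered on TileDB Cloud, reads through a tiledb:// URI
# get the authentication, access control, and logging described above (placeholder namespace)
ds = tiledbvcf.Dataset("tiledb://my-namespace/my-dataset", mode="r")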
Small datasets
The TileDB-VCF ingest_samples API has a number of parameters with default values optimized for ingesting large datasets on TileDB Cloud. This section describes how to set these parameters to improve performance when ingesting small datasets locally or in a TileDB notebook.
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
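A setup sketch, mirroring the compression example later in this tutorial; the ./small_datasets path is illustrative:

import tiledbvcf
import shutil
import os.path

# Set VCF dataset URI (illustrative local path)
vcf_uri = "./small_datasets"

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)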
Set the following TileDB configuration options:
vfs.s3.no_sign_request - Set to True to reduce the overhead of accessing public data by removing the need for AWS access credentials.
vfs.s3.region - Set to us-east-1 to match the location of the public TileDB demo data. This avoids file access issues caused by a different default region in a local environment.
Set the following ingest_samples options:
threads - Reduce the number of threads, since the per-thread overhead outweighs the performance benefit when working with the small demo datasets.
total_memory_budget_mb - Reduce the total memory budget, since the default value assumes all memory on the system is available for ingestion (which is true for a TileDB Cloud UDF node).
# Set list of VCF files to ingest
sample_list = [
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00096_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00097_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00099_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00100_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00101_chr21.vcf.gz",
]

# Set config for reading public s3 data
tiledb_config = {
    "vfs.s3.no_sign_request": True,
    "vfs.s3.region": "us-east-1",
}

# Create a dataset
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset()

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)
# Clean up VCF dataset if it exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)
Reads
This section contains tips for improving the performance of TileDB-VCF reads.
Limit the amount of data read
One way to improve read performance is to limit the amount of data read to what is actually needed. Data transfer accounts for a major portion of read query time, so reducing the amount of data transferred improves read performance. This sounds like an obvious recommendation, but it is sometimes overlooked.
The amount of data read can be controlled with the following parameters of the read method (see the sketch after this list):
attrs - Specify only the attributes needed in the downstream analysis. Avoid reading all attributes unless all attributes are required.
regions - Specify only the regions needed in the downstream analysis. When querying regions in multiple chromosomes, use a scalable query on TileDB Cloud.
samples - Specify only the samples needed in downstream analysis. When querying a large number of samples, use a scalable query on TileDB Cloud.
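As a minimal sketch, assuming ds is a TileDB-VCF dataset opened for reading, with illustrative attribute, region, and sample values:

# Read only the attributes, region, and samples needed downstream (illustrative values)
df = ds.read(
    attrs=["sample_name", "pos_start", "pos_end", "fmt_GT"],
    regions=["chr21:5030000-5050000"],
    samples=["HG00096", "HG00097"],
)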
Increase the memory budget
Avoid incomplete queries by increasing the memory budget when opening the dataset for reading, as shown in this example:
# Set memory budget to 8 GiB
cfg = tiledbvcf.ReadConfig(memory_budget_mb=8192)
ds = tiledbvcf.Dataset(uri, cfg=cfg)
Use scalable queries
Scalable queries provide an automated solution to partitioning and distributing a query on TileDB Cloud. Partitioning and distributing queries provides a large performance improvement, especially when querying large datasets.
Anchor gap
As described in the Key Concepts: Ingestion section, anchors are inserted into the data array to enable rapid retrieval of interval intersections. The number of anchors inserted depends on the anchor gap parameter and the type of variant data stored in the dataset. In general:
VCF data contains few variants with long ranges.
gVCF data contains many reference blocks with long ranges.
CNV data contains fewer variants with very long ranges.
The anchor gap is defined during array creation and has an impact on ingestion and read performance.
Ingestion
The anchor gap controls how many anchors are inserted into the data array. Decreasing the size of the anchor gap increases the number of anchors inserted, which increases the size of the dataset and the ingestion time.
Read performance
Each read query is expanded by the anchor gap size defined in the dataset. Increasing the size of the anchor gap increases the potential of reading unneeded data that must be filtered out. The read query time increases due to the time required to transfer the additional data and the time required to filter the data.
Recommendations
The anchor gap default value of 1,000 works well for VCF and gVCF data, with very little impact on ingestion and read performance. For CNV data, which contains a small number of variants with very long ranges, an anchor gap of 1,000,000 is recommended. The larger gap reduces the number of anchors inserted, which improves ingestion performance and has little impact on read performance due to the smaller number of variants.
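For example, a minimal sketch of setting a larger anchor gap when creating a dataset intended for CNV data (the ./cnv_example path is a placeholder, and this assumes create_dataset accepts an anchor_gap argument):

# Create a dataset with a larger anchor gap, suited to CNV data (placeholder path)
ds = tiledbvcf.Dataset("./cnv_example", mode="w")
ds.create_dataset(anchor_gap=1_000_000)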
Compression
TileDB’s columnar storage format provides excellent compression of VCF data, often achieving a 50% reduction in the size of the already compressed *.vcf.gz VCF files. By default, TileDB-VCF uses zstd level 4 to compress VCF attributes, which provides a good balance between compression ratio and ingestion cost. A compression_level argument is provided for advanced users who want to explore the compression ratio vs. ingestion cost tradeoffs at other compression levels.
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name. Also, create the list of VCF URIs used in the examples and a config for reading the public S3 data.
import tiledb
import tiledbvcf
import shutil
import os.path

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = "./compression"

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Set list of VCF files to ingest
sample_list = [
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00096_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00097_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00099_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00100_chr21.vcf.gz",
    "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen-chr21-vcf/HG00101_chr21.vcf.gz",
]

# Set config for reading public s3 data
tiledb_config = {
    "vfs.s3.no_sign_request": True,
    "vfs.s3.region": "us-east-1",
}
Next, ingest the example VCF files using the default compression level and check the total size of the dataset. For this example, the dataset size is 30 MiB.
# Create a dataset with zstd level 4 (default)
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset()

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

# Report the total size of the dataset
total_size = !du -bch {vcf_uri} | tail -n 1
print(f"Total size = {total_size[0].split()[0]}iB")

# Remove the dataset
shutil.rmtree(vcf_uri)
Total size = 30MiB
Finally, ingest the example VCF files using compression level 17 and check the total size of the dataset. For this example, the dataset size is 24 MiB.
# Create a dataset with zstd level 17
ds = tiledbvcf.Dataset(vcf_uri, mode="w", tiledb_config=tiledb_config)
ds.create_dataset(compression_level=17)

# Ingest the VCF files
ds.ingest_samples(sample_list, threads=4, total_memory_budget_mb=4096)

# Report the total size of the dataset
total_size = !du -bch {vcf_uri} | tail -n 1
print(f"Total size = {total_size[0].split()[0]}iB")

# Remove the dataset
shutil.rmtree(vcf_uri)
Total size = 24MiB
Since the compression ratio is highly data dependent, evaluate any compression_level tradeoffs with the real VCF data that will be ingested.