VCF Dataset Reads

Key concepts for reading variant data stored in a TileDB-VCF dataset.

For more foundational information about reading TileDB arrays, visit the Arrays Key Concepts: Reads section.

Slicing

As described in the storage format spec, each TileDB-VCF dataset includes a data array, which contains variant records from the original VCF files. TileDB-VCF provides methods to read the data array that are familiar to people working with genomic data. For example, a slice of the data array is read by specifying:

A list of genomic regions, potentially derived from gene names.
A BED file defining genomic regions.
A list of sample names, defining a cohort of samples.
A list of attribute names, defining the variant-level and sample-level information required for analysis.

The TileDB array data model is optimized for this type of slicing, which makes it well suited to data exploration and discovery.
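
As a minimal sketch of how these inputs fit together, the helper below converts BED records into region strings, and `read_slice` passes regions, samples, and attributes to a single read. It assumes that region strings use the 1-based inclusive `contig:start-end` form and that the `tiledbvcf` Python package is installed. The function names `bed_to_regions` and `read_slice` are hypothetical and used only for illustration.

```python
from typing import Iterable, List


def bed_to_regions(bed_lines: Iterable[str]) -> List[str]:
    """Convert BED records (0-based, half-open) to 1-based inclusive region strings."""
    regions = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        # BED start is 0-based, so add 1; BED end is exclusive, so it stays as-is.
        regions.append(f"{contig}:{int(start) + 1}-{int(end)}")
    return regions


def read_slice(uri, regions, samples, attrs):
    """Read one slice of the data array (assumes tiledbvcf is installed)."""
    import tiledbvcf  # imported lazily so bed_to_regions works without it

    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.read(attrs=attrs, regions=regions, samples=samples)
```

A call such as `read_slice(uri, bed_to_regions(open("regions.bed")), ["sampleA"], ["sample_name", "contig", "pos_start", "alleles"])` would then return the matching records, where the file name and sample name are placeholders.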

Visit the basic queries tutorial for examples of TileDB-VCF read queries.

Incomplete queries

A TileDB-VCF dataset can contain much more data than a single disk can hold, and a TileDB-VCF read can return more data than a single machine can store in memory. The TileDB incomplete query feature provides a mechanism to gracefully handle these larger-than-memory results. Because incomplete queries affect performance, TileDB-VCF provides a memory budget parameter that you can tune to reduce the number of incomplete queries.

Set the memory budget for a tiledbvcf.Dataset object by passing a TileDB-VCF configuration object that contains the memory_budget_mb parameter. This parameter accepts an integer number of megabytes. The following example demonstrates this:

import tiledbvcf

# Set memory budget to 8 GiB
cfg = tiledbvcf.ReadConfig(memory_budget_mb=8192)
# uri is the location of an existing TileDB-VCF dataset
ds = tiledbvcf.Dataset(uri, cfg=cfg)

Important

The memory_budget_mb value must not exceed the machine's total memory minus the memory used by the operating system and all other running processes.
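
The arithmetic behind this limit can be sketched as a small helper. The function name `max_memory_budget_mb` and the extra `headroom_mb` safety margin are assumptions made for illustration and are not part of TileDB-VCF.

```python
def max_memory_budget_mb(total_mb: int, used_by_other_processes_mb: int,
                         headroom_mb: int = 1024) -> int:
    """Return an upper bound for memory_budget_mb.

    total_mb: physical memory of the machine, in megabytes.
    used_by_other_processes_mb: memory already used by the OS and other processes.
    headroom_mb: hypothetical safety margin left unallocated (an assumption).
    """
    available = total_mb - used_by_other_processes_mb - headroom_mb
    if available <= 0:
        raise ValueError("no memory left for a TileDB-VCF memory budget")
    return available
```

For example, on a machine with 16384 MB of memory where other processes use 4096 MB, the helper leaves the default margin free and returns 11264.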

A portion of the memory budget is allocated for result buffers. TileDB fills these result buffers with read results and generates incomplete queries when the result buffers are full and more results remain. A larger memory budget allows larger result buffers, so increasing it reduces the number of incomplete queries and improves performance.
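
The loop below is a minimal sketch of consuming a read that may be incomplete. It assumes a dataset object that exposes `read`, `read_completed`, and `continue_read`, as the `tiledbvcf.Dataset` class does. The generator name `read_in_batches` is hypothetical.

```python
def read_in_batches(ds, **read_kwargs):
    """Yield one batch of results per buffer fill until the query completes.

    ds is expected to behave like a tiledbvcf.Dataset opened for reading.
    read_kwargs are forwarded to ds.read (for example attrs, regions, samples).
    """
    # The first call submits the query and returns whatever fits in the buffers.
    yield ds.read(**read_kwargs)
    # While results remain, resume the same query to fetch the next batch.
    while not ds.read_completed():
        yield ds.continue_read()
```

Processing each batch as it arrives, rather than concatenating all batches, keeps memory use bounded by the size of the result buffers.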

Visit the Tutorials: Handling Large Queries tutorial to learn how incomplete queries work.

Scaling

For TileDB-VCF datasets containing thousands of samples, it is more efficient to read data using the distributed compute architecture of TileDB Cloud. Partitioning and distributing a read query across multiple compute nodes provides these advantages:

Each query accesses a subset of the total data, reducing the required compute resources (CPU and memory) and the query latency.
The queries run in parallel, taking advantage of independent compute resources and reducing the total query latency.
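
The partition-and-distribute idea can be illustrated locally. In this sketch, threads on one machine stand in for independent compute nodes, and `read_fn` is a placeholder for whatever function reads one subset of regions. It does not use the TileDB Cloud API, and the names `partition` and `parallel_read` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def partition(items: Sequence[str], num_partitions: int) -> List[List[str]]:
    """Split items into num_partitions contiguous chunks of near-equal size."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def parallel_read(read_fn: Callable[[List[str]], object],
                  regions: Sequence[str], num_partitions: int) -> list:
    """Run read_fn on each non-empty chunk in parallel; results keep chunk order."""
    chunks = [c for c in partition(regions, num_partitions) if c]
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(read_fn, chunks))
```

Each call to `read_fn` sees only its own chunk of regions, which mirrors how a partitioned query touches a subset of the total data.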

TileDB Cloud fully automates scalable reads, including the genomic slicing features, which makes it easier to run discovery analysis workflows on biobank-scale datasets.

Visit the scalable queries tutorial for examples of TileDB-VCF queries on TileDB Cloud.

Performance tuning

Visit the performance tutorial for examples of tuning the performance of TileDB-VCF read queries.