Ingestion
Ingestion is the process of extracting variant data from VCF files, transforming the data into an analysis-ready format, and loading it into a TileDB-VCF dataset. This section describes the basic ingestion process provided by the TileDB-VCF open-source code, best practices for optimizing read performance on large datasets, and the automation provided by scalable ingestion on TileDB Cloud.
Basic ingestion
The TileDB-VCF open-source code provides basic ingestion functionality for VCF files that meet the following prerequisites:
- Each VCF file contains a single sample.
- The records in each VCF file are sorted.
- Each VCF file is compressed with `bgzip`.
- Each VCF file is indexed with `bcftools`.
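
If a VCF file does not yet meet these prerequisites, it can typically be prepared with standard `bcftools` commands. The following is a minimal sketch in Python; the file names are illustrative, and `bcftools` is assumed to be installed and on the `PATH`:

```python
import subprocess

def prepare_vcf(vcf_path: str) -> str:
    """Sort, bgzip-compress, and index a single-sample VCF file."""
    sorted_gz = vcf_path.replace(".vcf", ".sorted.vcf.gz")
    # `bcftools sort -O z` writes a coordinate-sorted, bgzip-compressed VCF.
    subprocess.run(["bcftools", "sort", "-O", "z", "-o", sorted_gz, vcf_path], check=True)
    # `bcftools index` creates a CSI index next to the compressed VCF.
    subprocess.run(["bcftools", "index", sorted_gz], check=True)
    return sorted_gz
```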
VCF files are parsed with htslib. Therefore, any warnings or errors detected by htslib are reported by TileDB-VCF. If htslib reports any errors in a VCF file, those errors must be fixed before running ingestion. When htslib reads a remote VCF file, it copies the index file to the local system to improve performance. These index files can be removed after ingestion is complete.
The basic TileDB-VCF ingestion code supports ingesting multiple batches of VCF files in parallel and appending additional batches of VCF files over time. These capabilities allow TileDB-VCF datasets to grow in a scalable fashion as described in the data model. As a dataset grows, the arrays in the dataset accumulate fragments. A large number of fragments can impact read performance if the array is not consolidated and vacuumed.
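
As a rough sketch, batched ingestion and later appends might look like the following with the TileDB-VCF Python package (`tiledbvcf`); the dataset URI and file names are illustrative:

```python
import tiledbvcf

uri = "my_vcf_dataset"  # local path or object-store URI (illustrative)

# Create the dataset and ingest an initial batch of single-sample VCF files.
ds = tiledbvcf.Dataset(uri, mode="w")
ds.create_dataset()
ds.ingest_samples(["HG00096.vcf.gz", "HG00097.vcf.gz"])

# Later, append another batch of samples to the same dataset.
ds = tiledbvcf.Dataset(uri, mode="w")
ds.ingest_samples(["HG00099.vcf.gz"])
```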
The following sections describe recommendations for maintaining optimal read performance in a large TileDB-VCF dataset, including consolidation and vacuuming of fragments.
Recommendations
This section contains recommendations for maintaining optimal read performance as a TileDB-VCF dataset grows in size. To provide maximum value to TileDB customers, these recommendations are fully automated in the TileDB-Cloud-based scalable ingestion.
Avoid fragment overlap
When reading from a TileDB array, a powerful method to reduce read query time is to prune fragments that are guaranteed not to contain results for the query. Fragment pruning is enabled by the dimensions of an array and the indexing provided by TileDB. The TileDB-VCF data array dimensions, in order of precedence, are `contig`, `start_pos`, and `sample`, which correspond to `CHROM`, `POS`, and sample ID in the VCF specification.
Illustrating fragment pruning with an example, consider a read query targeting chromosome `chr1`, position `10,000,000`, and sample `HG00096`. TileDB-VCF ingestion stores data for different chromosomes in different fragments. Therefore, only fragments that contain data for `chr1` are considered, and all other fragments are pruned. Next, the position value `10,000,000` is considered, and any fragments that do not contain data for this position are pruned. Finally, the sample value `HG00096` is considered, and any fragments that cannot contain data for this sample are pruned. Visit the Arrays Key Concepts: Reads section for more details about the TileDB read algorithm.
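
As an illustration, the example query above might be expressed with the TileDB-VCF Python API roughly as follows; the dataset URI and attribute list are illustrative:

```python
import tiledbvcf

# Read variants intersecting chr1:10,000,000 for sample HG00096.
ds = tiledbvcf.Dataset("my_vcf_dataset", mode="r")
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end"],
    regions=["chr1:10000000-10000000"],
    samples=["HG00096"],
)
```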
Fragment overlap occurs when the non-empty domains (explained in the Arrays Key Concepts: Reads section) of two or more fragments overlap in a way that prevents the fragments from being pruned. Returning to the example above, consider the `sample` dimension and the `HG00096` query.
An example of overlapping fragments:
- Fragment 1: `sample` non-empty domain = (`AAA`, `ZZZ`)
- Fragment 2: `sample` non-empty domain = (`AAA`, `ZZZ`)
The sample value `HG00096` could be found in either fragment; therefore, neither fragment can be pruned.
An example of non-overlapping fragments:
- Fragment 1: `sample` non-empty domain = (`AAA`, `MMM`)
- Fragment 2: `sample` non-empty domain = (`NNN`, `ZZZ`)
With this improved fragment layout, the sample value `HG00096` can only be found in Fragment 1. Therefore, Fragment 2 is pruned from the query.
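
To check how fragments are laid out, the non-empty domains can be inspected with the core TileDB Python API. A small sketch follows; the `data` array URI is illustrative, and the `sample` dimension is assumed to be the third dimension, per the dimension order listed above:

```python
import tiledb

# Print the sample-dimension non-empty domain of each fragment in the data array.
fragments = tiledb.array_fragments("my_vcf_dataset/data")
for frag in fragments:
    # nonempty_domain is a tuple of (lower, upper) bounds, one per dimension.
    sample_domain = frag.nonempty_domain[2]
    print(frag.uri, sample_domain)
```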
As mentioned above, TileDB-VCF ingestion avoids fragment overlap in the `contig` dimension by limiting data in each fragment to one chromosome. Fragment overlap in the `start_pos` dimension is less of a concern, since each fragment contains data for an entire chromosome and overlap is expected. However, overlap in the `sample` dimension has more impact on read query performance. Expanding on the example above, if overlap in the `sample` dimension prevents a read query from pruning fragments that do not contain data for sample `HG00096`, then all overlapping fragments must be inspected to find data for `HG00096`.
The way to prevent fragments that overlap in the `sample` dimension is to create batches of samples with non-overlapping sample names: generate a list of the sample names to be ingested, sort the sample names lexicographically, and split the sorted list into sample batches, as shown in the sketch below. These steps are fully automated in the TileDB-Cloud-based scalable ingestion.
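
A minimal sketch of the batching step, assuming the sample names have already been collected (names and batch size are illustrative):

```python
def make_sample_batches(sample_names, batch_size=100):
    """Sort sample names lexicographically and split them into batches so that
    the sample-name ranges of different batches do not overlap."""
    ordered = sorted(sample_names)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

batches = make_sample_batches(["HG00187", "HG00096", "HG00171"], batch_size=2)
# -> [["HG00096", "HG00171"], ["HG00187"]]
```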
Anchor gap
As described in the data model, TileDB-VCF injects anchors into the data to break up long ranges and support rapid retrieval of interval intersections. The anchor gap defines the distance between anchors within variants that span long ranges. Consider the following tradeoffs between storage size and query performance when setting the size of the anchor gap:
- A larger anchor gap inserts fewer anchors, which reduces the additional storage size. However, each query region is expanded by a larger amount, potentially reading more unneeded data that must be filtered out and increasing query time.
- A smaller anchor gap inserts more anchors, which increases the additional storage size. However, each query region is expanded by a smaller amount, reducing reads of unneeded data and reducing the impact on query time.
The default anchor gap value works well with ranges in typical gVCF files. For variant data with much longer ranges, like CNV data, the anchor gap should be increased to 1,000,000.
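
In the TileDB-VCF Python API, the anchor gap is set when the dataset is created. A sketch follows, assuming the `anchor_gap` parameter of `create_dataset`; the URI is illustrative:

```python
import tiledbvcf

# Create a dataset intended for CNV data with very long ranges.
ds = tiledbvcf.Dataset("cnv_vcf_dataset", mode="w")
ds.create_dataset(anchor_gap=1_000_000)
```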
Consolidation requirements
Maintaining optimal read performance for large datasets with a large number of fragments in each array requires consolidation and vacuuming of commits, fragment metadata, and array fragments. Consolidation of array fragments is a complex operation that is best performed in a distributed manner.
When ingesting a large TileDB-VCF dataset, it is important to periodically consolidate the arrays in the dataset. This periodic consolidation maintains a low array open time, which is required by the resume ingestion feature.
A final requirement is that ingestion (writing new fragments to the arrays) must be paused while the arrays are being consolidated.
All consolidation requirements are fully automated in the TileDB-Cloud-based scalable ingestion.
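
For a manually managed dataset, the commit and fragment metadata steps can be sketched with the core TileDB Python API; the array URI is illustrative, and in practice this would be repeated for each array in the dataset:

```python
import tiledb

array_uri = "my_vcf_dataset/data"  # illustrative

# Consolidate and vacuum commits, then fragment metadata, for one array.
for mode in ("commits", "fragment_meta"):
    cfg = tiledb.Config({"sm.consolidation.mode": mode, "sm.vacuum.mode": mode})
    tiledb.consolidate(array_uri, config=cfg)
    tiledb.vacuum(array_uri, config=cfg)
```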
Ingestion on TileDB Cloud
Scalable ingestion on TileDB Cloud fully automates preparing VCF files to meet most prerequisites described in the Basic ingestion section, specifically:
- If a VCF file is not sorted or bgzipped, the VCF should be provided as a `*.vcf` file (not a gzipped `*.vcf.gz` file). The VCF file will be sorted and bgzipped on the fly using `bcftools`.
- If a VCF file is not indexed, it will be indexed automatically using `bcftools`.
For the final prerequisite, single-sample VCF files, TileDB Cloud provides a solution for efficiently splitting multi-sample VCF files into single-sample VCF files in a distributed fashion.
To address the issue of avoiding fragment overlaps, TileDB-VCF scalable ingestion runs the following steps:
- Given a URI and search pattern, or a URI with a list of VCF URIs, create a list of VCF URIs to ingest.
- For each VCF URI, read the sample name (see the sketch after this list) and check for a valid index file. Save this information in the `manifest` array.
- Read the list of sample names from the `manifest` array and sort the list lexicographically.
- Split the sorted list of sample names into batches (default size of 100 samples) to ensure no sample names overlap between the fragments created by each batch.
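
The sample name recorded in the manifest can be read directly from each VCF header, for example with `bcftools query -l`; the file name below is illustrative:

```python
import subprocess

# List the sample names declared in the VCF header (one per line;
# a single-sample VCF yields exactly one name).
result = subprocess.run(
    ["bcftools", "query", "-l", "HG00096.vcf.gz"],
    capture_output=True, text=True, check=True,
)
sample_name = result.stdout.strip()
```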
To meet the requirement of periodic consolidation, scalable ingestion creates a task graph with the following pattern:
1. Ingest N batches of samples in parallel (default 40). The number of parallel batches is limited by the request rate supported by the object store backend. For example, Amazon S3 is limited to 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix.
2. Consolidate and vacuum the commits and fragment metadata in all arrays.
3. While more sample batches remain to be ingested, return to step 1.
4. Consolidate and vacuum the fragments in the variant statistics arrays.
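
This pattern is driven by TileDB Cloud task graphs. Purely as an illustration of the control flow (not of the task graph API), the loop can be sketched as follows; the ingestion and consolidation callables are hypothetical placeholders for the earlier sketches:

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_with_periodic_consolidation(ingest_batch, consolidate_light, sample_batches, parallel=40):
    """Ingest up to `parallel` sample batches at a time, calling `consolidate_light`
    (commit and fragment metadata consolidation) between rounds.
    Both callables are hypothetical placeholders."""
    for i in range(0, len(sample_batches), parallel):
        window = sample_batches[i:i + parallel]
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            list(pool.map(ingest_batch, window))
        consolidate_light()
```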
Consolidation on TileDB Cloud
Consolidation is a key ingredient in providing a scalable TileDB-VCF dataset with optimized read performance. As described above, consolidation is fully integrated into the scalable ingestion solution. The consolidation of commits and fragment metadata is straightforward, since these are baseline consolidation operations. The consolidation of fragments in the variant statistics arrays is more complex and is described in variant statistics consolidation.
In general, consolidation of TileDB arrays is important to optimize read performance. Other TileDB arrays, like annotations, must be consolidated separately since they are not included in TileDB-VCF dataset consolidation.
A large majority of the data in a TileDB-VCF dataset is stored in the `data` array. Consolidation of the `data` array addresses the following goals:
- Optimize the array fragment sizes while maintaining the desired pruning characteristics of the array dimensions. This goal is achieved by creating groups of fragments with the same `contig` value and consolidating each group of fragments with a target size of 1 GiB per fragment (a basic sketch follows this list).
- Materialize deletions, with the option of either consolidating deleted cells into a new fragment to preserve time traveling or permanently purging the deletions to optimize storage and read performance.
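
For reference, a basic fragments-mode consolidation and vacuum of the `data` array can be sketched with the core TileDB Python API. Grouping fragments by `contig` is handled by the automated pipeline; the 1 GiB target is only hinted at here via the `sm.consolidation.max_fragment_size` setting, which is an assumption, as are the URI and size:

```python
import tiledb

data_uri = "my_vcf_dataset/data"  # illustrative

cfg = tiledb.Config({
    "sm.consolidation.mode": "fragments",
    "sm.vacuum.mode": "fragments",
    # Cap consolidated fragments at roughly 1 GiB (assumed config parameter).
    "sm.consolidation.max_fragment_size": str(1024**3),
})
tiledb.consolidate(data_uri, config=cfg)
tiledb.vacuum(data_uri, config=cfg)
```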
Similar to the variant statistics arrays, consolidation of the `data` array is fully automated and distributed on TileDB Cloud to reduce the latency of consolidation.