Variant Statistics
TileDB-VCF efficiently captures multiple types of variant statistics during ingestion, for example internal allele frequency and allele counts at the dataset level and variant type counts at the sample level. This section describes the lifecycle of these statistics, including how they are ingested, stored, aggregated, appended, consolidated, and deleted.
Methods for accessing the variant statistics are described in the advanced variant statistics tutorial.
Ingestion, storage, and aggregation
During ingestion, TileDB-VCF reads all records of every VCF file ingested into the dataset. Because every record is touched during ingestion, a number of statistics can be calculated efficiently while the variant data is in memory. The current set of statistics captured includes Internal Allele Frequency, Allele Count, and Sample Statistics.
Internal allele frequency
Internal Allele Frequency (IAF) refers to the absolute count and relative abundance of alleles observed at specific positions in the genome within a cohort of samples. In conjunction with global minor allele frequency information like that provided by gnomAD, IAF can be a valuable tool for variant prioritization. IAF is calculated by dividing the allele count (AC), the number of times a specific allele occurs at a given locus, by the allele number (AN), the number of alleles observed in the dataset at the same locus (\(IAF = AC / AN\)). IAF is calculated for every unique chrom-pos-allele in the dataset.
During ingestion, TileDB-VCF inspects every value of CHROM, POS, REF, ALT, and GT and calculates the AC and AN for every locus and allele. The values of AC and AN are stored in the variant_stats array. Since ingestion is distributed across multiple sample batches, which run on different compute nodes at different times, the values of AC and AN calculated on one compute node are a partial sum of the values needed to compute AC and AN for the entire dataset. Each compute node stores the partial results in a fragment of the variant_stats array. When computing the final AC and AN values for a locus in the dataset, these partial sums are aggregated to generate the total values required to generate IAF.
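The aggregation of partial sums can be illustrated with a minimal Python sketch. It does not use the TileDB-VCF API: the row layout `(chrom, pos, allele, partial_ac)` is a hypothetical stand-in for the contents of variant_stats fragments, and AN is derived here as the sum of AC over all alleles at a locus, which is an assumption made to keep the example small.

```python
from collections import defaultdict


def aggregate_iaf(partial_rows):
    """Sum partial allele counts and derive AN and IAF per allele.

    partial_rows: iterable of (chrom, pos, allele, partial_ac) tuples.
    This layout is hypothetical, not the real variant_stats schema.
    """
    # Total AC per (chrom, pos, allele) across all fragments.
    ac = defaultdict(int)
    for chrom, pos, allele, partial_ac in partial_rows:
        ac[(chrom, pos, allele)] += partial_ac

    # Assumption: AN is the sum of AC over every allele at the locus.
    an = defaultdict(int)
    for (chrom, pos, _allele), total in ac.items():
        an[(chrom, pos)] += total

    return {
        key: {"ac": total, "an": an[key[:2]], "iaf": total / an[key[:2]]}
        for key, total in ac.items()
        if an[key[:2]] > 0
    }


# Partial results written by two hypothetical compute nodes.
node_a = [("chr1", 99, "ref", 3), ("chr1", 99, "A", 1)]
node_b = [("chr1", 99, "ref", 2), ("chr1", 99, "A", 2)]
stats = aggregate_iaf(node_a + node_b)
```

Because the aggregation is a plain sum, the order in which fragments are read does not affect the result.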
For the IAF calculation, alleles are normalized so they are counted consistently and correctly. For example, the counts for called REF alleles are stored in an allele specified as ref in the variant_stats array because the actual values of REF at a locus can differ depending on the type of variant at the locus (SNV, insertion, or deletion). This normalization of allele values is the main difference compared to the allele count described in the next section.
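As a rough illustration of why the ref normalization matters, the sketch below counts called alleles for two records at the same position whose REF strings differ. The function name `count_alleles` and the choice to key ALT alleles by their raw ALT string are assumptions for this example; only the use of a single ref key for called REF alleles comes from the description above.

```python
from collections import Counter


def count_alleles(chrom, pos, alts, gt):
    """Count called alleles for one genotype, keyed by normalized allele.

    alts: list of ALT strings; gt: list of allele indexes (None = no call).
    The record's REF string is deliberately not used: index 0 always maps
    to the shared key "ref", whatever the REF string is.
    """
    alleles = ["ref"] + list(alts)
    counts = Counter()
    for index in gt:
        if index is None:  # skip missing calls such as "./."
            continue
        counts[(chrom, pos, alleles[index])] += 1
    return counts


# An SNV record (REF=A, ALT=G, GT=0/1) and a deletion record
# (REF=AT, ALT=A, GT=0/0) at the same position.
snv = count_alleles("chr1", 100, ["G"], [0, 1])
deletion = count_alleles("chr1", 100, ["A"], [0, 0])
merged = snv + deletion
```

Without the shared ref key, the REF calls from the two records would be split between the keys A and AT even though both represent the reference allele at that locus.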
Allele count
The allele_count array provides a count of unique chrom-pos-ref-alt variants in the dataset. This array can be used to generate comprehensive lists of all variants observed in a TileDB-VCF dataset. This allele dump can be used, for instance, to generate a sample-less, variant-only VCF for downstream annotation. The allele counts can be optionally grouped by FILTER and GT values for more granularity in the analysis of the allele counts. In the allele_count array, the raw REF and ALT values are counted, as opposed to the normalized allele values in the IAF calculation described above.
During ingestion, TileDB-VCF inspects every value of CHROM, POS, REF, ALT, FILTER, and GT and counts the number of unique values seen. Similar to IAF, each compute node in the distributed scalable ingestion calculates a partial sum of the values needed to calculate the total allele count at each locus for the dataset. Again, each compute node stores these partial results in a fragment of the allele_count array, and the partial sums are aggregated to generate the final allele count values.
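The counting and optional grouping described above can be sketched in plain Python. The record tuples and the function names `count_variants` and `collapse` are hypothetical and do not reflect the allele_count schema or the TileDB-VCF API; the point is only that raw REF and ALT values form the key and that partial counts combine by addition.

```python
from collections import Counter


def count_variants(records):
    """Count records per (chrom, pos, ref, alt, filter, gt) key."""
    counts = Counter()
    for chrom, pos, ref, alt, filt, gt in records:
        counts[(chrom, pos, ref, alt, filt, gt)] += 1
    return counts


def collapse(counts):
    """Drop FILTER and GT to get one count per chrom-pos-ref-alt."""
    totals = Counter()
    for (chrom, pos, ref, alt, _filt, _gt), n in counts.items():
        totals[(chrom, pos, ref, alt)] += n
    return totals


# Partial counts from two hypothetical compute nodes.
node_a = count_variants([
    ("chr1", 99, "A", "G", "PASS", "0/1"),
    ("chr1", 99, "A", "G", "PASS", "1/1"),
])
node_b = count_variants([
    ("chr1", 99, "A", "G", "LowQual", "0/1"),
    ("chr1", 149, "AT", "A", "PASS", "0/1"),
])
grouped = node_a + node_b      # aggregated, still grouped by FILTER and GT
ungrouped = collapse(grouped)  # one total per raw chrom-pos-ref-alt
```

The keys of `ungrouped` are the kind of variant list that a sample-less, variant-only VCF could be built from.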
Sample statistics
The sample statistics calculated by TileDB-VCF are similar to the statistics provided by Hail’s sample_qc and bcftools stats. For each sample, the following summary statistics are provided:
- Read Depth: min, max, mean, and stddev
- Genotype Quality: min, max, mean, and stddev
- Call counts: called, not_called
- Zygosity counts: hom_ref, het, hom_var
- Variant type counts: non_ref, singleton, snp, insertion, deletion, transition, transversion, star
- Rates: call_rate, ti_tv, het_hom_var, insertion_deletion
- Record counts: records, multiallelic records
During ingestion, TileDB-VCF inspects every value of REF, ALT, GT, DP, and GQ to compute the partial values needed to compute the sample statistics. All values for one sample are processed by one compute node. However, each thread of an ingestion compute node will process a portion of a sample. Therefore, partial sums are still required and are stored in fragments in the sample_stats array. Similar to the other arrays discussed above, these partial sums are aggregated to generate the final sample statistic values.
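One way to see how min, max, mean, and stddev can be recovered from per-thread partial values is the sketch below. It assumes each partial keeps a count, a sum, a sum of squares, a minimum, and a maximum, and that stddev is the population standard deviation; the fields actually stored in sample_stats may differ, and the `Partial` class is purely illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable partial summary of one numeric field such as DP or GQ."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, x):
        """Fold one observed value into this partial."""
        self.n += 1
        self.total += x
        self.total_sq += x * x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def merge(self, other):
        """Combine two partials into a new one without the raw values."""
        return Partial(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    def summary(self):
        """Final statistics; assumes at least one value was added."""
        mean = self.total / self.n
        # Population variance; clamp tiny negative rounding errors to zero.
        variance = max(0.0, self.total_sq / self.n - mean * mean)
        return {"min": self.lo, "max": self.hi, "mean": mean,
                "stddev": math.sqrt(variance)}


# Two threads each process a portion of the same sample's DP values.
thread_1, thread_2 = Partial(), Partial()
for dp in (10, 20):
    thread_1.add(dp)
for dp in (30, 40):
    thread_2.add(dp)
dp_stats = thread_1.merge(thread_2).summary()
```

Counts such as called or het merge by simple addition, and rates such as call_rate or ti_tv would be computed from the merged counts rather than stored per partial.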
Appending samples to the variant statistics arrays is analogous to ingesting an additional batch of samples. The statistics are written to new fragments in each array, and those fragments are included in the aggregation of the final statistics.
The allele_count and variant_stats arrays use 0-based indexing to match the 0-based indexing in the data array.
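Since the POS column of a VCF file is 1-based, reading these arrays directly involves an offset of one. The helpers below are hypothetical and only show the arithmetic; higher-level query APIs may already perform this conversion.

```python
def vcf_pos_to_array_pos(pos_1based):
    """Convert a 1-based VCF POS to a 0-based array coordinate."""
    if pos_1based < 1:
        raise ValueError("VCF positions start at 1")
    return pos_1based - 1


def region_to_array_range(start_1based, end_1based):
    """Convert a 1-based inclusive region to a 0-based inclusive range."""
    return vcf_pos_to_array_pos(start_1based), vcf_pos_to_array_pos(end_1based)


array_range = region_to_array_range(100, 200)
```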
Consolidation
Ingesting a large dataset will result in a large number of small fragments in the variant_stats and allele_count arrays. To optimize the read performance of these arrays, the small fragments are consolidated into larger fragments during distributed scalable ingestion on TileDB Cloud. First, the fragments are grouped by the value of their first dimension (CHROM), then each group of fragments is consolidated with a target size of 1 GiB per fragment. The consolidation of each group runs on a separate compute node to reduce latency of consolidation.
This consolidation strategy optimizes the read performance in the following ways:
- Queries can ignore fragments with a different value of the first dimension, which will be a large percentage of fragments.
- Values to be aggregated will have good spatial locality in the same fragment, which will reduce the time required to read the values to be aggregated.
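The grouping step can be sketched as a small planning function. This is not the TileDB Cloud implementation: the fragment tuples `(name, chrom, size_bytes)`, the function name `plan_consolidation`, and the greedy packing rule are all assumptions chosen to illustrate grouping by CHROM with a 1 GiB target.

```python
from collections import defaultdict

MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_consolidation(fragments, target_bytes=GIB):
    """Group fragments by chrom, then pack each group into batches.

    fragments: iterable of (name, chrom, size_bytes) tuples (hypothetical).
    Returns {chrom: [[fragment names in batch 1], [batch 2], ...]}.
    """
    by_chrom = defaultdict(list)
    for name, chrom, size in fragments:
        by_chrom[chrom].append((name, size))

    plan = {}
    for chrom, group in by_chrom.items():
        batches, current, current_size = [], [], 0
        for name, size in group:
            # Start a new batch when adding this fragment would exceed
            # the target size of the consolidated fragment.
            if current and current_size + size > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            batches.append(current)
        plan[chrom] = batches
    return plan


plan = plan_consolidation([
    ("f1", "chr1", 600 * MIB),
    ("f2", "chr1", 600 * MIB),
    ("f3", "chr1", 100 * MIB),
    ("f4", "chr2", 50 * MIB),
])
```

Each entry of `plan` is independent of the others, which is what allows every chromosome group to be consolidated on its own compute node.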
Deletion
When a sample is deleted from the dataset, the variant statistics must be adjusted to reflect the deletion. TileDB-VCF handles this by reusing the aggregation mechanism already in place.
The process of deleting a sample includes adding negative counts to the variant_stats and allele_count arrays. When an aggregated statistic is calculated, these negative counts reduce the aggregated counts to the correct value, as if the sample had never been ingested.
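The effect of negative counts on aggregation can be shown with a short sketch. The `(key, count)` row layout and the function names are hypothetical; the example only demonstrates that summing a sample's original rows with their negated copies cancels that sample's contribution.

```python
from collections import defaultdict


def aggregate(rows):
    """Sum counts per key and drop keys whose total is zero."""
    totals = defaultdict(int)
    for key, count in rows:
        totals[key] += count
    return {key: total for key, total in totals.items() if total != 0}


def deletion_rows(sample_rows):
    """Rows written on deletion: the sample's counts, negated."""
    return [(key, -count) for key, count in sample_rows]


sample_a = [(("chr1", 99, "ref"), 1), (("chr1", 99, "A"), 1)]
sample_b = [(("chr1", 99, "ref"), 2)]

before = aggregate(sample_a + sample_b)
after = aggregate(sample_a + sample_b + deletion_rows(sample_a))
```

Here `after` equals the aggregate of `sample_b` alone, so existing fragments never need to be rewritten to remove a sample.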
For the sample_stats array, deletion is more straightforward: all statistics for the deleted sample are removed.