Variant Statistics

Variant statistics provided by TileDB-VCF.

TileDB-VCF efficiently captures multiple types of variant statistics during ingestion, for example, internal allele frequency and allele counts at the dataset level and variant type counts at the sample level. This section describes the lifecycle of these statistics: how they are ingested, stored, aggregated, appended, consolidated, and deleted.

Methods for accessing the variant statistics are described in the advanced variant statistics tutorial.

Ingestion, storage, and aggregation

During ingestion, TileDB-VCF reads every record of every VCF file added to the dataset. Because each record is already in memory at that point, TileDB-VCF calculates a number of statistics efficiently in the same pass. The statistics currently captured are Internal Allele Frequency, Allele Count, and Sample Statistics.

Internal allele frequency

Internal Allele Frequency (IAF) refers to the absolute count and relative abundance of alleles observed at specific positions in the genome within a cohort of samples. In conjunction with global minor allele frequency information like that provided by gnomAD, IAF can be a valuable tool for variant prioritization. IAF is calculated by dividing the allele count (AC), the number of times a specific allele occurs at a given locus, by the allele number (AN), the number of alleles observed in the dataset at the same given locus (\(IAF = AC / AN\)). IAF is calculated for every unique chrom-pos-allele in the dataset.

During ingestion, TileDB-VCF inspects every value of CHROM, POS, REF, ALT, and GT and calculates the AC and AN for every locus and allele. The values of AC and AN are stored in the variant_stats array. Because ingestion is distributed across multiple sample batches, which run on different compute nodes at different times, the AC and AN values calculated on one compute node are only partial sums of the values for the entire dataset. Each compute node stores its partial results in a fragment of the variant_stats array. To compute the final AC and AN for a locus, TileDB-VCF aggregates these partial sums into the totals from which IAF is derived.
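The aggregation of partial sums described above can be sketched in a few lines of Python. This is an illustration only, not the TileDB-VCF implementation: the row layout `(chrom, pos, allele, partial_ac)` and the function name `aggregate_iaf` are hypothetical, and the sketch assumes that AN at a locus equals the sum of the AC values of all alleles (including ref) at that locus.

```python
from collections import defaultdict

def aggregate_iaf(fragments):
    """Sum partial allele counts across fragments and derive IAF = AC / AN.

    Each fragment is a list of hypothetical rows: (chrom, pos, allele, partial_ac).
    Assumption: AN at a locus is the sum of AC over all alleles at that locus.
    """
    ac = defaultdict(int)  # (chrom, pos, allele) -> total allele count
    an = defaultdict(int)  # (chrom, pos) -> total allele number
    for fragment in fragments:
        for chrom, pos, allele, partial_ac in fragment:
            ac[(chrom, pos, allele)] += partial_ac
            an[(chrom, pos)] += partial_ac
    # Skip loci whose total allele number is zero to avoid dividing by zero.
    return {key: count / an[key[:2]] for key, count in ac.items() if an[key[:2]] > 0}

# Two fragments written by different (hypothetical) ingestion nodes.
fragment_a = [("chr1", 99, "ref", 3), ("chr1", 99, "T", 1)]
fragment_b = [("chr1", 99, "ref", 2), ("chr1", 99, "T", 2)]

iaf = aggregate_iaf([fragment_a, fragment_b])
print(iaf[("chr1", 99, "T")])  # AC = 3, AN = 8
```

Neither fragment alone yields the correct frequency for the T allele (1/4 and 2/4, respectively); only the aggregated totals give 3/8.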

For the IAF calculation, alleles are normalized so they are counted consistently and correctly. For example, the counts for called REF alleles are stored under an allele value of ref in the variant_stats array, because the actual value of REF at a locus can differ depending on the type of variant at the locus (SNV, insertion, or deletion). This normalization of allele values is the main difference from the allele count described in the next section.
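To illustrate why the REF allele is normalized, consider a locus where one sample carries an SNV (REF `A`) and another carries a deletion (REF `AT`). The sketch below counts alleles from GT strings and files every called reference allele under `ref`. The function name and the choice to label ALT alleles by their raw string are assumptions made for this example; the source only specifies the `ref` label.

```python
import re
from collections import Counter

def count_normalized_alleles(ref, alts, gt):
    """Count the alleles called in one GT string, labelling REF calls as 'ref'.

    `ref` is accepted only to show that its raw value does not affect the label.
    Assumption: ALT alleles keep their raw string as the label.
    """
    counts = Counter()
    for index in re.split(r"[/|]", gt):
        if index == ".":
            continue  # a missing call contributes to neither AC nor AN
        i = int(index)
        counts["ref" if i == 0 else alts[i - 1]] += 1
    return counts

# Three hypothetical records at the same position: (REF, ALT list, GT).
records = [("A", ["T"], "0/1"), ("AT", ["A"], "0/1"), ("A", ["T"], "./.")]

locus_counts = Counter()
for ref, alts, gt in records:
    locus_counts.update(count_normalized_alleles(ref, alts, gt))

allele_number = sum(locus_counts.values())
print(locus_counts["ref"], allele_number)
```

Without normalization, the two reference calls would be split between `A` and `AT`; with it, both are counted under `ref`.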

Allele count

The allele_count array provides a count of unique chrom-pos-ref-alt variants in the dataset. This array can be used to generate a comprehensive list of all variants observed in a TileDB-VCF dataset. This allele dump can be used, for instance, to generate a sample-less, variant-only VCF for downstream annotation. The allele counts can optionally be grouped by FILTER and GT values for a more granular analysis. In the allele_count array, the raw REF and ALT values are counted, as opposed to the normalized allele values used in the IAF calculation described above.

During ingestion, TileDB-VCF inspects every value of CHROM, POS, REF, ALT, FILTER, and GT and counts the number of unique values seen. As with IAF, each compute node in the distributed ingestion calculates a partial sum of the values needed to calculate the total allele count at each locus for the dataset. Each compute node stores its partial results in fragments of the allele_count array, and the partial sums are aggregated to generate the final allele count values.
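A minimal sketch of this counting and merging, using Python's `collections.Counter`, is shown below. The record layout, the function names, and the choice to count one observation per record are assumptions made for illustration; they do not reflect the actual allele_count schema.

```python
from collections import Counter

def count_alleles(records, group_by_filter_gt=False):
    """Count raw chrom-pos-ref-alt observations, optionally split by FILTER and GT."""
    counts = Counter()
    for chrom, pos, ref, alt, flt, gt in records:
        key = (chrom, pos, ref, alt)
        if group_by_filter_gt:
            key += (flt, gt)
        counts[key] += 1
    return counts

def merge_partials(partials):
    """Aggregate the partial counts produced by separate ingestion nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total

# Hypothetical records seen by two ingestion nodes.
node_1 = [("chr1", 99, "A", "T", "PASS", "0/1"), ("chr1", 99, "A", "T", "PASS", "1/1")]
node_2 = [("chr1", 99, "A", "T", "PASS", "0/1"), ("chr1", 149, "G", "GA", "LowQual", "0/1")]

totals = merge_partials(count_alleles(r) for r in (node_1, node_2))
grouped = merge_partials(count_alleles(r, group_by_filter_gt=True) for r in (node_1, node_2))

# The sorted keys form a sample-less list of variants.
variant_list = sorted(totals)
```

The keys of `totals` are the kind of variant-only listing that could feed a downstream annotation step, while `grouped` keeps the FILTER and GT breakdown.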

Sample statistics

The sample statistics calculated by TileDB-VCF are similar to the statistics provided by Hail’s sample_qc and bcftools stats. For each sample, the following summary statistics are provided:

Read Depth: min, max, mean, and stddev
Genotype Quality: min, max, mean, and stddev
Call counts: called, not_called
Zygosity counts: hom_ref, het, hom_var
Variant type counts: non_ref, singleton, snp, insertion, deletion, transition, transversion, star
Rates: call_rate, ti_tv, het_hom_var, insertion_deletion
Record counts: records, multiallelic records

During ingestion, TileDB-VCF inspects every value of REF, ALT, GT, DP, and GQ to accumulate the partial values needed for the sample statistics. All values for one sample are processed by one compute node. However, each thread of an ingestion compute node processes only a portion of a sample, so partial sums are still required; they are stored in fragments of the sample_stats array. As with the other arrays discussed above, these partial sums are aggregated to generate the final sample statistic values.
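Statistics such as min, max, mean, and stddev can be rebuilt from per-thread partial values as long as each thread keeps a count, a sum, a sum of squares, and the running extremes. The sketch below shows one way to do this for read depth. The `Partial` class, its field names, and the use of the population standard deviation are assumptions for illustration; the source does not state which accumulators TileDB-VCF stores.

```python
import math
from dataclasses import dataclass

@dataclass
class Partial:
    """Hypothetical per-thread accumulator for one numeric field (for example DP)."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def merge(self, other):
        """Combine two partial results; the order of merging does not matter."""
        return Partial(self.n + other.n, self.total + other.total,
                       self.total_sq + other.total_sq,
                       min(self.lo, other.lo), max(self.hi, other.hi))

    def finalize(self):
        if self.n == 0:
            return None  # no values were observed
        mean = self.total / self.n
        # Population variance from the sums; clamp tiny negative rounding errors.
        variance = max(self.total_sq / self.n - mean * mean, 0.0)
        return {"min": self.lo, "max": self.hi, "mean": mean, "stddev": math.sqrt(variance)}

# Two threads each process part of the same sample.
thread_1, thread_2 = Partial(), Partial()
for dp in (10, 20):
    thread_1.add(dp)
for dp in (30, 40):
    thread_2.add(dp)

dp_stats = thread_1.merge(thread_2).finalize()
```

The sum-of-squares formula is simple to merge but can lose precision for large values; a production implementation might prefer a numerically stable merge such as the pairwise update of mean and M2.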

Note

Appending samples to the variant statistic arrays is analogous to ingesting an additional batch of samples. The statistics are written to new fragments in the array, which are included in the aggregation of the final statistics.

Warning

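The allele_count and variant_stats arrays use 0-based indexing to match the 0-based indexing in the data array. VCF files use 1-based POS values, so account for this offset when comparing positions in these arrays with positions in a VCF file.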

Consolidation

Ingesting a large dataset produces a large number of small fragments in the variant_stats and allele_count arrays. To optimize the read performance of these arrays, the small fragments are consolidated into larger fragments during distributed scalable ingestion on TileDB Cloud. First, the fragments are grouped by the value of their first dimension (CHROM); then each group of fragments is consolidated with a target size of 1 GiB per fragment. The consolidation of each group runs on a separate compute node to reduce consolidation latency.

This consolidation strategy optimizes the read performance in the following ways:

Queries can ignore fragments with a different value of the first dimension, which will be a large percentage of fragments.
Values to be aggregated will have good spatial locality in the same fragment, which will reduce the time required to read the values to be aggregated.
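The grouping step can be pictured as a simple planning function: bucket the fragments by contig, then pack each bucket into batches that stay near the 1 GiB target. The sketch below is a greedy illustration under assumed inputs (`(name, chrom, size_bytes)` tuples and the function name `plan_consolidation` are hypothetical); it is not the planner used by TileDB Cloud.

```python
from collections import defaultdict

GIB = 1024 ** 3
MIB = 1024 ** 2

def plan_consolidation(fragments, target_bytes=GIB):
    """Group fragments by contig, then greedily pack each group into batches.

    Returns {chrom: [[fragment names], ...]}; each inner list would be
    consolidated into one larger fragment of roughly `target_bytes`.
    """
    by_chrom = defaultdict(list)
    for name, chrom, size in fragments:
        by_chrom[chrom].append((name, size))

    plan = {}
    for chrom, frags in by_chrom.items():
        batches, current, current_size = [], [], 0
        for name, size in frags:
            # Start a new batch when adding this fragment would exceed the target.
            if current and current_size + size > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            batches.append(current)
        plan[chrom] = batches
    return plan

# Hypothetical fragment listing: (name, first-dimension value, size in bytes).
fragments = [
    ("f1", "chr1", 600 * MIB),
    ("f2", "chr2", 100 * MIB),
    ("f3", "chr1", 300 * MIB),
    ("f4", "chr1", 200 * MIB),
]
plan = plan_consolidation(fragments)
```

Because each contig's batches are independent of every other contig's, each entry of `plan` could be handed to a separate worker, which mirrors the per-group parallelism described above.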

Deletion

When a sample is deleted from the dataset, the variant statistics must be adjusted to reflect the deletion. TileDB-VCF handles this by reusing the aggregation mechanism already in place.

The process of deleting a sample includes adding negative counts to the variant_stats and allele_count arrays. When an aggregated statistic is calculated, these negative counts reduce the aggregated counts to the correct value, as if the sample had never been ingested.
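The effect of negative counts can be shown with a small sketch. The fragment layout (`(key, count)` pairs) and the helper names are hypothetical, and dropping keys whose total reaches zero is an assumption of this example rather than documented behavior.

```python
from collections import defaultdict

def aggregate(fragments):
    """Sum the counts for each key across all fragments, including negative ones."""
    totals = defaultdict(int)
    for fragment in fragments:
        for key, count in fragment:
            totals[key] += count
    # Assumption: keys whose counts cancel out completely are omitted.
    return {key: count for key, count in totals.items() if count != 0}

def deletion_fragment(sample_contribution):
    """Build the fragment that cancels a sample's earlier contribution."""
    return [(key, -count) for key, count in sample_contribution]

# Hypothetical per-sample contributions to the allele counts at one locus.
sample_a = [(("chr1", 99, "T"), 1), (("chr1", 99, "ref"), 1)]
sample_b = [(("chr1", 99, "ref"), 2)]

after_delete = aggregate([sample_a, sample_b, deletion_fragment(sample_a)])
never_ingested = aggregate([sample_b])
```

Here `after_delete` and `never_ingested` are equal, so no existing fragment has to be rewritten to remove a sample; the deletion is just one more set of values to aggregate.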

For the sample_stats array, deletion is more straightforward: all statistics for the deleted sample are removed.