Variant Statistics

Variant statistics provided by TileDB-VCF.

TileDB-VCF captures multiple types of variant statistics during ingestion (for example, internal allele frequency and allele counts at the dataset level, and variant type counts at the sample level). This section describes the lifecycle of these statistics, including how they are ingested, stored, aggregated, appended, consolidated, and deleted.

Methods for accessing the variant statistics are described in the advanced variant statistics tutorial.

Ingestion, storage, and aggregation

During ingestion, TileDB-VCF reads all records of every VCF file ingested into the dataset. Because every record is touched during ingestion, TileDB-VCF calculates a number of statistics efficiently while the variant data is already in memory. The current set of statistics captured includes Internal Allele Frequency, Allele Count, and Sample Statistics.

Internal allele frequency

Internal Allele Frequency (IAF) refers to the absolute count and relative abundance of alleles observed at specific positions in the genome within a cohort of samples. In conjunction with global minor allele frequency information like that provided by gnomAD, IAF can be a valuable tool for variant prioritization. IAF is calculated by dividing the allele count (AC), the number of times a specific allele occurs at a given locus, by the allele number (AN), the number of alleles observed in the dataset at the same given locus (\(IAF = AC / AN\)). IAF is calculated for every unique chrom-pos-allele in the dataset.

During ingestion, TileDB-VCF inspects every value of CHROM, POS, REF, ALT, and GT and calculates the AC and AN for every locus and allele. The values of AC and AN are stored in the variant_stats array. Since ingestion is distributed across multiple sample batches, which run on different compute nodes at different times, the values of AC and AN calculated on one compute node are a partial sum of the values needed to compute AC and AN for the entire dataset. Each compute node stores the partial results in a fragment of the variant_stats array. When computing the final AC and AN values for a locus in the dataset, these partial sums are aggregated to generate the total values required to generate IAF.

For the IAF calculation, alleles are normalized so they are counted consistently and correctly. For example, the counts for called REF alleles are stored under an allele labeled ref in the variant_stats array, because the actual value of REF at a locus can differ depending on the type of variant at the locus (SNV, insertion, or deletion). This normalization of allele values is the main difference compared to the allele count described in the next section.
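The partial-sum aggregation described above can be sketched in plain Python. This is a minimal illustration, not the TileDB-VCF implementation: the per-fragment layout (a dict with hypothetical "ac" and "an" entries) and the function name aggregate_iaf are assumptions made for the example.

```python
from collections import Counter

def aggregate_iaf(fragments):
    """Sum partial AC/AN values across fragments, then compute IAF = AC / AN.

    Each fragment is a hypothetical dict (not the real array layout):
      "ac": {(chrom, pos, allele): partial allele count}
      "an": {(chrom, pos): partial allele number}
    """
    ac_total = Counter()
    an_total = Counter()
    for frag in fragments:
        ac_total.update(frag["ac"])  # update() adds counts key by key
        an_total.update(frag["an"])
    return {
        key: ac / an_total[key[:2]]
        for key, ac in ac_total.items()
        if an_total[key[:2]] > 0
    }

# Two ingestion nodes each saw part of the cohort at chr1:100.
# Called REF alleles are stored under the normalized label "ref".
fragment_a = {"ac": {("chr1", 100, "ref"): 3, ("chr1", 100, "T"): 1},
              "an": {("chr1", 100): 4}}
fragment_b = {"ac": {("chr1", 100, "ref"): 2, ("chr1", 100, "T"): 2},
              "an": {("chr1", 100): 4}}

iaf = aggregate_iaf([fragment_a, fragment_b])
print(iaf[("chr1", 100, "T")])  # AC = 3, AN = 8
```

Neither fragment alone yields the dataset-level frequency; only the summed AC and AN do, which is why the partial results are aggregated at read time.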

Allele count

The allele_count array provides a count of unique chrom-pos-ref-alt variants in the dataset. This array can be used to generate comprehensive lists of all variants observed in a TileDB-VCF dataset. This allele dump can be used, for instance, to generate a sample-less, variant-only VCF for downstream annotation. The allele counts can be optionally grouped by FILTER and GT values for more granularity in the analysis of the allele counts. In the allele_count array, the raw REF and ALT values are counted, as opposed to the normalized allele values in the IAF calculation described above.

During ingestion, TileDB-VCF inspects every value of CHROM, POS, REF, ALT, FILTER, and GT and counts the number of unique values seen. As with IAF, each compute node in the distributed scalable ingestion calculates a partial sum of the values needed to calculate the total allele count at each locus for the dataset. Each compute node stores these partial results in the allele_count array, and the partial sums are aggregated to generate the final allele count values.
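As a rough sketch of the counting and optional grouping described in this section, the following Python counts raw chrom-pos-ref-alt keys and optionally extends the key with FILTER or GT. The record dicts, the field names, and the count_alleles function are hypothetical, and the sketch counts one per record, which is a simplification rather than a statement about how TileDB-VCF weighs genotypes.

```python
from collections import Counter

def count_alleles(records, group_by=()):
    """Count raw (chrom, pos, ref, alt) keys, optionally grouped by extra fields."""
    counts = Counter()
    for rec in records:
        key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        key += tuple(rec[field] for field in group_by)
        counts[key] += 1
    return counts

# Hypothetical records seen by two different ingestion nodes.
node_1 = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "filter": "PASS", "gt": "0/1"},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "filter": "PASS", "gt": "1/1"},
]
node_2 = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "filter": "LowQual", "gt": "0/1"},
]

# Each node produces a partial count; adding them gives the dataset total.
total = count_alleles(node_1) + count_alleles(node_2)
by_filter = count_alleles(node_1, ("filter",)) + count_alleles(node_2, ("filter",))

print(total)
print(by_filter)
```

Listing the keys of the ungrouped total gives the kind of sample-less, variant-only list mentioned above.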

Sample statistics

The sample statistics calculated by TileDB-VCF are similar to the statistics provided by Hail’s sample_qc and bcftools stats. For each sample, the following summary statistics are provided:

  • Read Depth: min, max, mean, and stddev
  • Genotype Quality: min, max, mean, and stddev
  • Call counts: called, not_called
  • Zygosity counts: hom_ref, het, hom_var
  • Variant type counts: non_ref, singleton, snp, insertion, deletion, transition, transversion, star
  • Rates: call_rate, ti_tv, het_hom_var, insertion_deletion
  • Record counts: records, multiallelic records

During ingestion, TileDB-VCF inspects every value of REF, ALT, GT, DP, and GQ to compute the partial values needed to compute the sample statistics. All values for one sample are processed by one compute node. However, each thread of an ingestion compute node will process a portion of a sample. Therefore, partial sums are still required and are stored in fragments in the sample_stats array. Similar to the other arrays discussed above, these partial sums are aggregated to generate the final sample statistic values.
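One way to see why min, max, mean, and stddev can be rebuilt from per-thread partial values is the sketch below. It keeps a count, a sum, a sum of squares, and running extremes per thread, then merges them. The Partial class is hypothetical, and the use of population standard deviation is an assumption; the source does not state which variant TileDB-VCF reports.

```python
import math
from dataclasses import dataclass

@dataclass
class Partial:
    """Mergeable partial statistics for one field (for example DP) of one sample."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def merge(self, other):
        return Partial(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    def summary(self):
        mean = self.total / self.n
        # Population variance (assumption); clamp tiny negative rounding errors.
        variance = max(self.total_sq / self.n - mean * mean, 0.0)
        return {"min": self.lo, "max": self.hi, "mean": mean, "stddev": math.sqrt(variance)}

# Two threads each process a portion of the same sample's read depths.
thread_1, thread_2 = Partial(), Partial()
for dp in (10, 20):
    thread_1.add(dp)
for dp in (30, 40):
    thread_2.add(dp)

stats = thread_1.merge(thread_2).summary()
print(stats)
```

Counts such as called, het, or snp merge by simple addition, and rates such as call_rate or ti_tv can be derived from the merged counts afterwards.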

Note

Appending samples to the variant statistic arrays is analogous to ingesting an additional batch of samples. The statistics are written to new fragments in the array, which are included in the aggregation of the final statistics.

Warning

The allele_count and variant_stats arrays use 0-based indexing to match the 0-based indexing in the data array.

Consolidation

Ingesting a large dataset will result in a large number of small fragments in the variant_stats and allele_count arrays. To optimize the read performance of these arrays, the small fragments are consolidated into larger fragments during distributed scalable ingestion on TileDB Cloud. First, the fragments are grouped by the value of their first dimension (CHROM), then each group of fragments is consolidated with a target size of 1 GiB per fragment. The consolidation of each group runs on a separate compute node to reduce latency of consolidation.

This consolidation strategy optimizes the read performance in the following ways:

  • Queries can ignore fragments with a different value of the first dimension, which will be a large percentage of fragments.
  • Values to be aggregated will have good spatial locality in the same fragment, which will reduce the time required to read the values to be aggregated.
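The grouping step can be illustrated with a small planning sketch. It groups fragments by CHROM and then packs each group into batches that stay near a 1 GiB target. The greedy packing, the plan_consolidation name, and the (name, chrom, size) tuples are assumptions for illustration; the source does not describe how TileDB-VCF chooses which fragments to merge within a group.

```python
from collections import defaultdict

GIB = 1024 ** 3
MIB = 1024 ** 2

def plan_consolidation(fragments, target_bytes=GIB):
    """Group fragments by chrom, then greedily batch them up to target_bytes.

    fragments: iterable of (name, chrom, size_in_bytes) tuples (hypothetical).
    Returns {chrom: [[fragment names in batch], ...]}.
    """
    by_chrom = defaultdict(list)
    for name, chrom, size in fragments:
        by_chrom[chrom].append((name, size))

    plan = {}
    for chrom, frags in by_chrom.items():
        batches, current, current_size = [], [], 0
        for name, size in frags:
            # Start a new batch when adding this fragment would exceed the target.
            if current and current_size + size > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            batches.append(current)
        plan[chrom] = batches
    return plan

fragments = [
    ("f1", "chr1", 600 * MIB),
    ("f2", "chr1", 500 * MIB),
    ("f3", "chr1", 300 * MIB),
    ("f4", "chr2", 200 * MIB),
]
plan = plan_consolidation(fragments)
print(plan)
```

Because each chromosome's batches are independent, each entry of the plan could be handed to a separate worker, which mirrors the per-group parallelism described above.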

Deletion

When a sample is deleted from the dataset, the variant statistics must be adjusted to reflect the deletion. TileDB-VCF handles this by reusing the aggregation mechanism already in place.

The process of deleting a sample includes adding negative counts to the variant_stats and allele_count arrays. When an aggregated statistic is calculated, these negative counts reduce the aggregated counts to the correct value, as if the sample had never been ingested.
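A minimal sketch of this idea, assuming each fragment can be modeled as a dict of counts: deleting a sample writes a fragment that negates that sample's contribution, and the usual summation then cancels it out. The aggregate function and the fragment dicts are hypothetical, and dropping keys whose total reaches zero is a choice made for this example.

```python
from collections import Counter

def aggregate(fragments):
    """Sum counts across fragments; negative counts from deletions cancel earlier ones."""
    total = Counter()
    for frag in fragments:
        total.update(frag)  # update() adds values, including negative ones
    # Drop keys that net out to zero (an assumption of this sketch).
    return {key: value for key, value in total.items() if value != 0}

# Hypothetical per-sample contributions written at ingestion time.
ingest_s1 = {("chr1", 100, "T"): 1, ("chr1", 200, "G"): 2}
ingest_s2 = {("chr1", 100, "T"): 1}

# Deleting sample s1 appends a fragment with the negated counts.
delete_s1 = {key: -value for key, value in ingest_s1.items()}

after_delete = aggregate([ingest_s1, ingest_s2, delete_s1])
never_ingested = aggregate([ingest_s2])
print(after_delete)
```

The aggregated result after the deletion matches the result of a dataset in which the sample was never ingested, without rewriting any existing fragment.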

For the sample_stats array, deletion is more straightforward: all statistics for the deleted sample are removed.
