Annotations

How to produce and manage variant or interval annotations with TileDB-VCF.

Annotations are genomic metadata linked to variants. Together with sample metadata, they are essential to most genomic analyses and can be used to filter, aggregate, and visualize variant data.

Two primary types of annotation formats exist that you can use with TileDB-VCF:

VCF-embedded annotations
External table annotations

TileDB-VCF supports both types of annotations, although external annotations can be queried in more ways.

Two primary schemas exist that TileDB-VCF uses for annotation joins:

Allelic changes (chr-pos-ref-alt)
Genomic intervals (chr-start-end)

TileDB-VCF supports both of these join approaches, often through logic written as user-defined functions (UDFs) and distributed with task graphs provided by TileDB Cloud.
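As a rough illustration of the two join schemas, the sketch below uses plain Python rather than TileDB Cloud UDFs or task graphs. The record layouts, the field names, and the helpers `join_by_allele` and `join_by_interval` are hypothetical, and intervals are assumed to be inclusive at both ends.

```python
from collections import defaultdict


def join_by_allele(variants, annotations):
    """Attach annotations whose (chrom, pos, ref, alt) key matches a variant exactly."""
    index = defaultdict(list)
    for ann in annotations:
        index[(ann["chrom"], ann["pos"], ann["ref"], ann["alt"])].append(ann)

    joined = []
    for var in variants:
        key = (var["chrom"], var["pos"], var["ref"], var["alt"])
        # One variant may match several annotation rows (for example, one per transcript).
        for ann in index.get(key, []):
            joined.append({**var, "annotation": ann["label"]})
    return joined


def join_by_interval(variants, intervals):
    """Attach interval annotations whose [start, end] range contains the variant position."""
    joined = []
    for var in variants:
        for iv in intervals:
            if iv["chrom"] == var["chrom"] and iv["start"] <= var["pos"] <= iv["end"]:
                joined.append({**var, "annotation": iv["label"]})
    return joined


# Hypothetical example records, not taken from any real dataset.
variants = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 500, "ref": "C", "alt": "T"},
]
allele_annotations = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "label": "missense_variant"},
]
interval_annotations = [
    {"chrom": "chr1", "start": 400, "end": 600, "label": "GENE_X"},
]

allele_hits = join_by_allele(variants, allele_annotations)
interval_hits = join_by_interval(variants, interval_annotations)
```

The allele join is an exact key match, while the interval join is a containment test. This difference is why interval annotations pair naturally with range queries.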

Embedded variant annotation

Variant annotations are often embedded in VCF files as serialized INFO fields (e.g., info_InbreedingCoeff). TileDB-VCF faithfully ingests these embedded annotations, which you can select with the attrs argument of tiledbvcf.Dataset.read. Once read, they can be used in downstream filters and joins, such as those provided by pandas.

Caution

Embedded INFO annotations are encoded in htslib data types and cannot be used with TileDB attribute filters. Embedded INFO annotations should be selected as attributes and then filtered in pandas after the TileDB-VCF dataset is read.
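The read-then-filter pattern described above might look like the following sketch. It assumes the INFO value arrives in the DataFrame either as a scalar or as a short list-like cell. The helpers `first_value`, `filter_by_inbreeding`, and `read_variants` are hypothetical, as are the threshold and the example rows. The attribute names passed to `attrs` are assumptions about what a given dataset contains.

```python
import pandas as pd


def first_value(cell):
    """Return the first element of a list-like cell, or the cell itself if it is scalar."""
    if isinstance(cell, (str, bytes)):
        return cell
    try:
        return cell[0] if len(cell) > 0 else None
    except TypeError:
        # Scalars have no len(); pass them through unchanged.
        return cell


def filter_by_inbreeding(df, min_coeff):
    """Keep rows whose info_InbreedingCoeff is at least min_coeff (filtering done in pandas)."""
    coeff = pd.to_numeric(df["info_InbreedingCoeff"].map(first_value), errors="coerce")
    return df[coeff >= min_coeff]


def read_variants(uri, regions):
    """Read selected attributes from a TileDB-VCF dataset (not executed in this sketch)."""
    import tiledbvcf  # imported lazily so the rest of the sketch runs without it

    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.read(
        attrs=["sample_name", "contig", "pos_start", "alleles", "info_InbreedingCoeff"],
        regions=regions,
    )


# Hypothetical stand-in for the DataFrame returned by read_variants().
example = pd.DataFrame(
    {
        "sample_name": ["s1", "s2", "s3"],
        "pos_start": [100, 200, 300],
        "info_InbreedingCoeff": [[0.12], [-0.4], [0.05]],
    }
)
kept = filter_by_inbreeding(example, min_coeff=0.0)
```

Normalizing the cell with `first_value` before comparing keeps the filter independent of how the INFO field was decoded.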

Annotation in TileDB arrays

TileDB-VCF also supports the use of external annotation stored as TileDB arrays. This approach allows attributes to be queried using SQL or other APIs prior to or in parallel with TileDB-VCF queries.

VEP and SnpEff annotation

TileDB Cloud supports in-house methods of annotating TileDB-VCF datasets with Ensembl Variant Effect Predictor (VEP) and SnpEff.

In both cases, TileDB Cloud converts annotated VCFs into generic TileDB tables with chromosome and position as dimensions. The reference and alternate columns are preserved, so there should be at least one annotation for every variant in a TileDB dataset.

Because annotation is variant-specific rather than sample-specific, one annotation database can serve multiple TileDB-VCF datasets as long as the species, genome freeze, annotator version, and annotation options are the same. This means only novel variants need to be annotated when new datasets are introduced.
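One way to picture the "only novel variants" step is as a set difference on the chr-pos-ref-alt key. This is a minimal sketch with hypothetical record layouts and helper names (`variant_key`, `novel_variants`). It does not show how TileDB Cloud performs the step internally.

```python
def variant_key(record):
    """Build the chr-pos-ref-alt key that identifies an allelic change."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def novel_variants(dataset_variants, annotated_keys):
    """Return keys present in the new dataset but absent from the annotation table."""
    seen = set(annotated_keys)
    novel = []
    for rec in dataset_variants:
        key = variant_key(rec)
        if key not in seen:
            seen.add(key)  # also de-duplicates variants shared by several samples
            novel.append(key)
    return novel


# Hypothetical keys already present in a shared annotation table.
already_annotated = {("chr1", 100, "A", "G")}

# Hypothetical variants from a newly ingested dataset (two samples share one variant).
new_dataset = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 250, "ref": "T", "alt": "C"},
    {"chrom": "chr2", "pos": 250, "ref": "T", "alt": "C"},
]

to_annotate = novel_variants(new_dataset, already_annotated)
```

Only the keys in `to_annotate` would need to be sent to the annotator. The rest can reuse existing entries, provided the annotator version and options match.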

Annotating against primary transcript sets such as MANE, or enabling the VEP option --pick or --pick_allele, reduces the number of results for each variant when a gene has multiple transcripts. In all cases, each combination of allele change and transcript receives a separate entry in the TileDB annotation table.

Genomic interval annotation

Interval-based reference annotation sources, including gene models such as Ensembl and metrics linked to genes or genomic intervals, can be stored in a TileDB array and used to annotate TileDB-VCF datasets. This is particularly useful for gene expression, chromatin accessibility, and other features that are specific to genomic intervals. These annotations can be filtered and joined with TileDB-VCF datasets using range queries. To select variants in a gene of interest, first obtain the gene's start and end positions, then use them to filter the TileDB-VCF dataset.
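The gene-of-interest workflow can be sketched as a lookup followed by a region string. The gene table, its coordinates, and the helper `gene_region` are hypothetical. The `contig:start-end` format is an assumption about what the `regions` argument of tiledbvcf.Dataset.read accepts, so check it against your installed version.

```python
def gene_region(gene_table, gene_name):
    """Return a 'contig:start-end' region string for a gene in the lookup table."""
    contig, start, end = gene_table[gene_name]  # raises KeyError for unknown genes
    if start > end:
        raise ValueError(f"start {start} is greater than end {end} for {gene_name}")
    return f"{contig}:{start}-{end}"


# Hypothetical interval annotations: gene name -> (contig, start, end).
gene_table = {
    "GENE_X": ("chr1", 400, 600),
    "GENE_Y": ("chr2", 1000, 2500),
}

region = gene_region(gene_table, "GENE_X")

# The region could then restrict a TileDB-VCF read, for example:
#   ds = tiledbvcf.Dataset(uri, mode="r")
#   df = ds.read(attrs=["sample_name", "pos_start", "alleles"], regions=[region])
```

Resolving the interval first keeps the variant query narrow, so only the slice of the dataset that overlaps the gene is read.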

What is next?

The variant annotations tutorial provides a hands-on guide to joining variant annotation sources with TileDB-VCF datasets.