Annotations
Annotations refer to any sort of genomic metadata linked to variants. These, along with sample metadata, are essential elements of most genomic analyses. Annotations can be used for filtering, aggregation, and visualization of variant data.
Two primary types of annotation formats exist that you can use with TileDB-VCF:
- VCF-embedded annotations
- External table annotations
TileDB-VCF supports both types of annotations, although external annotations provide more query approaches.
Two primary schemas exist that TileDB-VCF uses for annotation joins:
- Allelic changes (chr-pos-ref-alt)
- Genomic intervals (chr-start-end)
TileDB-VCF supports both of these join approaches, often with the use of logic contained within user-defined functions (UDFs) and distributed by task graphs provided by TileDB Cloud.
Embedded variant annotation
Variant annotation is often embedded in VCF files as serialized INFO fields (e.g., info_InbreedingCoeff). TileDB-VCF faithfully ingests these embedded variant annotations. They are accessible through the attrs argument of tiledbvcf.Dataset.read and can be used in downstream filters and joins such as those provided by pandas.
Embedded INFO annotations are encoded in htslib data types and cannot be used with TileDB attribute filters. Embedded INFO annotations should be selected as attributes and then filtered in pandas after the TileDB-VCF dataset is read.
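The select-then-filter pattern can be sketched as follows. This is a minimal illustration, not the library's prescribed method: the DataFrame is built by hand as a stand-in for the result of a read, the helper names first_scalar and filter_by_info are hypothetical, and the sketch assumes the INFO value arrives either as a number or as a one-element sequence per row. Depending on the field's declared type, real data may need additional decoding first.

```python
import pandas as pd


def first_scalar(value):
    """Unwrap a one-element sequence; pass scalars through; None if empty."""
    if value is None:
        return None
    if hasattr(value, "__len__") and not isinstance(value, (str, bytes)):
        return value[0] if len(value) > 0 else None
    return value


def filter_by_info(df, column, min_value):
    """Keep rows whose INFO column is numeric and >= min_value (hypothetical helper)."""
    values = pd.to_numeric(df[column].map(first_scalar), errors="coerce")
    return df[values >= min_value]


# Hand-built stand-in for the frame returned by a read such as:
#   ds = tiledbvcf.Dataset(uri, mode="r")
#   df = ds.read(attrs=["sample_name", "contig", "pos_start", "info_InbreedingCoeff"])
variants = pd.DataFrame(
    {
        "sample_name": ["s1", "s1", "s2"],
        "contig": ["chr1", "chr1", "chr2"],
        "pos_start": [100, 250, 300],
        "info_InbreedingCoeff": [[0.12], [-0.40], [0.05]],
    }
)

# The filter runs in pandas, after the read, rather than as a TileDB attribute filter.
kept = filter_by_info(variants, "info_InbreedingCoeff", 0.0)
```

Rows whose value is missing or cannot be converted to a number are dropped by the comparison, which is usually the safer default for a quality filter.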
Annotation in TileDB arrays
TileDB-VCF also supports the use of external annotation stored as TileDB arrays. This approach allows attributes to be queried using SQL or other APIs prior to or in parallel with TileDB-VCF queries.
VEP and SnpEff annotation
TileDB Cloud supports in-house methods of annotating TileDB-VCF datasets with Ensembl Variant Effect Predictor (VEP) and SnpEff.
In both cases, TileDB Cloud converts annotated VCFs into generic TileDB tables with chromosome and position as dimensions. The reference and alternate columns are preserved, so there should be at least one annotation for every variant in a TileDB dataset.
Because annotation is variant-specific rather than sample-specific, one annotation database can be used for multiple TileDB-VCF datasets as long as the species, genome freeze, annotator version, and annotation options are uniform. This means only novel variants need to be annotated when new datasets are introduced.
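One way to find the variants that still need annotation is a set difference on the chr-pos-ref-alt key. The sketch below uses only the standard library; the tuple layout and the function names variant_key and novel_variants are illustrative assumptions, not part of TileDB-VCF.

```python
def variant_key(contig, pos, ref, alt):
    """Normalize a variant into a hashable chr-pos-ref-alt key."""
    return (contig, int(pos), ref.upper(), alt.upper())


def novel_variants(dataset_variants, annotated_keys):
    """Return keys present in the new dataset but absent from the annotation table.

    Order of first appearance is preserved and duplicates are reported once.
    """
    seen = set(annotated_keys)
    novel = []
    for contig, pos, ref, alt in dataset_variants:
        key = variant_key(contig, pos, ref, alt)
        if key not in seen:
            seen.add(key)
            novel.append(key)
    return novel


# Keys already present in a shared annotation table (made-up values).
annotated = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}

# Variants observed in a newly ingested dataset (made-up values).
incoming = [
    ("chr1", 100, "a", "g"),  # already annotated, differs only in case
    ("chr2", 300, "G", "A"),  # novel
    ("chr2", 300, "G", "A"),  # duplicate of the novel variant
]

to_annotate = novel_variants(incoming, annotated)
```

Only the variants in to_annotate would need to be passed to the annotator, provided the new dataset shares the species, genome freeze, annotator version, and options of the existing table.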
Annotating against primary transcript sets such as MANE or activating the VEP option --pick or --pick_allele will reduce the number of results for each variant in cases where multiple transcripts per gene are present. In all cases, each allele change and transcript receives a separate entry in the TileDB annotation table.
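The effect of one-entry-per-transcript on a chr-pos-ref-alt join can be sketched with pandas. The column names (contig, pos, ref, alt, transcript, consequence), the transcript identifiers, and the annotate helper are all hypothetical; the actual annotation table schema may differ.

```python
import pandas as pd

# Hypothetical join key matching the chr-pos-ref-alt scheme.
KEY = ["contig", "pos", "ref", "alt"]


def annotate(variants, annotations):
    """Left-join variants to annotations on the allelic-change key."""
    return variants.merge(annotations, on=KEY, how="left")


variants = pd.DataFrame(
    {
        "contig": ["chr1", "chr1"],
        "pos": [100, 200],
        "ref": ["A", "C"],
        "alt": ["G", "T"],
    }
)

# One row per allele change and transcript, so the first variant appears twice.
annotations = pd.DataFrame(
    {
        "contig": ["chr1", "chr1", "chr1"],
        "pos": [100, 100, 200],
        "ref": ["A", "A", "C"],
        "alt": ["G", "G", "T"],
        "transcript": ["TX1", "TX2", "TX3"],
        "consequence": ["missense_variant", "synonymous_variant", "intron_variant"],
    }
)

joined = annotate(variants, annotations)

# Number of annotation rows each variant received after the join.
rows_per_variant = joined.groupby(KEY, sort=False).size().tolist()
```

The joined frame has more rows than the variant frame whenever a variant overlaps several transcripts, which is worth keeping in mind before counting variants or samples downstream.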
Genomic interval annotation
Interval-based reference annotation sources, including gene models such as Ensembl and metrics linked to genes or genomic intervals, can be stored in a TileDB array and used to annotate TileDB-VCF datasets. This is particularly useful for gene expression, chromatin accessibility, and other genomic features that are specific to genomic intervals. The annotations can be used to filter and join with TileDB-VCF datasets using range queries. For selecting variants in a gene of interest, it is advisable to first obtain the start and end positions and use those to filter the TileDB-VCF dataset.
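The look-up-then-filter approach for a gene of interest can be sketched as below, using only the standard library. The gene table, gene names, coordinates, and helper names are made up for illustration, and coordinates are assumed to be 1-based and inclusive.

```python
# Hypothetical gene model table: gene name -> (contig, start, end).
GENES = {
    "GENE_A": ("chr1", 1000, 2000),
    "GENE_B": ("chr2", 500, 900),
}


def gene_region(genes, name):
    """Format a gene's interval as a 'contig:start-end' region string."""
    contig, start, end = genes[name]
    return f"{contig}:{start}-{end}"


def variants_in_gene(variants, genes, name):
    """Keep variants whose position falls inside the gene interval (inclusive)."""
    contig, start, end = genes[name]
    return [v for v in variants if v[0] == contig and start <= v[1] <= end]


variants = [
    ("chr1", 999, "A", "G"),   # just outside the interval
    ("chr1", 1000, "C", "T"),  # on the start boundary
    ("chr1", 2000, "G", "A"),  # on the end boundary
    ("chr2", 1500, "T", "C"),  # different contig
]

region = gene_region(GENES, "GENE_A")
hits = variants_in_gene(variants, GENES, "GENE_A")

# With a real dataset, the region string could be supplied to the read instead
# of filtering in memory, for example ds.read(regions=[region]); check the
# expected coordinate convention against the TileDB-VCF documentation.
```

Pushing the interval into the read as a region keeps the query a range query on the dataset, so only the slice covering the gene is retrieved.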
The variant annotations tutorial provides a hands-on guide to joining variant annotation sources with TileDB-VCF datasets.