Learn about joining variant annotation sources with TileDB-VCF datasets.
How to run this tutorial
This tutorial runs only against TileDB Cloud, which offers a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
This tutorial demonstrates how to query external annotation tables (stored as TileDB arrays) and join the returned information against TileDB-VCF datasets.
Import the necessary libraries and set the URIs used in this tutorial. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for how to set your TileDB Cloud credentials in a configuration object (you can skip this step inside a TileDB Cloud notebook).
import os

import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledb.cloud.vcf.vcf_toolbox as vtb
import tiledbvcf

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Public URI datasets to be used in this tutorial
vep_uri = "tiledb://tiledb-genomics-dev/vep_20230726_6"
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
First, create a function that searches for terms in the Consequence field of a VEP annotation table:
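A minimal sketch of such a function, assuming the annotation rows have already been read into a pandas DataFrame with a Consequence column in which multiple terms may be joined by "&" or ",". The helper name filter_consequence, the other column names, and the toy data are hypothetical and not part of the TileDB API:

```python
import re

import pandas as pd


def filter_consequence(vep_df: pd.DataFrame, terms: list) -> pd.DataFrame:
    """Keep rows whose Consequence field contains any of the given terms."""
    wanted = set(terms)

    def has_term(value: str) -> bool:
        # Assumption: multiple consequence terms are joined by "&" or ","
        return bool(wanted.intersection(re.split(r"[&,]", value)))

    mask = vep_df["Consequence"].apply(has_term)
    return vep_df[mask].reset_index(drop=True)


# Toy annotation rows (made-up values, for illustration only)
vep_df = pd.DataFrame(
    {
        "contig": ["chr21", "chr21", "chr21"],
        "pos_start": [100, 200, 300],
        "ref": ["A", "C", "G"],
        "alt": ["G", "T", "A"],
        "Consequence": [
            "missense_variant",
            "intron_variant",
            "stop_gained&splice_region_variant",
        ],
    }
)

# Keep only rows annotated with one of the requested terms
coding_df = filter_consequence(vep_df, ["missense_variant", "stop_gained"])
```

Matching on whole terms rather than substrings avoids accidental hits, for example a search for stop_gained matching a longer term that merely contains it.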
Split the resulting alleles column to generate ref and alt columns for a join, and perform a join to merge the VCF and VEP data frames. The resulting data frame is limited to coding variants found in the TileDB-VCF dataset for this region.
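The split and join can be sketched with plain pandas. This is an illustration on toy data, assuming the VCF data frame has an alleles column holding the reference allele followed by a single alternate allele, and that the VEP data frame already has ref and alt columns. The column names and values are assumptions, not output from the datasets above:

```python
import pandas as pd

# Toy VCF query result (made-up values); alleles holds [ref, alt] per row
vcf_df = pd.DataFrame(
    {
        "sample_name": ["NA12878", "NA12878"],
        "contig": ["chr21", "chr21"],
        "pos_start": [100, 200],
        "alleles": [["A", "G"], ["C", "T"]],
    }
)

# Toy filtered VEP rows (hypothetical column names)
vep_df = pd.DataFrame(
    {
        "contig": ["chr21"],
        "pos_start": [100],
        "ref": ["A"],
        "alt": ["G"],
        "Consequence": ["missense_variant"],
    }
)

# Split the alleles column into ref and alt (assumes biallelic records)
vcf_df["ref"] = vcf_df["alleles"].map(lambda a: a[0])
vcf_df["alt"] = vcf_df["alleles"].map(lambda a: a[1])

# An inner join keeps only variants present in both data frames
merged = vcf_df.drop(columns="alleles").merge(
    vep_df, on=["contig", "pos_start", "ref", "alt"], how="inner"
)
```

Joining on ref and alt in addition to position prevents a variant from picking up the annotation of a different allele at the same site.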
TileDB Cloud offers a convenience utility called annotate in the tiledb.cloud.vcf.vcf_toolbox package that performs joins like the ones described above more easily and in a distributed fashion (similar to Tutorials: Scalable Queries). This approach works well with the transform function included in tiledb.cloud.vcf (see Tutorials: Query Transforms for more information on query transforms).
Here is the documentation of the underlying _annotate function:
from tiledb.cloud.vcf.vcf_toolbox.annotate import _annotate

help(_annotate)
Help on function _annotate in module tiledb.cloud.vcf.vcf_toolbox.annotate:

_annotate(vcf_df: pandas.core.frame.DataFrame, *, ann_uri: str, ann_regions: Union[str, Sequence[str]], ann_attrs: Union[Sequence[str], str, NoneType] = None, vcf_filter: Optional[str] = None, split_multiallelic: bool = True, add_zygosity: bool = False, reorder: Optional[Sequence[str]] = None, rename: Optional[Mapping[str, str]] = None, verbose: bool = False) -> pandas.core.frame.DataFrame
    Annotate a VCF DataFrame with annotations from a TileDB array.

    :param vcf_df: VCF DataFrame to annotate
    :param ann_uri: URI of the annotation array
    :param ann_regions: regions to annotate. All regions must be in the same
        chromosome/contig.
    :param ann_attrs: annotation attributes to read,
        defaults to None which queries all attributes.
    :param vcf_filter: a pandas filter to apply to the VCF DataFrame before annotation,
        defaults to None
    :param split_multiallelic: split multiallelic variants into separate rows,
        defaults to True
    :param add_zygosity: add zygosity column to the DataFrame, defaults to False
    :param reorder: list of columns to reorder (before renaming), defaults to None
    :param rename: dict of columns to rename, defaults to None
    :param verbose: enable verbose logging, defaults to False
    :return: annotated VCF DataFrame
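To illustrate what splitting multiallelic variants into separate rows means, here is a rough pandas sketch on toy data. It is not the library's implementation; the column names and values are assumptions chosen for the example:

```python
import pandas as pd

# Toy VCF rows (made-up values); the first record has two alternate alleles
vcf_df = pd.DataFrame(
    {
        "contig": ["chr21", "chr21"],
        "pos_start": [100, 200],
        "alleles": [["A", "G", "T"], ["C", "T"]],
    }
)

# The first allele is the reference; the remaining ones are alternates
vcf_df["ref"] = vcf_df["alleles"].map(lambda a: a[0])
vcf_df["alt"] = vcf_df["alleles"].map(lambda a: list(a[1:]))

# Emit one row per alternate allele
split_df = (
    vcf_df.drop(columns="alleles")
    .explode("alt")
    .reset_index(drop=True)
)
```

After the split, each row carries exactly one ref/alt pair, which is the shape an allele-level join against an annotation table needs.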
Configure and run the VCF query with VEP annotations on a small region of NA12878.
regions = "chr21:26973732-27213386"

# Run the VCF query with annotation
df = tiledb.cloud.vcf.read(
    dataset_uri=vcf_uri,
    regions=regions,
    samples="NA12878",
    transform_result=vtb.annotate(
        ann_uri=vep_uri,
        ann_regions=regions,
    ),
).to_pandas()
df