External Annotations

Learn about joining variant annotation sources with TileDB-VCF datasets.

How to run this tutorial

This tutorial runs only on TileDB Cloud, which has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.

This tutorial demonstrates how to query external annotation tables (stored as TileDB arrays) and join the returned information against TileDB-VCF datasets.

Import the necessary libraries and set the URIs used in this tutorial. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (you can omit this step inside a TileDB Cloud notebook).

Python

import os

import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledb.cloud.vcf.vcf_toolbox as vtb
import tiledbvcf

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]


# Public URI datasets to be used in this tutorial
vep_uri = "tiledb://tiledb-genomics-dev/vep_20230726_6"
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)

First, create a function that searches for terms in the Consequence field of a VEP annotation table:

Python

def query_vep_by_consequence(
    vep_uri: str = None, consequence_list: list = None, genomic_coordinates: list = None
):
    """
    Query a VEP variant annotation array for selected consequences.
    Arguments:
        @param vep_uri: the URI of the VEP annotation array
        @param consequence_list: the list of VEP consequences
        @param genomic_coordinates: the list of genomic regions (chr:start-end)
    Returns:
        A pyarrow Table of variants and consequences
    Todo:
        Accept gene lists rather than genomic coordinates
    """
    import re

    import pandas
    import pyarrow
    import tiledb
    import tiledb.cloud

    with tiledb.open(vep_uri, ctx=tiledb.cloud.Ctx()) as vep_array_obj:
        resdfs = []
        for genomic_coordinate in genomic_coordinates:
            # Parse "chr:start-end" into contig, start, and end
            regexgroups = re.match(
                "(chr[0-9XYMT]+):([0-9]+)-([0-9]+)", genomic_coordinate.replace(" ", "")
            )
            if regexgroups is None:
                raise ValueError(f"Invalid region: {genomic_coordinate}")
            regexchr = regexgroups.group(1)
            regexstart = int(regexgroups.group(2))
            regexend = int(regexgroups.group(3))
            resdfs += [
                vep_array_obj.query(
                    cond=f"Consequence in {consequence_list}",
                    dims=[
                        "contig",
                        "pos_start",
                    ],
                    attrs=[
                        "ref",
                        "alt",
                        "Gene",
                        "Feature",
                        "Feature_type",
                        "Consequence",
                        "cDNA_position",
                        "CDS_position",
                        "Protein_position",
                        "Amino_acids",
                        "Codons",
                    ],
                ).df[regexchr, regexstart:regexend]
            ]
        results_df = pandas.concat(resdfs, ignore_index=True)
    results = pyarrow.Table.from_pandas(results_df, preserve_index=False)
    return results
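The coordinate parsing inside the function can be tried in isolation, without a TileDB Cloud connection. The following is a minimal sketch: parse_region and REGION_RE are hypothetical names that are not part of any TileDB package, and the sketch uses fullmatch (stricter than the re.match call in the function) so that trailing characters are rejected. The last lines show how the Python list is rendered into the condition string passed as cond.

```python
import re
from typing import Tuple

# Hypothetical helper for illustration; not part of any TileDB package.
REGION_RE = re.compile(r"(chr[0-9XYMT]+):([0-9]+)-([0-9]+)")


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a 'chr:start-end' string into (contig, start, end)."""
    cleaned = region.replace(" ", "")
    match = REGION_RE.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"Expected 'chr:start-end', got {region!r}")
    contig = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    if start > end:
        raise ValueError(f"Start is greater than end in {region!r}")
    return contig, start, end


print(parse_region("chr2: 178525989 - 178830802"))
# ('chr2', 178525989, 178830802)

# The f-string in the function embeds the list's repr in the condition.
condition = f"Consequence in {['frameshift_variant']}"
print(condition)
# Consequence in ['frameshift_variant']
```

Failing early on a malformed region gives a clearer error than the AttributeError that a failed match would otherwise produce.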

Use this function to search for frameshift variants in the TTN gene in 1000 Genomes samples, as was done in this manuscript.

Python

consequence_results = query_vep_by_consequence(
    vep_uri=vep_uri,
    consequence_list=["frameshift_variant"],
    genomic_coordinates=["chr2:178525989-178830802"],
)
consequence_results.to_pandas()

	contig	pos_start	ref	alt	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons
0	chr2	178527497	TA	T	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	107853	107628	35876	N/X	aaT/aa
1	chr2	178530599	TC	T	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	106240	106015	35339	D/X	Gat/at
2	chr2	178531200	T	TC	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	105639-105640	105414-105415	35138-35139	-/X	-/G
3	chr2	178531203	TG	T	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	105636	105411	35137	S/X	tcC/tc
4	chr2	178531285	T	TG	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	105554-105555	105329-105330	35110	E/DX	gaa/gaCa
...	...	...	...	...	...	...	...	...	...	...	...	...	...
124	chr2	178774924	C	CGTTGTTG	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	7011-7012	6786-6787	2262-2263	-/QQX	-/CAACAAC
125	chr2	178774927	CAA	C	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	7007-7008	6782-6783	2261	I/X	aTT/a
126	chr2	178779363	CT	C	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	4053	3828	1276	E/X	gaA/ga
127	chr2	178785866	TC	T	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	2576	2351	784	G/X	gGa/ga
128	chr2	178793484	C	CT	ENSG00000155657	ENST00000589042	Transcript	frameshift_variant	1680-1681	1455-1456	485-486	-/X	-/A

129 rows × 13 columns

You can turn the output of this function into a list of search loci (chr:start-end) that works with a TileDB-VCF read query.

Use query_vep_by_consequence to search for coding mutations in the region chr16:30915000-30975000 and generate loci of interest.

Python

vep_res_arrow = query_vep_by_consequence(
    vep_uri=vep_uri,
    consequence_list=["missense_variant", "nonsense_variant", "frameshift_variant"],
    genomic_coordinates=["chr16:30915000-30975000"],
)
vep_res = vep_res_arrow.to_pandas()
query_loci = vep_res.apply(
    lambda x: f"{x['contig']}:{x['pos_start']}-{x['pos_start']}", axis=1
).tolist()
len(query_loci)
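If the annotation table holds more than one row per position (for example, one row per transcript), the list built above will contain repeated loci. The sketch below shows one way to deduplicate before querying. It runs on a small hand-made data frame: toy_vep, its positions, and the placeholder transcript IDs are assumptions for illustration, not values from the VEP array.

```python
import pandas as pd

# Toy stand-in for vep_res; the rows are made up for illustration.
toy_vep = pd.DataFrame(
    {
        "contig": ["chr16", "chr16", "chr16"],
        "pos_start": [30953571, 30953571, 30964724],
        "Feature": ["ENST_A", "ENST_B", "ENST_A"],  # placeholder transcript IDs
    }
)

# Keep one row per (contig, pos_start) pair, preserving first-seen order.
unique_sites = toy_vep[["contig", "pos_start"]].drop_duplicates()

# Build single-base regions in the chr:start-end form.
toy_loci = [
    f"{contig}:{pos}-{pos}"
    for contig, pos in zip(unique_sites["contig"], unique_sites["pos_start"])
]
print(toy_loci)
# ['chr16:30953571-30953571', 'chr16:30964724-30964724']
```

Whether deduplication changes the query result depends on how the reader treats repeated regions, so treat it as an optional tidy-up step.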

Query the DRAGEN 1000 Genomes dataset on those loci:

Python

ds = tiledbvcf.Dataset(vcf_uri, tiledb_config=tiledb.cloud.Config())
attrs = ds.attributes()
vcf_df = ds.read(regions=query_loci, samples=None, attrs=attrs)
vcf_df

	alleles	contig	filters	fmt	fmt_AD	fmt_AF	fmt_DP	fmt_F1R2	fmt_F2R1	fmt_GP	...	info_R2_5P_bias	info_ReadPosRankSum	info_SOR	pos_end	pos_start	qual	query_bed_end	query_bed_line	query_bed_start	sample_name
0	[T, TC]	chr16	[PASS]	[0, 0, 0, 0]	[18, 13]	[0.419]	31	[11, 7]	[7, 6]	[48.238, 8.6978e-05, 53.0]	...	[-0.172]	[-1.541]	[2.147]	30953571	30953571	48.240002	30953571	-1	30953570	NA18868
1	[G, T]	chr16	[PASS]	[0, 0, 0, 0]	[21, 18]	[0.462]	39	[4, 9]	[17, 9]	[49.354, 7.2222e-05, 53.0]	...	[20.499]	[3.761]	[0.646]	30958751	30958751	49.349998	30958751	-1	30958750	NA18871
2	[G, A]	chr16	[PASS]	[0, 0, 0, 0]	[10, 20]	[0.667]	30	[8, 7]	[2, 13]	[50.0, 0.00016257, 45.622]	...	[-3.06]	[2.354]	[0.765]	30964724	30964724	50.000000	30964724	-1	30964723	NA18619
3	[A, T]	chr16	[PASS]	[0, 0, 0, 0]	[26, 21]	[0.447]	47	[10, 11]	[16, 10]	[48.754, 7.9471e-05, 53.0]	...	[1.227]	[3.691]	[0.905]	30964989	30964989	48.750000	30964989	-1	30964988	NA18626
4	[C, G]	chr16	[PASS]	[0, 0, 0, 0]	[19, 10]	[0.345]	29	[5, 5]	[14, 5]	[44.034, 0.00019337, 53.0]	...	[-6.998]	[3.556]	[0.555]	30965664	30965664	44.029999	30965664	-1	30965663	NA18570
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
263	[T, A]	chr16	[PASS]	[0, 0, 0, 0]	[21, 25]	[0.543]	46	[12, 18]	[9, 7]	[50.0, 6.9116e-05, 52.272]	...	[-4.939]	[4.014]	[0.523]	30964808	30964808	50.000000	30964808	-1	30964807	HG03022
264	[C, G]	chr16	[PASS]	[0, 0, 0, 0]	[17, 24]	[0.585]	41	[5, 11]	[12, 13]	[50.0, 8.2059e-05, 50.516]	...	[8.578]	[3.533]	[0.637]	30965042	30965042	50.000000	30965042	-1	30965041	HG03096
265	[G, A]	chr16	[PASS]	[0, 0, 0, 0]	[24, 20]	[0.455]	44	[8, 5]	[16, 15]	[49.279, 7.2999e-05, 53.0]	...	[-6.616]	[1.638]	[0.43]	30966144	30966144	49.279999	30966144	-1	30966143	HG03168
266	[G, A]	chr16	[PASS]	[0, 0, 0, 0]	[20, 28]	[0.583]	48	[11, 11]	[9, 17]	[50.0, 8.1542e-05, 50.576]	...	[-22.176]	[2.75]	[1.565]	30966144	30966144	50.000000	30966144	-1	30966143	HG03170
267	[G, T]	chr16	[PASS]	[0, 0, 0, 0]	[18, 21]	[0.538]	39	[12, 14]	[6, 7]	[50.0, 6.8339e-05, 52.426]	...	[20.639]	[3.169]	[1.432]	30971395	30971395	50.000000	30971395	-1	30971394	HG03063

268 rows × 42 columns

Split the resulting alleles column into ref and alt columns, then merge the VCF and VEP data frames on contig, pos_start, ref, and alt. The resulting data frame is limited to coding variants found in the TileDB-VCF dataset for this region.

Python

vcf_df["ref"] = vcf_df["alleles"].str[0]
vcf_df["alt"] = vcf_df["alleles"].apply(lambda x: ",".join(x[1:]))
vcf_df = vcf_df.drop("alleles", axis=1)
vcf_df.merge(vep_res, on=["contig", "pos_start", "ref", "alt"])

	contig	filters	fmt	fmt_AD	fmt_AF	fmt_DP	fmt_F1R2	fmt_F2R1	fmt_GP	fmt_GQ	...	alt	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons
0	chr16	[PASS]	[0, 0, 0, 0]	[18, 13]	[0.419]	31	[11, 7]	[7, 6]	[48.238, 8.6978e-05, 53.0]	47	...	TC	ENSG00000175938	ENST00000318663	Transcript	frameshift_variant	837-838	615-616	205-206	-/X	-/C
1	chr16	[PASS]	[0, 0, 0, 0]	[19, 17]	[0.472]	36	[10, 11]	[9, 6]	[49.758, 6.7563e-05, 53.0]	48	...	TC	ENSG00000175938	ENST00000318663	Transcript	frameshift_variant	837-838	615-616	205-206	-/X	-/C
2	chr16	[PASS]	[0, 0, 0, 0]	[28, 30]	[0.517]	58	[13, 10]	[15, 20]	[50.0, 6.5751e-05, 52.896]	48	...	TC	ENSG00000175938	ENST00000318663	Transcript	frameshift_variant	837-838	615-616	205-206	-/X	-/C
3	chr16	[PASS]	[0, 0, 0, 0]	[20, 25]	[0.556]	45	[7, 7]	[13, 18]	[50.0, 7.1187e-05, 51.927]	48	...	TC	ENSG00000175938	ENST00000318663	Transcript	frameshift_variant	837-838	615-616	205-206	-/X	-/C
4	chr16	[PASS]	[0, 0, 0, 0]	[23, 28]	[0.549]	51	[12, 14]	[11, 14]	[50.0, 6.9893e-05, 52.166]	48	...	TC	ENSG00000175938	ENST00000318663	Transcript	frameshift_variant	837-838	615-616	205-206	-/X	-/C
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
260	chr16	[PASS]	[0, 0, 0, 0]	[15, 14]	[0.483]	29	[8, 7]	[7, 7]	[49.925, 6.601e-05, 53.0]	48	...	A	ENSG00000099381	ENST00000262519	Transcript	missense_variant	2571	2345	782	G/D	gGc/gAc
261	chr16	[PASS]	[0, 0, 0, 0]	[31, 13]	[0.295]	44	[13, 10]	[18, 3]	[35.039, 0.001383, 53.001]	35	...	C	ENSG00000099364	ENST00000338343	Transcript	frameshift_variant	2297	1385	462	R/X	cGc/cc
262	chr16	[PASS]	[0, 0, 0, 0]	[15, 20]	[0.571]	35	[7, 11]	[8, 9]	[50.0, 7.4552e-05, 51.45]	48	...	T	ENSG00000175938	ENST00000318663	Transcript	missense_variant	934	712	238	H/Y	Cat/Tat
263	chr16	[PASS]	[0, 0, 0, 0]	[13, 12]	[0.48]	25	[9, 7]	[4, 5]	[49.913, 6.601e-05, 53.0]	48	...	T	ENSG00000175938	ENST00000318663	Transcript	missense_variant	934	712	238	H/Y	Cat/Tat
264	chr16	[PASS]	[0, 0, 0, 0]	[13, 22]	[0.629]	35	[9, 13]	[4, 9]	[50.0, 0.00011364, 47.909]	46	...	T	ENSG00000099381	ENST00000262519	Transcript	missense_variant	1047	821	274	T/M	aCg/aTg

265 rows × 52 columns
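An inner merge silently drops VCF rows that have no exact match in the annotation table. One possible cause is a multi-allelic record, whose comma-joined alt string does not equal any single annotated alt allele. The sketch below uses the indicator option of pandas merge on toy data to list such rows. The frames toy_vcf and toy_ann and all their values are made up for illustration.

```python
import pandas as pd

# Toy VCF-like frame; the third record is multi-allelic.
toy_vcf = pd.DataFrame(
    {
        "contig": ["chr16", "chr16", "chr16"],
        "pos_start": [100, 200, 300],
        "alleles": [["T", "TC"], ["G", "A"], ["C", "G", "T"]],
        "sample_name": ["S1", "S2", "S3"],
    }
)

# Toy annotation frame with one alt allele per row.
toy_ann = pd.DataFrame(
    {
        "contig": ["chr16", "chr16", "chr16"],
        "pos_start": [100, 200, 300],
        "ref": ["T", "G", "C"],
        "alt": ["TC", "A", "G"],
        "Consequence": [
            "frameshift_variant",
            "missense_variant",
            "missense_variant",
        ],
    }
)

# Same split as in the tutorial: first allele is ref, the rest are alt.
toy_vcf["ref"] = toy_vcf["alleles"].str[0]
toy_vcf["alt"] = toy_vcf["alleles"].apply(lambda x: ",".join(x[1:]))
toy_vcf = toy_vcf.drop("alleles", axis=1)

# A left merge with indicator=True keeps every VCF row and marks its origin.
merged = toy_vcf.merge(
    toy_ann,
    on=["contig", "pos_start", "ref", "alt"],
    how="left",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["contig", "pos_start", "ref", "alt", "sample_name"]])
```

Here the multi-allelic record (alt "G,T") is the only row flagged as left_only, which makes it easy to decide whether such records need to be split before joining.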

TileDB Cloud offers a convenience utility called annotate in the tiledb.cloud.vcf.vcf_toolbox package. It lets you perform joins like the one described above more easily and in a distributed fashion (similar to Tutorials: Scalable Queries). This approach works well with the transform function included in tiledb.cloud.vcf (see Tutorials: Query Transforms for more information on query transforms).

Here is the documentation of the underlying _annotate function:

Python

from tiledb.cloud.vcf.vcf_toolbox.annotate import _annotate

help(_annotate)

Help on function _annotate in module tiledb.cloud.vcf.vcf_toolbox.annotate:

_annotate(vcf_df: pandas.core.frame.DataFrame, *, ann_uri: str, ann_regions: Union[str, Sequence[str]], ann_attrs: Union[Sequence[str], str, NoneType] = None, vcf_filter: Optional[str] = None, split_multiallelic: bool = True, add_zygosity: bool = False, reorder: Optional[Sequence[str]] = None, rename: Optional[Mapping[str, str]] = None, verbose: bool = False) -> pandas.core.frame.DataFrame
    Annotate a VCF DataFrame with annotations from a TileDB array.
    
    :param vcf_df: VCF DataFrame to annotate
    :param ann_uri: URI of the annotation array
    :param ann_regions: regions to annotate. All regions must be in the same
        chromosome/contig.
    :param ann_attrs: annotation attributes to read,
        defaults to None which queries all attributes.
    :param vcf_filter: a pandas filter to apply to the VCF DataFrame before annotation,
        defaults to None
    :param split_multiallelic: split multiallelic variants into separate rows,
        defaults to True
    :param add_zygosity: add zygosity column to the DataFrame, defaults to False
    :param reorder: list of columns to reorder (before renaming), defaults to None
    :param rename: dict of columns to rename, defaults to None
    :param verbose: enable verbose logging, defaults to False
    :return: annotated VCF DataFrame
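To make the split_multiallelic option more concrete, the sketch below shows what splitting multi-allelic variants into separate rows can look like in plain pandas. This is an illustration of the idea only, under the assumption that each alt allele becomes its own row. It is not the library's implementation, and the toy frame and its values are made up.

```python
import pandas as pd

# Toy frame; the second record has two alt alleles.
toy = pd.DataFrame(
    {
        "contig": ["chr21", "chr21"],
        "pos_start": [1000, 2000],
        "alleles": [["G", "A"], ["C", "T", "G"]],
    }
)

# First allele is the reference; the remaining alleles are alternates.
toy["ref"] = toy["alleles"].apply(lambda a: a[0])
toy["alt"] = toy["alleles"].apply(lambda a: a[1:])

# explode emits one row per element of the alt list.
split = toy.drop(columns="alleles").explode("alt", ignore_index=True)
print(split)
```

After the split, each row carries exactly one ref and one alt allele, which is the shape needed for an exact join against a per-allele annotation table.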

Configure and run the VCF query with VEP annotations on a small region of NA12878.

Python

regions = "chr21:26973732-27213386"

# Run the VCF query with annotation
df = tiledb.cloud.vcf.read(
    dataset_uri=vcf_uri,
    regions=regions,
    samples="NA12878",
    transform_result=vtb.annotate(
        ann_uri=vep_uri,
        ann_regions=regions,
    ),
).to_pandas()
df

	sample_name	contig	pos_start	fmt_GT	ref	alt	Gene	Feature	Feature_type	Consequence	...	gnomADg_ASJ_AF	gnomADg_EAS_AF	gnomADg_FIN_AF	gnomADg_MID_AF	gnomADg_NFE_AF	gnomADg_OTH_AF	gnomADg_SAS_AF	CLIN_SIG	SOMATIC	PHENO
0	NA12878	chr21	26973860	[1, 1]	G	A	None	None	None	intergenic_variant	...	0.1866	0.008484	0.2310	0.1804	0.2395	0.1740	0.09950	None	None	None
1	NA12878	chr21	26974227	[1, 1]	G	C	None	None	None	intergenic_variant	...	0.6084	0.249900	0.5712	0.6487	0.5876	0.5941	0.38260	None	None	1
2	NA12878	chr21	26974527	[1, 1]	G	A	None	None	None	intergenic_variant	...	0.1869	0.008308	0.2312	0.1804	0.2397	0.1745	0.09921	None	None	None
3	NA12878	chr21	26976088	[1, 1]	CTATATA	C	None	None	None	intergenic_variant	...	0.2852	0.059410	0.3608	0.3654	0.3537	0.3103	0.16260	None	None	None
4	NA12878	chr21	26977141	[1, 1]	C	A	None	None	None	intergenic_variant	...	0.1872	0.008320	0.2307	0.1772	0.2401	0.1761	0.09959	None	None	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
361	NA12878	chr21	27207400	[1, 1]	C	T	None	None	None	intergenic_variant	...	0.8284	0.829600	0.8893	0.7922	0.8123	0.8211	0.86620	None	None	None
362	NA12878	chr21	27208103	[1, 1]	T	C	None	None	None	intergenic_variant	...	0.8468	0.829200	0.9266	0.8323	0.8473	0.8622	0.87770	None	None	None
363	NA12878	chr21	27209337	[0, 1]	T	G	None	None	None	intergenic_variant	...	0.3681	0.493500	0.4325	0.3013	0.3781	0.3706	0.29170	None	None	None
364	NA12878	chr21	27210578	[1, 1]	G	A	None	None	None	intergenic_variant	...	0.7875	0.805600	0.8903	0.7468	0.8127	0.7874	0.79300	None	None	None
365	NA12878	chr21	27211633	[1, 1]	G	C	None	None	None	intergenic_variant	...	0.8847	0.406600	0.8037	0.8481	0.8779	0.8270	0.80540	None	None	None

366 rows × 47 columns
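Once the annotated data frame is in memory, ordinary pandas operations apply. The sketch below filters on an allele-frequency column and counts consequences. It runs on a toy frame that reuses the column names Consequence and gnomADg_NFE_AF from the output above. The rows, the placeholder positions, and the 0.25 threshold are arbitrary assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the annotated data frame; the values are made up.
toy_annotated = pd.DataFrame(
    {
        "sample_name": ["NA12878"] * 4,
        "pos_start": [1, 2, 3, 4],  # placeholder positions
        "Consequence": [
            "intergenic_variant",
            "intergenic_variant",
            "missense_variant",
            "intron_variant",
        ],
        "gnomADg_NFE_AF": [0.2395, 0.5876, None, 0.0100],
    }
)

# Coerce to numeric so that missing frequencies become NaN.
af = pd.to_numeric(toy_annotated["gnomADg_NFE_AF"], errors="coerce")

# NaN compares as False, so rows without a frequency are excluded.
below_threshold = toy_annotated[af < 0.25]

# Count how often each consequence occurs among the remaining rows.
counts = below_threshold["Consequence"].value_counts()
print(counts)
```

Rows with a missing frequency are dropped by this comparison. If such variants should be kept, combine the filter with af.isna().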