Annotation VCFs

Learn about querying sampleless variant-only TileDB-VCF datasets.

How to run this tutorial

You can run this tutorial only on TileDB Cloud. TileDB Cloud has a free tier, so you can sign up and run everything there without installing or deploying anything.

VCF can serve as a delivery medium for variant annotations, even if no samples are present. TileDB-VCF can ingest these sampleless, variant-only VCFs. This tutorial shows how to extract this annotation information from VCF files and ingest it into separate TileDB arrays on TileDB Cloud, which can then be combined to generate annotations for other VCF datasets.
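
To make "sampleless" concrete, here is a minimal sketch that parses one variant-only VCF record with the Python standard library. Such a record has only the eight fixed VCF columns (CHROM through INFO), with no FORMAT or per-sample columns. The position, alleles, and AF value match the first gnomAD row shown later in this tutorial; the ID, QUAL, and FILTER values are placeholder assumptions, and `parse_record` is a hypothetical helper, not part of TileDB-VCF.

```python
# Illustrative only: one sampleless VCF data line (tab-separated).
# ID ".", QUAL ".", and FILTER "PASS" are placeholder assumptions.
record = "chr19\t44905804\t.\tC\tT\t.\tPASS\tAF=6.56866e-06"

FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line):
    """Split a variant-only VCF line into its eight fixed columns (hypothetical helper)."""
    fields = dict(zip(FIXED_COLUMNS, line.rstrip("\n").split("\t")))
    fields["POS"] = int(fields["POS"])
    # ALT may list several comma-separated alleles.
    fields["ALT"] = fields["ALT"].split(",")
    # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags).
    info = {}
    for item in fields["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    fields["INFO"] = info
    return fields


parsed = parse_record(record)
# AF carries one value per ALT allele, so split it the same way as ALT.
allele_freqs = [float(v) for v in parsed["INFO"]["AF"].split(",")]
print(parsed["CHROM"], parsed["POS"], parsed["REF"], parsed["ALT"], allele_freqs)
```

Because there are no sample columns, everything the record says is about the variant itself, which is why such files work well as annotation sources.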

Import the necessary libraries and set the URIs used in this tutorial. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (you can omit this step inside a TileDB Cloud notebook).

Python

import os

import numpy as np
import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledb.cloud.vcf.vcf_toolbox as vtb
import tiledbvcf

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]


# Public URI datasets to be used in this tutorial
gnomad_uri = "tiledb://TileDB-Inc/gnomad-4_0-include-nopass"
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)

You will use gnomAD, which is distributed as sampleless, variant-only VCFs and can be used to annotate variant datasets with allele frequencies.

Start by specifying the gnomAD dataset and region of interest. Then, query the gnomAD dataset, and select the allele frequency INFO column.

Python

regions = ["chr19:44905804-44909392"]
gnomad_arrow = tiledb.cloud.vcf.read(
    dataset_uri=gnomad_uri,
    regions=regions,
    samples="",
    attrs=["contig", "pos_start", "alleles", "info_AF"],
)
gnomad_df = gnomad_arrow.to_pandas()
gnomad_df.head()

	contig	pos_start	alleles	info_AF
0	chr19	44905804	[C, T]	[6.56866e-06]
1	chr19	44905805	[C, T]	[6.57531e-06]
2	chr19	44905808	[A, T]	[0.0]
3	chr19	44905810	[C, T]	[6.57609e-06]
4	chr19	44905813	[A, G]	[6.58077e-06]
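
The region string in the query above packs a contig and a coordinate range into one value. As a rough sketch of how to read it, the snippet below splits the string and computes how many bases it spans. It assumes both ends of the range are inclusive, which is consistent with the first and last positions in the gnomAD results; `parse_region` is a hypothetical helper, not a TileDB-VCF function.

```python
def parse_region(region):
    """Split a 'contig:start-end' string into its parts (hypothetical helper)."""
    contig, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end)


contig, start, end = parse_region("chr19:44905804-44909392")
# Assuming both ends are inclusive, the region covers end - start + 1 bases.
n_bases = end - start + 1
print(contig, start, end, n_bases)
```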

Clean the gnomAD result DataFrame. Split the alleles into ref and alt columns, explode the alt and info_AF lists so that each alt allele gets its own row with its own frequency, and cast the columns to the correct data types.

Python

gnomad_df = (
    gnomad_df.assign(
        # The first allele is the reference; the rest are alternates
        ref=lambda df: df["alleles"].apply(lambda alleles: alleles[0]),
        alt=lambda df: df["alleles"].apply(lambda alleles: alleles[1:]),
    )
    .dropna(subset=["alt"])
    # One row per alt allele, paired with its allele frequency
    .explode(["alt", "info_AF"])
    .assign(
        ref=lambda df: df["ref"].astype(str),
        alt=lambda df: df["alt"].apply(str),
        info_AF=lambda df: df["info_AF"].astype(np.float32),
    )
    .drop(columns=["alleles"])
    # Index by contig and position
    .set_index(["contig", "pos_start"])
)
gnomad_df

		info_AF	ref	alt
contig	pos_start
chr19	44905804	0.000007	C	T
	44905805	0.000007	C	T
	44905808	0.000000	A	T
	44905810	0.000007	C	T
	44905813	0.000007	A	G
	...	...	...	...
	44909370	0.000007	TAAAGATTCACC	T
	44909377	0.000007	T	A
	44909384	0.000007	G	A
	44909391	0.000007	G	A
	44909392	0.000007	C	T

907 rows × 3 columns
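
The cleaning step matters most at multi-allelic sites, where one record lists several alt alleles and one allele frequency per alt allele. The sketch below reproduces that pairing in plain Python, without pandas, to show what the split and explode steps do to a single record. The site and its values are made up for illustration and do not come from gnomAD.

```python
# Hypothetical multi-allelic site: one ref allele, two alt alleles,
# and one allele frequency per alt allele. Values are made up.
row = {
    "contig": "chr19",
    "pos_start": 100,
    "alleles": ["C", "T", "G"],
    "info_AF": [0.25, 0.5],
}

# The first allele is the reference; the rest are alternates.
ref = row["alleles"][0]
alts = row["alleles"][1:]

# Pair each alt allele with its own frequency, one output row per pair.
exploded = [
    {
        "contig": row["contig"],
        "pos_start": row["pos_start"],
        "ref": ref,
        "alt": alt,
        "info_AF": af,
    }
    for alt, af in zip(alts, row["info_AF"])
]
for r in exploded:
    print(r)
```

Each resulting row describes exactly one ref/alt pair, so a variant can later be matched by position and alleles.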

Create a TileDB array for the gnomAD results.

Python

# Create array URI
user_profile = tiledb.cloud.user_profile()
username = user_profile.username
s3_path = user_profile.default_s3_path.rstrip("/")
array_uri = f"tiledb://{username}/{s3_path}/gnomad_apoe"

# Ingest dataframe into array. This will also register the array with TileDB Cloud
tiledb.from_pandas(
    dataframe=gnomad_df,
    uri=array_uri,
    column_types={
        "contig": "str",
        "pos_start": "int32",
        "ref": "str",
        "alt": "str",
        "info_AF": np.float32,
    },
)

Now that you have created the array, you can inspect it.

Python

with tiledb.open(array_uri, ctx=tiledb.cloud.Ctx()) as A:
    df = A.df[:]
df

		info_AF	ref	alt
contig	pos_start
chr19	44905804	0.000007	C	T
	44905805	0.000007	C	T
	44905808	0.000000	A	T
	44905810	0.000007	C	T
	44905813	0.000007	A	G
	...	...	...	...
	44909370	0.000007	TAAAGATTCACC	T
	44909377	0.000007	T	A
	44909384	0.000007	G	A
	44909391	0.000007	G	A
	44909392	0.000007	C	T

907 rows × 3 columns

Perform a distributed query on TileDB Cloud and annotate the results using the newly created gnomAD array. In the output, the info_AF column holds the global allele frequency of each variant.

Python

# Get the first 100 samples
vcf_ds = tiledbvcf.Dataset(vcf_uri, tiledb_config=tiledb.cloud.Config())
samples = vcf_ds.samples()[:100]

# Perform the query
df = tiledb.cloud.vcf.read(
    dataset_uri=vcf_uri,
    regions=regions,
    samples=samples,
    transform_result=vtb.annotate(ann_uri=array_uri, ann_regions=regions),
).to_pandas()
df

	sample_name	contig	pos_start	fmt_GT	ref	alt	info_AF
0	HG00096	chr19	44905910	[1, 1]	C	G	0.688353
1	HG00097	chr19	44905910	[1, 1]	C	G	0.688353
2	HG00101	chr19	44905910	[1, 1]	C	G	0.688353
3	HG00102	chr19	44905910	[0, 1]	C	G	0.688353
4	HG00103	chr19	44905910	[1, 1]	C	G	0.688353
...	...	...	...	...	...	...	...
43	HG00262	chr19	44908684	[0, 1]	T	C	0.157359
44	HG00239	chr19	44908822	[0, 1]	C	T	0.077837
45	HG00242	chr19	44908822	[0, 1]	C	T	0.077837
46	HG00254	chr19	44908822	[0, 1]	C	T	0.077837
47	HG00251	chr19	44908947	[0, 1]	C	T	0.000888

236 rows × 7 columns
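
Conceptually, annotation amounts to looking up each sample-level variant in the annotation table. The sketch below illustrates that idea in plain Python, assuming variants are matched on contig, position, ref, and alt. It is not the implementation of `vtb.annotate`, whose internals this tutorial does not describe. The values are copied from the output above.

```python
# Annotation table keyed by (contig, pos_start, ref, alt).
# The two entries reuse values from the output above.
annotations = {
    ("chr19", 44905910, "C", "G"): 0.688353,
    ("chr19", 44908684, "T", "C"): 0.157359,
}

# Sample-level variant calls, also taken from the output above.
calls = [
    {"sample_name": "HG00096", "contig": "chr19", "pos_start": 44905910, "ref": "C", "alt": "G"},
    {"sample_name": "HG00262", "contig": "chr19", "pos_start": 44908684, "ref": "T", "alt": "C"},
]

annotated = []
for call in calls:
    key = (call["contig"], call["pos_start"], call["ref"], call["alt"])
    # dict.get returns None when a variant has no annotation entry.
    annotated.append({**call, "info_AF": annotations.get(key)})

for row in annotated:
    print(row)
```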