Variant Statistics Tutorial

Learn about using allele frequency and sample quality control metrics in TileDB-VCF.

How to run this tutorial

This tutorial runs only on TileDB Cloud, which has a free tier. We strongly recommend that you sign up and run everything there, since that requires no installation or deployment.

TileDB-VCF lets you query variant statistics. These statistics are either generated and stored inside the TileDB-VCF dataset alongside the variant data, in the separate allele_count and variant_stats auxiliary arrays, or computed and returned at query time. For more information on variant statistics, visit the Key Concepts: Variants Statistics section.

In this tutorial, you will use the public tiledb://TileDB-Inc/vcf-1kg-dragen-v376 dataset, which you can locate on the TileDB Cloud Marketplace.

Start by setting up your TileDB Cloud credentials in a config object. Note that you can skip this step if you are running the tutorial inside a TileDB Cloud notebook.

Python

import os

import tiledb

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.

tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Store the TileDB Cloud credentials in a config object,
# then create a context from that config.
cfg = tiledb.Config(
    {
        "rest.token": tiledb_token,
        # or use your username and password (not recommended)
        # "rest.username": tiledb_username,
        # "rest.password": tiledb_password,
    }
)
ctx = tiledb.Ctx(cfg)
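
The lookup above fails with a bare KeyError when TILEDB_REST_TOKEN is not set. As a minimal sketch, assuming you want a clearer error and a username/password fallback, the credentials could be collected with a small helper first. The name rest_config_from_env is hypothetical and not part of TileDB-Py; only the environment variable names and config keys come from the cell above.

```python
import os
from typing import Mapping, Optional


def rest_config_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect TileDB Cloud credentials from environment variables.

    Hypothetical helper for illustration only. It returns a plain dict
    holding the same "rest.*" keys used in the tutorial's config.
    """
    env = os.environ if env is None else env

    # Prefer the API token, as the tutorial recommends.
    token = env.get("TILEDB_REST_TOKEN")
    if token:
        return {"rest.token": token}

    # Fall back to username and password (not recommended).
    username = env.get("TILEDB_USERNAME")
    password = env.get("TILEDB_PASSWORD")
    if username and password:
        return {"rest.username": username, "rest.password": password}

    raise RuntimeError(
        "Set TILEDB_REST_TOKEN (preferred) or both TILEDB_USERNAME and TILEDB_PASSWORD."
    )
```

The returned dict can then be passed to tiledb.Config in place of the literal dict shown above.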

Next, import the necessary libraries, and set the VCF dataset URI.

Python

import tiledb.cloud
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))
print("TileDB-Cloud-Py version: {}".format(tiledb.cloud.version.version))

# Set the VCF dataset URI
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.3
TileDB-Cloud-Py version: 0.12.18

Show the various TileDB objects stored inside the TileDB-VCF dataset, which the rest of the tutorial will be accessing.

Python

# Show the groups and arrays inside the TileDB-VCF dataset
ds = tiledbvcf.Dataset(vcf_uri, mode="r", tiledb_config=cfg)
ds_grp = tiledb.Group(ds.uri, "r", ctx=ctx)
for i in range(len(ds_grp)):
    print(f"URI: {ds_grp[i].uri}, Type: {ds_grp[i].type}, Name: {ds_grp[i].name}")

URI: tiledb://TileDB-Inc/03208842-61d1-4aa2-9ee3-4e5b598090a2, Type: <class 'tiledb.libtiledb.Array'>, Name: vcf_headers
URI: tiledb://TileDB-Inc/b9b2ecaf-123b-4907-96ec-9b3c496279d1, Type: <class 'tiledb.libtiledb.Array'>, Name: data
URI: tiledb://TileDB-Inc/a951e969-59a3-4651-990f-76ca4a132709, Type: <class 'tiledb.libtiledb.Array'>, Name: allele_count
URI: tiledb://TileDB-Inc/6e6f9723-16f4-42eb-9ead-5d2bc6fba7cb, Type: <class 'tiledb.libtiledb.Array'>, Name: variant_stats
URI: tiledb://TileDB-Inc/e779f911-2d93-4ae9-8053-43eb63eccc94, Type: <class 'tiledb.libtiledb.Array'>, Name: phenotypes
URI: tiledb://TileDB-Inc/64083f91-92d2-4b24-9e73-7d7fd11cc7bf, Type: <class 'tiledb.libtiledb.Array'>, Name: hpoterms
URI: tiledb://TileDB-Inc/5ed2b89f-b454-4b0d-b123-0ed76cfda418, Type: <class 'tiledb.libtiledb.Array'>, Name: log
URI: tiledb://TileDB-Inc/a7027688-c2d9-489b-8f04-d07f31609755, Type: <class 'tiledb.libtiledb.Array'>, Name: manifest

Open the VCF dataset in read mode to prepare it for reading.

Python

# Open the dataset in read mode
ds = tiledbvcf.Dataset(vcf_uri, mode="r", tiledb_config=cfg)

Read a genomic region (the BTD gene) and specify the attributes to extract. To retrieve the internal allele frequency (IAF) for each variant in the result set, include info_TILEDB_IAF in the attribute list. Note that the values for this attribute are calculated at query time.

Python

# Include info_TILEDB_IAF in the list of attributes to read
attrs = [
    "sample_name",
    "contig",
    "pos_start",
    "pos_end",
    "alleles",
    "fmt_GT",
    "info_TILEDB_IAF",
]

# Read from the dataset
df = ds.read(
    attrs=attrs,
    regions=["chr3:15601341-15722311"],
)
df

	sample_name	contig	pos_start	pos_end	alleles	fmt_GT	info_TILEDB_IAF
0	NA21143	chr3	15601536	15601536	[A, G]	[1, 1]	[0.027682202, 0.9723178]
1	NA21144	chr3	15601536	15601536	[A, G]	[1, 1]	[0.027682202, 0.9723178]
2	NA21144	chr3	15601668	15601668	[G, A]	[0, 1]	[0.43612567, 0.56387436]
3	NA21143	chr3	15601866	15601866	[A, G]	[0, 1]	[0.5, 0.5]
4	NA21144	chr3	15602568	15602568	[A, G]	[0, 1]	[0.42662117, 0.57337886]
...	...	...	...	...	...	...	...
606391	HG03021	chr3	15722066	15722066	[A, G]	[0, 1]	[0.4359155, 0.56408453]
606392	HG03034	chr3	15722277	15722277	[T, C]	[0, 1]	[0.5, 0.5]
606393	HG03035	chr3	15722277	15722277	[T, C]	[0, 1]	[0.5, 0.5]
606394	HG03091	chr3	15722277	15722277	[T, C]	[0, 1]	[0.5, 0.5]
606395	HG03091	chr3	15722283	15722283	[C, T]	[0, 1]	[0.5, 0.5]

606396 rows × 7 columns
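
To read a single row of this output, note that fmt_GT holds indices into alleles (0 is the reference allele). The sketch below decodes one row under the assumption, suggested by the output but not stated in this tutorial, that info_TILEDB_IAF is ordered like alleles. The helper name describe_call is hypothetical.

```python
from typing import Dict, List, Tuple


def describe_call(
    alleles: List[str], gt: List[int], iaf: List[float]
) -> Tuple[str, Dict[str, float]]:
    """Return the genotype as allele strings and a per-allele frequency map.

    Assumes iaf[i] is the internal allele frequency of alleles[i].
    Negative genotype indices are rendered as "." (missing call).
    """
    genotype = "/".join(alleles[i] if i >= 0 else "." for i in gt)
    return genotype, dict(zip(alleles, iaf))


# First row of the output above: NA21143 at chr3:15601536
genotype, freqs = describe_call(["A", "G"], [1, 1], [0.027682202, 0.9723178])
print(genotype)  # G/G
print(freqs)     # {'A': 0.027682202, 'G': 0.9723178}
```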

To filter on allele frequency, pass a threshold on the internal allele frequency to the query through the set_af_filter argument.

Python

# Query setting an AF filter
df = ds.read(
    attrs=attrs,
    regions=["chr3:15601341-15722311"],
    set_af_filter="<0.5",
)
df

	sample_name	contig	pos_start	pos_end	alleles	fmt_GT	info_TILEDB_IAF
0	NA21144	chr3	15601668	15601668	[G, A]	[0, 1]	[0.43612567, 0.56387436]
1	NA21144	chr3	15602568	15602568	[A, G]	[0, 1]	[0.42662117, 0.57337886]
2	NA21144	chr3	15602688	15602688	[G, A]	[0, 1]	[0.47108433, 0.52891564]
3	NA21143	chr3	15603161	15603161	[A, G]	[0, 1]	[0.44444445, 0.5555556]
4	NA21143	chr3	15603733	15603733	[C, T]	[0, 1]	[0.4846698, 0.5153302]
...	...	...	...	...	...	...	...
372319	HG03162	chr3	15722024	15722024	[C, T]	[0, 1]	[0.48473284, 0.5152672]
372320	HG03164	chr3	15722024	15722024	[C, T]	[0, 1]	[0.48473284, 0.5152672]
372321	HG03054	chr3	15722047	15722047	[G, GT]	[0, 1]	[0.4, 0.5]
372322	HG03055	chr3	15722047	15722047	[G, GT]	[0, 1]	[0.4, 0.5]
372323	HG03021	chr3	15722066	15722066	[A, G]	[0, 1]	[0.4359155, 0.56408453]

372324 rows × 7 columns
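
The rows returned here are consistent with a rule that keeps a record when at least one allele named in fmt_GT has an internal allele frequency below the threshold. That reading is inferred from the two outputs shown in this tutorial, not taken from the TileDB-VCF documentation, so treat the following as a local illustration only. The helper name passes_af_threshold is hypothetical.

```python
from typing import List


def passes_af_threshold(gt: List[int], iaf: List[float], threshold: float) -> bool:
    """Assumed rule: keep the record if any called allele has IAF < threshold."""
    return any(iaf[i] < threshold for i in gt if 0 <= i < len(iaf))


# (sample_name, pos_start, fmt_GT, info_TILEDB_IAF), copied from the
# first rows of the unfiltered output
rows = [
    ("NA21143", 15601536, [1, 1], [0.027682202, 0.9723178]),
    ("NA21144", 15601668, [0, 1], [0.43612567, 0.56387436]),
    ("NA21143", 15601866, [0, 1], [0.5, 0.5]),
    ("NA21144", 15602568, [0, 1], [0.42662117, 0.57337886]),
]

kept = [(name, pos) for name, pos, gt, iaf in rows if passes_af_threshold(gt, iaf, 0.5)]
print(kept)  # [('NA21144', 15601668), ('NA21144', 15602568)]
```

The two records kept are the first two rows of the filtered output above.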

Use the read_allele_count method to interrogate the allele_count array directly at a specific position:

Warning

allele_count uses 0-based indexing.

Python

# Read allele count information
ac = ds.read_allele_count("chr22:50808372-50808373").to_pandas()
ac

	pos	ref	alt	filter	gt	count
0	50808372	A	AG	DRAGENHardQUAL	0,1	1
1	50808372	A	AG	DRAGENHardQUAL	1,1	15
2	50808372	A	AG	DRAGENHardQUAL;LowDepth	1,1	3
3	50808372	A	AG	LowDepth	0,1	2
4	50808372	A	AG	LowDepth	1,1	9
5	50808372	A	AG	PASS	0,1	9
6	50808372	A	AG	PASS	1,1	161
7	50808372	A	AGG	PASS	1,1	3
8	50808372	A	AT	LowDepth	1,1	2
9	50808372	A	AT	PASS	0,1	2
10	50808372	A	AT	PASS	1,1	1
11	50808372	A	ATG	PASS	1,1	1
12	50808372	A	G	DRAGENHardQUAL	1,1	1
13	50808372	A	G	PASS	1,1	5
14	50808372	A	T	PASS	1,1	2

Note that the same allele and genotype appear in several rows, each with its own count (the rows shown differ in their filter value). This is an artifact of the progressive ingestion process. Sum the counts per allele and genotype to get a more accurate representation of the allele count:

Python

ac.groupby(["pos", "ref", "alt", "gt"]).agg({"count": "sum"})

				count
pos	ref	alt	gt
50808372	A	AG	0,1	12
		AG	1,1	188
		AGG	1,1	3
		AT	0,1	2
		AT	1,1	3
		ATG	1,1	1
		G	1,1	6
		T	1,1	2
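
For readers who want to check the aggregation without access to the dataset, the same sums can be reproduced with the standard library. This is a sketch over the rows printed by read_allele_count above, not a replacement for the pandas call.

```python
from collections import defaultdict

# (alt, filter, gt, count) rows copied from the read_allele_count output;
# pos (50808372) and ref ("A") are identical in every row.
rows = [
    ("AG", "DRAGENHardQUAL", "0,1", 1),
    ("AG", "DRAGENHardQUAL", "1,1", 15),
    ("AG", "DRAGENHardQUAL;LowDepth", "1,1", 3),
    ("AG", "LowDepth", "0,1", 2),
    ("AG", "LowDepth", "1,1", 9),
    ("AG", "PASS", "0,1", 9),
    ("AG", "PASS", "1,1", 161),
    ("AGG", "PASS", "1,1", 3),
    ("AT", "LowDepth", "1,1", 2),
    ("AT", "PASS", "0,1", 2),
    ("AT", "PASS", "1,1", 1),
    ("ATG", "PASS", "1,1", 1),
    ("G", "DRAGENHardQUAL", "1,1", 1),
    ("G", "PASS", "1,1", 5),
    ("T", "PASS", "1,1", 2),
]

# Sum the counts per (alt, gt), ignoring the filter column.
totals = defaultdict(int)
for alt, _filter, gt, count in rows:
    totals[(alt, gt)] += count

for key in sorted(totals):
    print(key, totals[key])
```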

Use the read_variant_stats method to interrogate the variant_stats array:

Python

ds.read_variant_stats("chr22:50808372-50808373").to_pandas()

	pos	alleles	ac	an	af
0	50808372	A,ATG	2	434	0.004608
1	50808372	A,T	4	434	0.009217
2	50808372	A,AGG	6	434	0.013825
3	50808372	ref	14	434	0.032258
4	50808372	A,G	12	434	0.027650
5	50808372	A,AT	8	434	0.018433
6	50808372	A,AG	388	434	0.894009

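The 188 homozygous and 12 heterozygous A->AG calls in allele_count account for the 388 A,AG alleles in variant_stats: each homozygous call contributes two copies of the allele and each heterozygous call contributes one, so 2 × 188 + 12 = 388.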
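
The same bookkeeping can be checked for every allele at this position. The sketch below starts from the aggregated genotype counts and recomputes ac, an, and af. It assumes diploid genotypes as shown, that index 0 in gt is the reference allele, and that an is the total number of alleles observed at the position.

```python
from collections import Counter

# Aggregated genotype counts at chr22 position 50808372, keyed by (alt, gt)
genotype_counts = {
    ("AG", "0,1"): 12,
    ("AG", "1,1"): 188,
    ("AGG", "1,1"): 3,
    ("AT", "0,1"): 2,
    ("AT", "1,1"): 3,
    ("ATG", "1,1"): 1,
    ("G", "1,1"): 6,
    ("T", "1,1"): 2,
}

# Each genotype index contributes one allele observation per call:
# "0" is the reference allele, any other index is the alt of that row.
ac = Counter()
for (alt, gt), n in genotype_counts.items():
    for index in gt.split(","):
        ac["ref" if index == "0" else alt] += n

an = sum(ac.values())
for allele, count in ac.most_common():
    print(f"{allele}\t{count}\t{an}\t{count / an:.6f}")
```

Under these assumptions the recomputed values agree with the variant_stats output above, for example ac = 388 and an = 434 for A,AG, and ac = 14 for ref.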