Learn about using allele frequency and sample quality control metrics in TileDB-VCF.
How to run this tutorial
You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
TileDB-VCF allows you to query variant statistics, which are either generated and stored inside the TileDB-VCF along with the variant data as the separate allele_count and variant_stats auxiliary tables, or computed and returned at query time. For more information on variant statistics, visit the Key Concepts: Variants Statistics section.
In this tutorial, you will use the public tiledb://TileDB-Inc/vcf-1kg-dragen-v376 dataset, which you can locate on the TileDB Cloud Marketplace.
Start by setting up your TileDB Cloud credentials in a config object. Note that you can skip this step if you are running the tutorial inside a TileDB Cloud notebook.
import osimport tiledb# You should set the appropriate environment variables with your keys.# Get the keys from the environment variables.tiledb_token = os.environ["TILEDB_REST_TOKEN"]# or use your username and password (not recommended)# tiledb_username = os.environ["TILEDB_USERNAME"]# tiledb_password = os.environ["TILEDB_PASSWORD"]# Set the AWS keys and region to the config of the default context# This context initialization can be performed only once.cfg = tiledb.Config( {"rest.token": tiledb_token,# or use your username and password (not recommended)# "rest.username": tiledb_username,# "rest.password": tiledb_password, })ctx = tiledb.Ctx(cfg)
Next, import the necessary libraries, and set the VCF dataset URI.
# Show the groups and arrays inside the TileDB-VCF datasetds = tiledbvcf.Dataset(vcf_uri, mode="r", tiledb_config=cfg)ds_grp = tiledb.Group(ds.uri, "r", ctx=ctx)for i inrange(len(ds_grp)):print(f"URI: {ds_grp[i].uri}, Type: {ds_grp[i].type}, Name: {ds_grp[i].name}")
# Open the dataset in read modeds = tiledbvcf.Dataset(vcf_uri, mode="r", tiledb_config=cfg)
Perform a read using a genomic region (the BTD gene) and setting the attributes to extract. To retrieve the internal allele frequency calculation for any variant in the result set, you need to specify info_TILEDB_IAF in the attributes list to retrieve. Note that the values for this attribute are calculated at query time.
# Set info_TILEDB_IAF to the attributes argumentattrs = ["sample_name","contig","pos_start","pos_end","alleles","fmt_GT","info_TILEDB_IAF",]# Read from the datasetdf = ds.read( attrs=attrs, regions=["chr3:15601341-15722311"],)df
sample_name
contig
pos_start
pos_end
alleles
fmt_GT
info_TILEDB_IAF
0
NA21143
chr3
15601536
15601536
[A, G]
[1, 1]
[0.027682202, 0.9723178]
1
NA21144
chr3
15601536
15601536
[A, G]
[1, 1]
[0.027682202, 0.9723178]
2
NA21144
chr3
15601668
15601668
[G, A]
[0, 1]
[0.43612567, 0.56387436]
3
NA21143
chr3
15601866
15601866
[A, G]
[0, 1]
[0.5, 0.5]
4
NA21144
chr3
15602568
15602568
[A, G]
[0, 1]
[0.42662117, 0.57337886]
...
...
...
...
...
...
...
...
606391
HG03021
chr3
15722066
15722066
[A, G]
[0, 1]
[0.4359155, 0.56408453]
606392
HG03034
chr3
15722277
15722277
[T, C]
[0, 1]
[0.5, 0.5]
606393
HG03035
chr3
15722277
15722277
[T, C]
[0, 1]
[0.5, 0.5]
606394
HG03091
chr3
15722277
15722277
[T, C]
[0, 1]
[0.5, 0.5]
606395
HG03091
chr3
15722283
15722283
[C, T]
[0, 1]
[0.5, 0.5]
606396 rows × 7 columns
Filtering on allele frequency can be done by setting a threshold for the internal allele frequency to the query.
Note how many of the same alleles are repeated, despite each presenting a count value. This is an artifact of the progressive ingestion process. Aggregate these counts to get a more accurate representation of the allele count: