import os
import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledb.cloud.vcf.vcf_toolbox as vtb
import tiledbvcf
import numpy as np
# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]
# Public URI datasets to be used in this tutorial
gnomad_uri = "tiledb://TileDB-Inc/gnomad-4_0-include-nopass"
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"
# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
Annotation VCFs
This tutorial runs only on TileDB Cloud, which offers a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
VCF can serve as a delivery medium for variant annotations, even if no samples are present. TileDB-VCF can ingest these sampleless, variant-only VCFs. This tutorial shows how to extract this annotation information from VCF files and ingest it into separate TileDB arrays on TileDB Cloud, which can then be combined to generate annotations for other VCF datasets.
Import the necessary libraries and set the URIs that will be used in this tutorial. If you are running this from a local notebook, visit Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (you can omit this step inside a TileDB Cloud notebook).
You will use gnomAD, which is distributed as a variant-only VCF and can be used to annotate variant datasets with allele frequencies.
Start by specifying the gnomAD dataset and region of interest. Then, query the gnomAD dataset, and select the allele frequency INFO column.
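A minimal sketch of this query step might look like the following. The helper names `format_region` and `query_allele_frequency` are hypothetical, the region in the usage comment is a placeholder, and the attribute name `info_AF` assumes the allele frequency is stored under the INFO field `AF`. The sketch also assumes that TileDB Cloud credentials are already available to the session (for example, inside a TileDB Cloud notebook).

```python
def format_region(contig: str, start: int, end: int) -> str:
    """Build a region string of the form 'contig:start-end' (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{contig}:{start}-{end}"


def query_allele_frequency(dataset_uri: str, region: str):
    """Read position, alleles, and allele frequency for one region.

    Assumption: the dataset exposes the INFO field AF as 'info_AF'.
    """
    # Imported lazily so that format_region can be used without TileDB-VCF installed.
    import tiledbvcf

    ds = tiledbvcf.Dataset(dataset_uri, mode="r")
    return ds.read(
        attrs=["contig", "pos_start", "alleles", "info_AF"],
        regions=[region],
    )


# Hypothetical usage (placeholder region, requires TileDB Cloud access):
# gnomad_df = query_allele_frequency(gnomad_uri, format_region("chr21", 8220186, 8405573))
```

Keeping the region construction separate from the read makes it easy to validate coordinates before issuing a remote query.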
Clean the gnomAD result DataFrame. Split the alleles into ref and alt columns, and cast the columns to the correct data types.
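The cleaning step can be sketched with pandas on a small, made-up DataFrame shaped like a query result. The sample rows, the function name `clean_annotation_frame`, the output column name `af`, and the chosen dtypes are all assumptions for illustration. The sketch also assumes biallelic records, where each `alleles` entry holds exactly one reference and one alternate allele.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: 'alleles' holds [ref, alt]; 'info_AF' holds a one-element array.
raw_df = pd.DataFrame(
    {
        "contig": ["chr1", "chr1", "chr1"],
        "pos_start": [1001, 1050, 1100],
        "alleles": [["A", "G"], ["C", "T"], ["G", "GA"]],
        "info_AF": [
            np.array([0.25], dtype=np.float32),
            np.array([0.5], dtype=np.float32),
            np.array([0.125], dtype=np.float32),
        ],
    }
)


def clean_annotation_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Split alleles into ref/alt, flatten the AF value, and cast numeric columns."""
    out = df.copy()
    out["ref"] = out["alleles"].map(lambda a: a[0])  # first allele is the reference
    out["alt"] = out["alleles"].map(lambda a: a[1])  # second allele is the alternate
    out["af"] = out["info_AF"].map(lambda v: float(v[0]))  # unwrap the single AF value
    out = out.drop(columns=["alleles", "info_AF"])
    return out.astype({"pos_start": np.int32, "af": np.float32})


clean_df = clean_annotation_frame(raw_df)
```

Multi-allelic records would need to be split into one row per alternate allele before applying this function.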
Create a TileDB array for the gnomAD results.
Now that you have created the array, you can inspect it.
Perform a TileDB Cloud distributed query and annotate the results using the newly created gnomAD array. Note that the variants are now annotated with their global allele frequency.
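Conceptually, annotation matches each sample variant to an annotation record with the same contig, position, reference allele, and alternate allele. The following is a local pandas illustration of that join on made-up data, not the TileDB Cloud distributed query itself. All rows and the column names `sample_name`, `ref`, `alt`, and `af` are assumptions.

```python
import pandas as pd

# Hypothetical sample variants.
variants = pd.DataFrame(
    {
        "sample_name": ["S1", "S1", "S2"],
        "contig": ["chr1", "chr1", "chr1"],
        "pos_start": [1001, 1100, 2000],
        "ref": ["A", "G", "T"],
        "alt": ["G", "GA", "C"],
    }
)

# Hypothetical annotation table carrying allele frequencies.
annotations = pd.DataFrame(
    {
        "contig": ["chr1", "chr1"],
        "pos_start": [1001, 1100],
        "ref": ["A", "G"],
        "alt": ["G", "GA"],
        "af": [0.25, 0.125],
    }
)

# A left join keeps every sample variant; variants without a matching
# annotation record get a missing value (NaN) in the 'af' column.
annotated = variants.merge(
    annotations, on=["contig", "pos_start", "ref", "alt"], how="left"
)
```

Using a left join rather than an inner join preserves variants that are absent from the annotation source, which is usually the desired behavior when the annotation is an added column rather than a filter.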