import os
import tiledb
import tiledb.cloud
import tiledb.cloud.sql
import tiledb.cloud.vcf
from tiledb.cloud.compute import DelayedSQL
# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]
# VCF data from the DRAGEN 1KG samples
vcf_array = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"
# Phenotypes for the 1KG samples
sample_array = "tiledb://TileDB-Inc/vcf-1kg-sample-metadata"
# Ensembl gene/exon annotation
gene_array = "tiledb://TileDB-Inc/ensemblgene_sparse"
exon_array = "tiledb://TileDB-Inc/ensemblexon_sparse"
# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
Tables and SQL
This tutorial requires TileDB Cloud. TileDB Cloud has a free tier, and we strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
Along with the TileDB-VCF datasets (which aren’t exactly tabular), TileDB creates other important, associated tabular datasets, such as sample metadata, variant annotation, and gene models stored as generic TileDB arrays. You can access this data in a variety of ways, including the raw array programmatic APIs, but also Python pandas-like dataframe operators and, most importantly, SQL.
Although you can run this tutorial locally from your laptop, you need to have a TileDB Cloud account, as the majority of the operations leverage functionality provided only by TileDB Cloud. Visit the Tutorials: Basic TileDB Cloud section for more information on how to use TileDB Cloud.
In this tutorial, you will:
- Use a basic pandas-like dataframe operator to query a table (represented as a TileDB array).
- Filter samples for a specific demographic using SQL.
- Retrieve exon intervals for a specific gene symbol using SQL.
- Query a large VCF dataset to retrieve variants for samples within your cohort that overlap DRD2’s coding regions.
For more information on tabular data, visit the Key Concepts: Tables and SQL section.
First, import the necessary libraries, log into TileDB Cloud and set up the URIs of the datasets used throughout the tutorial. You can skip logging in if you run this tutorial in a notebook server inside TileDB Cloud.
You can get the contents of gene_array into a pandas dataframe using the TileDB df operator.
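A minimal sketch of this step, assuming you have already logged in with `tiledb.cloud.login` and that `gene_array` holds the URI defined in the setup code. The helper name `read_table` is hypothetical and only used for illustration.

```python
def read_table(uri):
    """Read an entire TileDB array into a pandas DataFrame."""
    # Imported lazily so the sketch can be loaded without the TileDB packages.
    import tiledb
    import tiledb.cloud

    # tiledb.cloud.Ctx() carries the credentials set by tiledb.cloud.login().
    with tiledb.open(uri, ctx=tiledb.cloud.Ctx()) as A:
        # df[:] selects every cell and returns the result as a DataFrame.
        return A.df[:]


# Usage (needs a TileDB Cloud login):
# genes = read_table(gene_array)
# genes.head()
```

Reading the whole array is reasonable for a small annotation table like this one; for larger arrays you would normally slice instead.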
This array happens to be indexed on gene_name. Thus, you can slice only the entry for a specific gene as follows:
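The slice could look like the sketch below. It assumes `gene_name` is a string dimension of the array, as the text states, and that you are logged in; `read_gene` is a hypothetical helper name.

```python
def read_gene(uri, gene_name):
    """Return only the cells whose gene_name dimension equals `gene_name`."""
    # Imported lazily so the sketch can be loaded without the TileDB packages.
    import tiledb
    import tiledb.cloud

    with tiledb.open(uri, ctx=tiledb.cloud.Ctx()) as A:
        # Indexing df with a single dimension value selects that point only,
        # so just the matching cells are fetched.
        return A.df[gene_name]


# Usage (needs a TileDB Cloud login):
# drd2 = read_gene(gene_array, "DRD2")
```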
You can perform the same query using SQL as follows:
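As a rough sketch, the SQL version can be built and submitted as shown here. The helper names `gene_query` and `run_sql` are hypothetical, and the sketch assumes the array exposes `gene_name` as a queryable column.

```python
def gene_query(array_uri, gene_name):
    """Build a SQL statement selecting one gene's rows from the gene table."""
    # The array URI, wrapped in backticks, plays the role of the table name.
    # Values are interpolated directly, so pass trusted input only.
    return f"SELECT * FROM `{array_uri}` WHERE gene_name = '{gene_name}'"


def run_sql(query):
    """Run a SQL statement on TileDB Cloud and return the result."""
    # Imported lazily so the sketch can be loaded without the TileDB packages.
    import tiledb.cloud.sql

    return tiledb.cloud.sql.exec(query)


# Usage (needs a TileDB Cloud login):
# df = run_sql(gene_query(gene_array, "DRD2"))
```

Keeping the query builder separate from the call that submits it makes the SQL easy to inspect before it is sent.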
The rest of this tutorial focuses on the assigned Ensembl gene ID, which you’ll need to query the array containing Ensembl’s exon annotations, in order to obtain the associated coding regions.
You can accomplish this by defining two delayed SQL queries to retrieve the samples and regions of interest, and passing their results into a distributed VCF query that will perform those two SQL queries in parallel. The SQL queries are called “delayed” because they are not computed when they are defined; their computation is deferred until the distributed VCF query is performed.
First, define a SQL query to retrieve the regions.
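A sketch of the regions query follows. The column names are placeholders chosen for illustration, not the exon array's actual schema, so inspect the array before relying on them. It also assumes that `DelayedSQL` accepts the query string as its first argument.

```python
# ASSUMPTION: these column names are hypothetical; check the exon array's
# schema for the real ones.
REGION_COLUMNS = ("chrom", "exon_start", "exon_end")
GENE_ID_COLUMN = "gene_id"


def regions_query(exon_uri, gene_id):
    """Build a SQL statement listing the distinct exon intervals of a gene."""
    columns = ", ".join(REGION_COLUMNS)
    return (
        f"SELECT DISTINCT {columns} FROM `{exon_uri}` "
        f"WHERE {GENE_ID_COLUMN} = '{gene_id}'"
    )


def delayed_regions(exon_uri, gene_id):
    """Wrap the regions query in a delayed node without executing it."""
    # Imported lazily so the sketch can be loaded without the TileDB packages.
    from tiledb.cloud.compute import DelayedSQL

    # Nothing is computed here; the query runs once a task graph that
    # depends on this node is executed.
    return DelayedSQL(regions_query(exon_uri, gene_id))


# Usage (needs a TileDB Cloud login; drd2_gene_id comes from the gene table):
# regions = delayed_regions(exon_array, drd2_gene_id)
```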
Next, define a SQL query to retrieve the samples.
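The samples query might be sketched like this. The filter column and value are passed in as parameters because the metadata schema is not shown here, and the default `sample_name` column is an assumption.

```python
def samples_query(sample_uri, column, value, name_column="sample_name"):
    """Build a SQL statement selecting the names of samples matching a filter."""
    # ASSUMPTION: `name_column` and `column` must exist in the metadata array;
    # the default name is a guess for illustration.
    return f"SELECT {name_column} FROM `{sample_uri}` WHERE {column} = '{value}'"


def delayed_samples(sample_uri, column, value):
    """Wrap the samples query in a delayed node without executing it."""
    # Imported lazily so the sketch can be loaded without the TileDB packages.
    from tiledb.cloud.compute import DelayedSQL

    return DelayedSQL(samples_query(sample_uri, column, value))


# Usage (needs a TileDB Cloud login; the column and value are placeholders):
# samples = delayed_samples(sample_array, "<demographic_column>", "<value>")
```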
Finally, perform the distributed VCF query, passing the two delayed SQL objects as parameters. That will automatically build a task graph and dependency chain, and execute it in parallel.
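The final call could be sketched as below. It assumes that `tiledb.cloud.vcf.read` accepts `dataset_uri`, `regions`, `samples`, and `attrs` keyword arguments, and that the two delayed SQL objects can be passed directly, as the text describes. The attribute list is only an example selection of TileDB-VCF attributes, and both helper names are hypothetical.

```python
def vcf_read_kwargs(vcf_uri, regions, samples):
    """Collect the arguments for the distributed VCF read in one place."""
    return {
        "dataset_uri": vcf_uri,
        "regions": regions,  # delayed SQL node (or a list of region strings)
        "samples": samples,  # delayed SQL node (or a list of sample names)
        # Example attributes to fetch; adjust to what your analysis needs.
        "attrs": ["sample_name", "contig", "pos_start", "pos_end", "fmt_GT"],
    }


def run_vcf_query(vcf_uri, regions, samples):
    """Submit the distributed VCF query to TileDB Cloud."""
    # Imported lazily so the sketch can be loaded without the TileDB packages.
    import tiledb.cloud.vcf

    return tiledb.cloud.vcf.read(**vcf_read_kwargs(vcf_uri, regions, samples))


# Usage (needs a TileDB Cloud login; regions and samples are the two delayed
# SQL objects defined in the previous steps):
# variants = run_vcf_query(vcf_array, regions, samples)
```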