Tables and SQL

Learn how to query tabular variant data and leverage the power of SQL.

How to run this tutorial

This tutorial requires TileDB Cloud, which has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.

Along with the TileDB-VCF datasets (which aren’t exactly tabular), TileDB creates other important, associated tabular datasets, such as sample metadata, variant annotation, and gene models stored as generic TileDB arrays. You can access this data in a variety of ways, including the raw array programmatic APIs, but also Python pandas-like dataframe operators and, most importantly, SQL.

Although you can run this tutorial locally from your laptop, you need to have a TileDB Cloud account, as the majority of the operations leverage functionality provided only by TileDB Cloud. Visit the Tutorials: Basic TileDB Cloud section for more information on how to use TileDB Cloud.

In this tutorial, you will:

Use a basic pandas-like dataframe operator to query a table (represented as a TileDB array).
Filter samples for a specific demographic using SQL.
Retrieve exon intervals for a specific gene symbol using SQL.
Query a large VCF dataset to retrieve variants for samples within your cohort that overlap the coding regions of a gene of interest (KCNQ2 in this tutorial).

For more information on tabular data, visit the Key Concepts: Tables and SQL section.

First, import the necessary libraries, log in to TileDB Cloud, and set up the URIs of the datasets used throughout the tutorial. You can skip logging in if you run this tutorial in a notebook server inside TileDB Cloud.

Python

import os
import tiledb
import tiledb.cloud
import tiledb.cloud.sql
import tiledb.cloud.vcf
from tiledb.cloud.compute import DelayedSQL

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# VCF data from the DRAGEN 1KG samples
vcf_array = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"

# Phenotypes for the 1KG samples
sample_array = "tiledb://TileDB-Inc/vcf-1kg-sample-metadata"

# Ensembl gene/exon annotation
gene_array = "tiledb://TileDB-Inc/ensemblgene_sparse"
exon_array = "tiledb://TileDB-Inc/ensemblexon_sparse"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)

You can get the contents of gene_array into a pandas dataframe using the TileDB df operator.

Python

with tiledb.open(gene_array, ctx=tiledb.cloud.Ctx()) as A:
    df = A.df[:]
df

		chrom	pos_start	pos_end	width	strand	gene_source	gene_biotype
gene_name	gene_id
5S_rRNA	ENSG00000201285	X	147089620	147089735	116	-	ensembl	rRNA
	ENSG00000212595	X	150194767	150194878	112	-	ensembl	rRNA
	ENSG00000238602	GL000192.1	415332	415454	123	+	ensembl	rRNA
	ENSG00000238762	GL000228.1	22673	22791	119	+	ensembl	rRNA
	ENSG00000239156	GL000228.1	20113	20230	118	+	ensembl	rRNA
...	...	...	...	...	...	...	...	...
snoZ6	ENSG00000266692	21	45857004	45857058	55	+	ensembl	snoRNA
snosnR60_Z15	ENSG00000201853	2	125886409	125886490	82	+	ensembl	snoRNA
snosnR60_Z15	ENSG00000252849	7	131600994	131601081	88	-	ensembl	snoRNA
snosnR66	ENSG00000212397	11	112473077	112473175	99	-	ensembl	snoRNA
yR211F11.2	ENSG00000213076	6	159342303	159343182	880	-	havana	pseudogene

63677 rows × 7 columns

This array happens to be indexed on gene_name. Thus, you can slice only the entry for a specific gene as follows:

Python

gene_name = "KCNQ2"

with tiledb.open(gene_array, ctx=tiledb.cloud.Ctx()) as A:
    df = A.df[gene_name, :]
df

		chrom	pos_start	pos_end	width	strand	gene_source	gene_biotype
gene_name	gene_id
KCNQ2	ENSG00000075043	20	62037542	62103993	66452	-	ensembl_havana	protein_coding

You can perform the same query using SQL as follows:

Python

tiledb.cloud.sql.exec(
    query=f"select * from `{gene_array}` where `gene_name` = '{gene_name}'"
)

	gene_name	gene_id	gene_biotype	gene_source	strand	width	pos_end	pos_start	chrom
0	KCNQ2	ENSG00000075043	protein_coding	ensembl_havana	-	66452	62103993	62037542	20

The rest of this tutorial relies on the gene's Ensembl gene ID (gene_id), which you need in order to query the array containing Ensembl’s exon annotations and obtain the associated coding regions.

You can accomplish this by defining two delayed SQL queries to retrieve the samples and regions of interest, and passing their results into a distributed VCF query that will perform those two SQL queries in parallel. The SQL queries are called “delayed” because they are not computed when they are defined; their computation is deferred until the distributed VCF query is performed.
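
The idea of delayed execution can be illustrated with a small, purely conceptual sketch. The Delayed class below is a hypothetical toy written for this illustration, not TileDB Cloud's implementation, and it resolves its inputs one after another rather than in parallel. It only shows that defining a node does no work and that computing the final node triggers its dependencies.

```python
class Delayed:
    """Toy stand-in for a delayed task: stores the work now, runs it later."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self.calls = 0  # how many times the work actually ran

    def compute(self):
        # Resolve any delayed arguments first, then run this task.
        resolved = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        self.calls += 1
        return self.func(*resolved)


# Defining the nodes runs nothing yet.
toy_samples = Delayed(lambda: ["HG00097", "HG00120"])
toy_regions = Delayed(lambda: ["chr20:62037542-62103993"])
toy_query = Delayed(
    lambda s, r: [(sample, region) for sample in s for region in r],
    toy_samples,
    toy_regions,
)
calls_before = (toy_samples.calls, toy_regions.calls)

# The work happens only when the final node is computed.
toy_result = toy_query.compute()
print(calls_before, toy_result)
```

In the tutorial itself, the DelayedSQL objects play the role of the two input nodes and the distributed VCF query plays the role of the final node.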

First, define a SQL query to retrieve the regions.

Python

ensembl_query = f"""
  SELECT
    concat("chr", ensemblexon.chrom, ":", ensemblexon.pos_start, "-", ensemblexon.pos_end) region
  FROM `{gene_array}` ensemblgene
  LEFT JOIN `{exon_array}` ensemblexon ON ensemblexon.gene_id = ensemblgene.gene_id
  WHERE ensemblgene.gene_name = ?
"""

regions = DelayedSQL(ensembl_query, name="regions", parameters=[gene_name])
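
To see what this kind of query returns, here is a minimal local sketch that uses Python's built-in sqlite3 module as a stand-in for TileDB Cloud SQL. The tables, gene names, and coordinates are made up for illustration, and SQLite's `||` operator replaces the concat call used above. The join logic and the `?` placeholder work the same way in this toy setting.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE gene (gene_id TEXT, gene_name TEXT);
    CREATE TABLE exon (gene_id TEXT, chrom TEXT, pos_start INTEGER, pos_end INTEGER);
    -- Made-up rows for illustration only
    INSERT INTO gene VALUES ('G1', 'GENE_A'), ('G2', 'GENE_B'), ('G3', 'GENE_C');
    INSERT INTO exon VALUES
        ('G1', '20', 100, 200),
        ('G1', '20', 300, 450),
        ('G2', '7', 10, 50);
""")

toy_query = """
    SELECT 'chr' || exon.chrom || ':' || exon.pos_start || '-' || exon.pos_end AS region
    FROM gene
    LEFT JOIN exon ON exon.gene_id = gene.gene_id
    WHERE gene.gene_name = ?
    ORDER BY exon.pos_start
"""

# One region string per exon of the requested gene.
toy_regions = [row[0] for row in con.execute(toy_query, ("GENE_A",))]
print(toy_regions)  # ['chr20:100-200', 'chr20:300-450']

# GENE_C has no exons: the LEFT JOIN still returns one row, with a NULL region.
no_exon_regions = [row[0] for row in con.execute(toy_query, ("GENE_C",))]
print(no_exon_regions)  # [None]
```

The second query shows a property of LEFT JOIN worth keeping in mind: in this stand-in, a gene without matching exon rows produces a single NULL region rather than an empty result.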

Next, define a SQL query to retrieve the samples.

Python

gender = "female"
pop = "GBR"

sample_query = f"""
  SELECT sampleuid FROM `{sample_array}`
  WHERE pop = ? AND gender = ?
"""

samples = DelayedSQL(sample_query, name="samples", parameters=[pop, gender])
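
The `?` placeholders are positional, so the order of the values in the parameters list has to match the order of the placeholders in the query. The following sketch shows this with Python's built-in sqlite3 module and a made-up sample table. It is a local stand-in for illustration, not a TileDB Cloud call.

```python
import sqlite3

sample_con = sqlite3.connect(":memory:")
sample_con.execute("CREATE TABLE sample (sampleuid TEXT, pop TEXT, gender TEXT)")
sample_con.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [  # made-up rows for illustration only
        ("S1", "GBR", "female"),
        ("S2", "GBR", "male"),
        ("S3", "FIN", "female"),
    ],
)

toy_sql = "SELECT sampleuid FROM sample WHERE pop = ? AND gender = ? ORDER BY sampleuid"

# Values are bound in order: the first to `pop`, the second to `gender`.
matched = [row[0] for row in sample_con.execute(toy_sql, ("GBR", "female"))]

# Swapping the values silently matches nothing, because no row has pop = 'female'.
swapped = [row[0] for row in sample_con.execute(toy_sql, ("female", "GBR"))]

print(matched, swapped)  # ['S1'] []
```

A swapped order does not raise an error in this sketch; it just returns no rows, so an unexpectedly empty sample list is worth checking against the placeholder order.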

Finally, perform the distributed VCF query, passing the two delayed SQL objects as parameters. This automatically builds a task graph and dependency chain, and executes it in parallel.

Python

df = tiledb.cloud.vcf.read(
    dataset_uri=vcf_array, samples=samples, regions=regions
).to_pandas()
df

	sample_name	contig	pos_start	alleles	fmt_GT
0	HG00097	chr20	62037705	[G, A]	[0, 1]
1	HG00097	chr20	62037705	[G, A]	[0, 1]
2	HG00097	chr20	62037705	[G, A]	[0, 1]
3	HG00097	chr20	62037705	[G, A]	[0, 1]
4	HG00120	chr20	62037705	[G, A]	[0, 1]
...	...	...	...	...	...
2573	HG00262	chr20	62078124	[G, A]	[0, 1]
2574	HG00262	chr20	62078124	[G, A]	[0, 1]
2575	HG00262	chr20	62078124	[G, A]	[0, 1]
2576	HG00262	chr20	62078124	[G, A]	[0, 1]
2577	HG00262	chr20	62078124	[G, A]	[0, 1]

2578 rows × 5 columns