Population Genomics: Scalable Queries

Learn about using TileDB task graphs to perform distributed queries on TileDB-VCF datasets.

How to run this tutorial

This tutorial runs only on TileDB Cloud. TileDB Cloud has a free tier, so we strongly recommend that you sign up and run everything there, which requires no installation or deployment.

You can take advantage of TileDB Cloud’s distributed computing capabilities to perform scalable queries in two ways:

Building your own custom distributed algorithms using UDFs and task graphs.
Leveraging the built-in distributed read capabilities of TileDB-VCF on TileDB Cloud.

This tutorial covers the second way. It shows how to query a large dataset with a single TileDB-VCF command on TileDB Cloud, which automatically executes the query in a distributed, serverless manner across multiple cloud workers. For more information on TileDB Cloud’s scalability, visit the Key Concepts: Distributed Compute section.
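To make the idea of a distributed read more concrete, the sketch below shows one way a query over many samples and regions could be split into independent work units that separate workers can process. It is an illustration only and not TileDB Cloud's actual partitioning logic; the function name `plan_partitions`, the `samples_per_task` parameter, and the sample IDs are hypothetical.

```python
from itertools import product


def plan_partitions(samples, regions, samples_per_task):
    """Split a query into independent (region, sample-batch) work units."""
    if samples_per_task < 1:
        raise ValueError("samples_per_task must be at least 1")
    # Cut the sample list into consecutive batches of bounded size.
    batches = [
        samples[i : i + samples_per_task]
        for i in range(0, len(samples), samples_per_task)
    ]
    # Every combination of region and sample batch can be read on its own.
    return [
        {"region": region, "samples": batch}
        for region, batch in product(regions, batches)
    ]


# Hypothetical sample IDs, for illustration only.
samples = [f"S{i:04d}" for i in range(10)]
regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]

tasks = plan_partitions(samples, regions, samples_per_task=4)
print(len(tasks))  # 2 regions x 3 sample batches = 6 tasks
```

Because no work unit depends on another, they can run on different machines, and their results only need to be concatenated at the end.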

First, set up your TileDB Cloud credentials, similar to Tutorials: Basic TileDB Cloud.

Python

import os

import tiledb

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_account = os.environ["TILEDB_ACCOUNT"]
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Get the S3 bucket name from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Add the TileDB Cloud REST token to the configuration and create a context.
# This context initialization can be performed only once.
cfg = tiledb.Config(
    {
        "rest.token": tiledb_token,
        # or use your username and password (not recommended)
        # "rest.username": tiledb_username,
        # "rest.password": tiledb_password,
    }
)
ctx = tiledb.Ctx(cfg)
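The setup code above raises a `KeyError` on the first environment variable that is missing. As an optional convenience, a small check like the following can report all missing variables at once before any TileDB call is made. The helper name `missing_env_vars` is hypothetical, and the variable names are the ones read in the snippet above.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


required = ["TILEDB_ACCOUNT", "TILEDB_REST_TOKEN", "S3_BUCKET"]
missing = missing_env_vars(required)
if missing:
    print("Set these environment variables first: " + ", ".join(missing))
```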

Next, import the necessary libraries.

Python

import os.path

import tiledb.cloud
import tiledb.cloud.vcf
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))
print("TileDB-Cloud-Py version: {}".format(tiledb.cloud.version.version))

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.3
TileDB-Cloud-Py version: 0.12.17

Choose a dataset to read. This tutorial uses the preloaded, publicly accessible 1000 Genomes dataset. Retrieve all of its samples (over 3,000), define the attributes and genomic regions to query, and then issue the read query.

Python

# Define the dataset URI
dataset_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"

# Get all samples from it
ds = tiledbvcf.Dataset(dataset_uri, tiledb_config=cfg)
samples = ds.samples()

# Define attributes and ranges to query on
attrs = ["sample_name", "fmt_GT", "fmt_AD", "fmt_DP"]
regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]

# Perform the read, which is going to be executed in a distributed
# fashion.
df = tiledb.cloud.vcf.read(
    dataset_uri=dataset_uri,
    regions=regions,
    samples=samples,
    attrs=attrs,
    namespace=tiledb_account,  # the TileDB Cloud account (namespace) to charge
)
df.to_pandas()

	sample_name	fmt_GT	fmt_AD	fmt_DP
0	HG00146	[0, 0]	[46, 0]	46
1	HG00116	[0, 0]	[46, 0]	46
2	HG00107	[0, 0]	[42, 0]	42
3	HG00131	[0, 0]	[40, 0]	40
4	HG00118	[0, 0]	[37, 0]	37
...	...	...	...	...
504490	NA20888	[0, 0]	[32, 1]	33
504491	NA20901	[0, 0]	[22, 0]	22
504492	NA21092	[0, 0]	[28, 0]	28
504493	NA20888	[0, 0]	[32, 0]	32
504494	NA21092	[0, 0]	[24, 0]	24

504495 rows × 4 columns
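Once the result is local, ordinary Python is enough for follow-up analysis. As a minimal sketch, the snippet below computes the mean read depth (`fmt_DP`) per sample from a list of row dictionaries. It uses the last five rows of the output above as input; the helper name `mean_depth_per_sample` is hypothetical. Rows in this shape can be produced from a pandas DataFrame with `DataFrame.to_dict("records")`.

```python
from collections import defaultdict
from statistics import mean


def mean_depth_per_sample(records):
    """Average fmt_DP over all records that belong to each sample."""
    depths = defaultdict(list)
    for rec in records:
        depths[rec["sample_name"]].append(rec["fmt_DP"])
    return {sample: mean(values) for sample, values in depths.items()}


# The last five rows of the query output shown above.
records = [
    {"sample_name": "NA20888", "fmt_DP": 33},
    {"sample_name": "NA20901", "fmt_DP": 22},
    {"sample_name": "NA21092", "fmt_DP": 28},
    {"sample_name": "NA20888", "fmt_DP": 32},
    {"sample_name": "NA21092", "fmt_DP": 24},
]

depth_by_sample = mean_depth_per_sample(records)
print(depth_by_sample)
```

A sample can appear in several rows because each row corresponds to one record within the queried regions, which is why the values are grouped by `sample_name` before averaging.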

Behind the scenes, this single command generates and deploys a task graph on TileDB Cloud, which executes the query in parallel across multiple machines. You can find the detailed task graph logs under Monitor -> Logs -> Task Graphs by selecting the corresponding task graph entry. This looks as follows.

That’s it! All the serverless, distributed computing magic is built into a single TileDB-VCF command run on TileDB Cloud.
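For readers curious about what a task graph does conceptually, here is a minimal local sketch that runs independent "read" nodes in parallel and then a "merge" node that depends on them. It uses only the Python standard library (`graphlib` requires Python 3.9 or later) and is not the TileDB Cloud implementation. The function `run_task_graph`, the node names, and the fake rows are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_task_graph(tasks, deps, max_workers=4):
    """Run callables in dependency order, one wave of ready nodes at a time.

    tasks: name -> callable taking a dict of its dependencies' results
    deps:  name -> set of node names that must finish first
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All nodes whose dependencies are finished can run concurrently.
            futures = {
                name: pool.submit(
                    tasks[name], {d: results[d] for d in deps.get(name, set())}
                )
                for name in sorter.get_ready()
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


def make_reader(region):
    # Stand-in for a per-partition read: returns fake rows instead of querying.
    return lambda _inputs: [f"{region}:row{i}" for i in range(2)]


regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]
tasks = {f"read_{i}": make_reader(r) for i, r in enumerate(regions)}
tasks["merge"] = lambda inputs: sorted(
    row for rows in inputs.values() for row in rows
)
deps = {"merge": {"read_0", "read_1"}}

results = run_task_graph(tasks, deps)
print(len(results["merge"]))  # 4 rows: 2 from each read node
```

The shape is the same fan-out and fan-in described above: the read nodes have no dependencies and run side by side, while the merge node waits for all of them.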