Handle Large VCF Queries

Learn how to handle VCF queries that are larger than the RAM available on a single machine.

How to run this tutorial

You can run this tutorial in two ways:

Locally on your machine.
On TileDB Cloud.

Because TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, which requires no installation or deployment.

This tutorial demonstrates the TileDB-VCF features for working around memory limits when a query returns more data than a single machine can hold in RAM. It simulates a large query by ingesting samples into a local dataset and reading that dataset with an artificially small memory budget.

Setup

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.

Python

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/large-queries")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.34.0

Ingest samples

Next, create a TileDB-VCF dataset and ingest some samples into it.

Python

# Specify the samples to be ingested
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = ["HG00096_chr21.gvcf.gz", "HG00097_chr21.gvcf.gz"]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris, threads=4, total_memory_budget_mb=2048)

Read

One strategy for accommodating large queries is simply to increase the amount of memory available to TileDB-VCF. By default, TileDB-VCF allocates 2 GiB of memory for queries, and you can adjust this value with the memory_budget_mb parameter. This tutorial instead lowers the budget to 1 GiB (1024 MiB) to demonstrate that TileDB-VCF can perform large queries even in a memory-constrained environment.

Python

# Open the Dataset in read mode
cfg = tiledbvcf.ReadConfig(memory_budget_mb=1024)
ds = tiledbvcf.Dataset(uri=vcf_uri, cfg=cfg)

Now read data from the dataset and check whether the read completed (that is, TileDB has no more data to return) or is incomplete.

Python

# Read a large chromosome region for all samples
df = ds.read(
    regions=["chr21:1-50000000"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)

# Check if the read was complete (no more data to read)
ds.read_completed()

False

In this example, the read was not completed, which means TileDB has more data to read. The following code shows how to continue reading from the point where the previous incomplete query left off, using continue_read(). As an example of a larger-than-RAM analysis, it counts the total number of rows read.

Python

print(f"Rows read: {len(df):,}")

# Count the total number of rows read
total_rows = len(df)
while not ds.read_completed():
    df = ds.continue_read()
    print(f"Rows read: {len(df):,}")
    total_rows += len(df)

print(f"Total rows read: {total_rows:,}")

Rows read: 1,862,313
Rows read: 1,862,313
Rows read: 1,179,921
Total rows read: 4,904,547
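
The loop above can also be wrapped in a small generator, so that downstream code handles one batch at a time and never keeps more than one batch in memory. The following is a minimal sketch, not part of the TileDB-VCF API: iter_batches is a hypothetical helper, and FakeDataset is a stand-in that mimics only the three methods used in this tutorial (read, read_completed, and continue_read) so that the example runs without a dataset. Its batches are plain lists of sample names rather than DataFrames.

```python
from collections import Counter


def iter_batches(ds, **read_kwargs):
    """Yield every batch of a possibly incomplete read, one at a time.

    Hypothetical helper: `ds` is any object exposing read(),
    read_completed(), and continue_read() as used in this tutorial.
    """
    yield ds.read(**read_kwargs)
    while not ds.read_completed():
        yield ds.continue_read()


class FakeDataset:
    """Stand-in for a TileDB-VCF dataset (assumption, illustration only).

    It serves pre-built batches so the sketch runs anywhere.
    """

    def __init__(self, batches):
        self._batches = list(batches)
        self._next = 0  # index of the next batch to serve

    def read(self, **kwargs):
        # Start a new query: serve the first batch.
        self._next = 1
        return self._batches[0]

    def continue_read(self):
        # Resume the query where the previous batch ended.
        batch = self._batches[self._next]
        self._next += 1
        return batch

    def read_completed(self):
        return self._next >= len(self._batches)


fake = FakeDataset(
    [["HG00096", "HG00097"], ["HG00096"], ["HG00097", "HG00097"]]
)

# Aggregate per batch: only the running counts stay in memory.
rows_per_sample = Counter()
for batch in iter_batches(fake, regions=["chr21:1-50000000"]):
    rows_per_sample.update(batch)

print(dict(rows_per_sample))
```

With the dataset opened earlier in this tutorial, the same loop would presumably be written as iter_batches(ds, regions=[...], attrs=[...]), updating the counter from batch["sample_name"] instead of from the list itself. The point of the design is that each batch is reduced to a small summary before the next one is requested, so peak memory is bounded by the read budget and not by the size of the full result.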

Clean up

Clean up the created TileDB-VCF dataset.

Python

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)