Learn how to handle larger-than-RAM queries on a single machine.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial demonstrates TileDB-VCF features for overcoming memory limitations when a query returns a large amount of data to a single machine. The tutorial simulates a large query by ingesting samples to a local dataset and reading the dataset with an artificially small memory budget.
Setup
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
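The cleanup part of this step can be sketched with the standard library alone. This is a minimal sketch, not the tutorial's exact code: the dataset location and the directory name `large_query_vcf_example` are assumptions chosen for illustration, and the `tiledbvcf` import is left to the ingestion code.

```python
import os
import shutil
import tempfile

# Hypothetical local path for the dataset (an assumption; use any local path you like)
vcf_uri = os.path.join(tempfile.gettempdir(), "large_query_vcf_example")

# Remove a dataset left over from a previous run so creation starts clean
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

print(f"Dataset URI: {vcf_uri}")
```

A local TileDB-VCF dataset is stored as a directory, so removing that directory is enough to start over.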
import tiledbvcf

# Specify the samples to be ingested
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = ["HG00096_chr21.gvcf.gz", "HG00097_chr21.gvcf.gz"]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]

# Open a VCF dataset in write mode (vcf_uri is the dataset URI set during setup)
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris, threads=4, total_memory_budget_mb=2048)
Read
One strategy for accommodating large queries is to increase the amount of memory available to TileDB-VCF. By default, TileDB-VCF allocates 2 GiB of memory for queries, but this value can be adjusted with the memory_budget_mb parameter. For the purposes of this tutorial, the budget is instead decreased to 1 GiB (1024 MiB) to demonstrate that TileDB-VCF can perform large queries even in a memory-constrained environment.
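# Reopen the dataset in read mode with a 1 GiB (1024 MiB) memory budget
cfg = tiledbvcf.ReadConfig(memory_budget_mb=1024)
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r", cfg=cfg)

# Read a large chromosome region for all samples
df = ds.read(
    regions=["chr21:1-50000000"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)

# Check if the read was complete (no more data to read)
ds.read_completed()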
False
In this example, the read was not completed, which means TileDB-VCF has more data to return for this query. The following code shows how to continue reading from the point where the previous incomplete query left off, using continue_read(). As an example of a larger-than-RAM analysis, it counts the total number of rows read across all batches.
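The pattern described above can be sketched generically as a loop that processes one batch at a time and keeps only a running total in memory. This is an illustrative sketch rather than the tutorial's exact code: `count_rows` is a hypothetical helper, and `FakeDataset` is a hypothetical stand-in that mimics only the `read()`, `continue_read()`, and `read_completed()` methods so the example runs without TileDB-VCF installed. With a real dataset opened in read mode, `ds` would take the place of `fake`.

```python
def count_rows(ds, **read_kwargs):
    """Count rows across an initial read and all continued reads."""
    batch = ds.read(**read_kwargs)
    total = len(batch)
    # Keep fetching batches until the dataset reports the query is complete
    while not ds.read_completed():
        batch = ds.continue_read()
        total += len(batch)
    return total


class FakeDataset:
    """Hypothetical stand-in that returns results in fixed-size batches."""

    def __init__(self, rows, batch_size):
        self._rows = rows
        self._batch_size = batch_size
        self._offset = 0

    def read(self, **kwargs):
        # Start a new query from the beginning; query arguments are ignored here
        self._offset = 0
        return self.continue_read()

    def continue_read(self):
        # Return the next batch and advance the read position
        batch = self._rows[self._offset:self._offset + self._batch_size]
        self._offset += len(batch)
        return batch

    def read_completed(self):
        return self._offset >= len(self._rows)


fake = FakeDataset(rows=list(range(10)), batch_size=4)
total_rows = count_rows(fake, regions=["chr21:1-50000000"])
print(total_rows)
```

Because each batch is discarded after it is counted, peak memory use is bounded by the size of a single batch rather than by the full result set.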