Learn how to ingest a small batch of gVCFs using TileDB-VCF.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial guides you through a short example of ingesting a small batch of gVCFs using TileDB-VCF. This approach is appropriate for small datasets (under 10 samples).
Setup
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
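The setup steps described above might look like the following minimal sketch. The dataset path is a hypothetical placeholder (a folder under the system temp directory), and `reset_local_dataset` is an illustrative helper name, not part of TileDB-VCF. The ingestion code later in this tutorial additionally needs `import tiledbvcf`.

```python
import os
import shutil
import tempfile

# Hypothetical local dataset path; substitute the location you want to use.
vcf_uri = os.path.join(tempfile.gettempdir(), "tiledbvcf_small_batch_example")


def reset_local_dataset(uri):
    """Remove a local dataset folder if it exists.

    Returns True when a folder was deleted, False when there was nothing to delete.
    """
    if os.path.exists(uri):
        shutil.rmtree(uri)
        return True
    return False


# Start from a clean slate so that creating the dataset does not collide
# with one left over from an earlier run.
reset_local_dataset(vcf_uri)
```

Returning a boolean is only a convenience for logging; the tutorial itself just needs the folder to be absent before the dataset is created.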
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris
# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")
# Create empty VCF dataset
ds.create_dataset()
# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)
The above will create a folder at the path stored in vcf_uri on your local storage. To see the contents of the folder, run !tree {vcf_uri} in a notebook cell. Visit the Storage Format Specification section for more details on what each subfolder and file represents.
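If the `tree` command is not available in your environment, a pure-Python listing can serve the same purpose. The sketch below is an illustrative helper (`list_tree` is a hypothetical name, not part of TileDB-VCF) that returns an indented listing of a local folder down to a chosen depth.

```python
import os


def list_tree(root, max_depth=2):
    """Return an indented listing of `root`, descending at most `max_depth` levels."""
    lines = [os.path.basename(os.path.normpath(root)) + "/"]

    def walk(path, depth):
        # Stop once we are deeper than the requested number of levels.
        if depth > max_depth:
            return
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            is_dir = os.path.isdir(full)
            # Two spaces of indentation per level; directories get a trailing slash.
            lines.append("  " * depth + name + ("/" if is_dir else ""))
            if is_dir:
                walk(full, depth + 1)

    walk(root, 1)
    return lines
```

Calling `print("\n".join(list_tree(vcf_uri)))` in a notebook cell would then show the top two levels of the dataset folder.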
Tip
You can adjust the threads and total_memory_budget_mb parameters of the ingest_samples function to fine-tune ingestion performance.
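As a rough sketch of the tip above, the helper below passes explicit `threads` and `total_memory_budget_mb` values to `ingest_samples`. The helper name `ingest_with_budget` and the default values (4 threads, 2048 MB) are illustrative assumptions, not recommendations; suitable values depend on your machine.

```python
def ingest_with_budget(ds, sample_uris, threads=4, total_memory_budget_mb=2048):
    """Ingest samples into an open, writable TileDB-VCF dataset with explicit limits.

    `ds` is expected to be a dataset opened in write mode; the defaults here
    are placeholders chosen only for illustration.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if total_memory_budget_mb < 1:
        raise ValueError("total_memory_budget_mb must be positive")
    ds.ingest_samples(
        sample_uris=sample_uris,
        threads=threads,
        total_memory_budget_mb=total_memory_budget_mb,
    )
```

With the dataset from this tutorial, the call would be `ingest_with_budget(ds, sample_uris)` in place of the plain `ds.ingest_samples(...)` call.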
Read some data to confirm that the samples were properly ingested in the dataset.
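One way to carry out this check is sketched below: compare the sample names stored in the dataset against the ones you expect, then read a few columns for a region. The helper name `check_ingestion` and the choice of columns are assumptions made for illustration, and the function expects a dataset opened in read mode. The expected sample names (for example `HG00096`) are also an assumption, since they come from the gVCF headers rather than the file names.

```python
# Columns to preview; a small, commonly useful subset chosen for illustration.
PREVIEW_ATTRS = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"]


def check_ingestion(ds, expected_samples, regions):
    """Return (missing_samples, preview) for a TileDB-VCF dataset opened for reading.

    missing_samples: expected sample names that the dataset does not contain.
    preview: the result of reading PREVIEW_ATTRS over the given regions.
    """
    ingested = set(ds.samples())
    missing = sorted(set(expected_samples) - ingested)
    preview = ds.read(attrs=PREVIEW_ATTRS, regions=regions)
    return missing, preview
```

An empty `missing` list together with a non-empty preview suggests the samples were ingested as intended. The regions are passed as strings such as `"chr21:1-10000000"`, where the coordinates shown are placeholders.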
Although .attributes("all") lists all the queryable fields, TileDB-VCF does not materialize every INFO and FMT field as a separate array attribute by default. This can affect query performance: as a rule of thumb, queries are faster when a VCF field is materialized as an array attribute in TileDB-VCF.
To see the materialized VCF fields, you can run .attributes("builtin").
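To find out which queryable fields are not materialized, you can compare the two listings. The helper below is a small sketch with a hypothetical name (`unmaterialized_fields`); it relies only on the `.attributes("all")` and `.attributes("builtin")` calls mentioned above.

```python
def unmaterialized_fields(ds):
    """Return the queryable fields that are not stored as separate array attributes.

    Takes the difference between the "all" and "builtin" listings,
    preserving the order in which the dataset reports the fields.
    """
    materialized = set(ds.attributes("builtin"))
    return [name for name in ds.attributes("all") if name not in materialized]
```

Fields returned by this helper are candidates for explicit materialization if you query them often.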
# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)
Materializing INFO and FMT fields
To speed up queries on certain INFO or FMT fields, you may want to explicitly materialize them as separate attributes in the underlying arrays that TileDB-VCF creates. This can improve both query time and compression, because TileDB adopts a “columnar” format that stores a field more efficiently when all of its values are kept together in a separate location (e.g., a separate file).
You can explicitly materialize INFO or FMT fields as follows.
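A minimal sketch of this step is shown below, assuming the `extra_attrs` parameter of `create_dataset` in the TileDB-VCF Python API and the `info_<NAME>` / `fmt_<NAME>` naming convention for fields. The helper name `create_with_materialized_fields` is hypothetical, and the fields you pass should be ones that actually occur in your gVCFs.

```python
def create_with_materialized_fields(ds, fields):
    """Create an empty TileDB-VCF dataset with extra materialized INFO/FMT fields.

    `ds` is expected to be a dataset opened in write mode at a URI where no
    dataset exists yet. `fields` uses names such as "info_DP" or "fmt_GQ".
    """
    fields = list(fields)
    # Reject names that do not follow the info_/fmt_ prefix convention.
    invalid = [f for f in fields if not f.startswith(("info_", "fmt_"))]
    if invalid:
        raise ValueError(f"Expected info_* or fmt_* field names, got: {invalid}")
    ds.create_dataset(extra_attrs=fields)
    return fields
```

After this call, ingestion proceeds as before with `ds.ingest_samples(sample_uris=sample_uris)`. The materialized fields must be chosen when the dataset is created, which is why the example starts from an empty dataset.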