Introduction
TileDB offers a specialized product, called TileDB-VCF, specifically designed for managing and analyzing biobank-scale variant data.
History
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad’s GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of the open-source TileDB array engine, incorporating new algorithms, features, and optimizations. TileDB-VCF has extended security, management, scalable compute, and visualization features on TileDB Cloud.
An explosion of variant data
National biobank initiatives, such as UK Biobank and All of Us, have yielded vast repositories of whole human genomes, numbering in the hundreds of thousands. According to Illumina, at least 4 million individuals have been fully sequenced. Each individual’s genome typically yields between 4 and 5 million single nucleotide polymorphisms (SNPs), as well as insertions and deletions (indels) and larger structural variants (SVs), all of which diverge from the reference sequence. Analogous large-scale sequencing endeavors targeting agricultural crops, model organisms, microbes, and even companion animals have produced similarly extensive sets of variants. These variants form the cornerstone of the growing field of population genomics, which leverages sheer scale and genetic variability to improve our understanding of biology and medicine.
Among the groups making the most of large-scale population sequencing in TileDB are:
Biopharma and Biotech: Biopharma is leveraging large public and private biobank resources to develop leads for new compounds, as well as identify patient populations most likely to respond to treatments. Major biopharmas and biotech startups are using TileDB-VCF in basic and translational research, often in concert with other omics data.
Rare Disease & Newborn Screening: Consortiums such as BeginNGS are using in-house and public data to identify actionable variants in newborns for screening, all using TileDB-VCF in TileDB Cloud.
Microbiology and Infectious Disease: The use of VCF files in virology gained momentum during large-scale sequencing of variation in SARS-CoV-2. TileDB hosts over 5.6 million SARS-CoV-2 samples released by the NIH as part of the ACTIV TRACE initiative.
Agriculture: Agriculture is leading the concept of pangenomics, sequencing all inbred lines of major crops. TileDB has full support for polyallelic loci found in crops such as wheat.
The foundations of TileDB were laid while tackling the challenge of scalably managing large and rapidly growing variant datasets. Population genomics continues to be a key component of TileDB’s life science portfolio.
VCF files and their limitations
Variant calling is the process of predicting genomic variants and genotypes from reads aligned to a reference genome. Algorithms weigh various quality metrics of these aggregated sequencing reads, which can be visualized as stacks or “pileups”, to generate calls at positions that may contain single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), larger structural variants (SVs), or copy number variants (CNVs).
Variant Call Format (VCF) files, developed during the 1000 Genomes Project, combine variant calls and genotype calls with technical metadata (depth, quality and confidence metrics), along with extensible annotation at the locus and sample level. Some might consider VCF more of a suggestion than a format, but it has proven an effective means of transmitting variants and associating annotation at the sample level.
A review of the VCF-tagged questions on Biostars, a popular Q&A site for bioinformatics, reveals that many people are trying to use VCF files themselves as a kind of ad hoc database, rather than as a conduit for transmitting variant information. This approach is problematic and suffers from numerous limitations:
VCF is fine as a suitcase for small-scale variation and, to a lesser extent, annotation. But you can’t live out of your suitcase forever.
VCF and its associated command-line tools are not a database system, and they will never support region and sample queries at scale or at “web-speed” in the era of national biobanks. Even its usefulness in transmitting variants is unsustainable past a few thousand samples. Annotation can also be problematic, given that everything needs to be serialized into the INFO field.
VCF files are monolithic. Adding a new sample can introduce new variants, forcing every sample to be re-interrogated at that position to distinguish if it was indeed reference or had insufficient coverage. In addition, this creates severe performance issues when storing the new sample alongside the rest of the dataset. This is known as the “N+1” problem.
While indexes provided by tabix or BCFtools can aid in range queries, these don’t help with joins against phenotypic data or other omic stacks. Bespoke or ad hoc solutions typically try to simplify the underlying data (storing only genotype calls or a fixed set of loci, for example) but these measures greatly limit the usefulness of the variant store.
The shift away from joint genotyping and toward single sample gVCFs as the preferred currency further muddies the waters.
VCF-file-based approaches are clearly not sustainable solutions.
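The N+1 problem described above can be illustrated with a small sketch. This is a toy model in plain Python, not TileDB-VCF code: the `cohort` table, the `ref_blocks` ranges, and the `add_sample` helper are all hypothetical names and data invented for illustration. It shows how one new variant site introduced by one new sample forces a lookup for every existing sample, and why that lookup needs to know whether each sample was reference or simply had no coverage.

```python
# Toy illustration of the N+1 problem. All data and names are hypothetical.

# Merged cohort table: genotype per (position -> sample) at known variant sites.
cohort = {
    100: {"s1": "0/1", "s2": "0/0"},
    250: {"s1": "0/0", "s2": "1/1"},
}

# gVCF-style blocks per sample: inclusive (start, end) ranges that were
# confidently called homozygous reference. Anything outside had no coverage.
ref_blocks = {
    "s1": [(1, 99), (101, 400)],
    "s2": [(1, 249), (251, 300)],
}


def genotype_at(sample, pos):
    """Return '0/0' inside a reference block, otherwise './.' (no call)."""
    for start, end in ref_blocks[sample]:
        if start <= pos <= end:
            return "0/0"
    return "./."


def add_sample(name, calls, blocks):
    """Merge one new sample; return how many existing cells were revisited."""
    existing = sorted({s for row in cohort.values() for s in row})
    ref_blocks[name] = blocks
    revisited = 0
    for pos, gt in calls.items():
        if pos not in cohort:
            # A new site: every existing sample must be re-interrogated here.
            cohort[pos] = {s: genotype_at(s, pos) for s in existing}
            revisited += len(existing)
        cohort[pos][name] = gt
    # Fill the new sample's value at sites it did not call itself.
    for pos, row in cohort.items():
        if name not in row:
            row[name] = genotype_at(name, pos)
    return revisited


revisited = add_sample("s3", {100: "0/1", 350: "1/1"}, [(1, 99), (101, 349)])
print(revisited)    # existing cells revisited for the one new site
print(cohort[350])  # s1 is reference here, s2 had no coverage
```

In this toy, each new site costs one lookup per existing sample, so the work for a single added sample grows with the size of the cohort. With a merged file, that work also means rewriting the file itself.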
TileDB-VCF and its benefits
TileDB’s population genomics solution is architected around TileDB-VCF, which is an open-source library for efficient and lossless storage, access and exporting of variant data. TileDB-VCF is further coupled with the TileDB Cloud commercial product to provide advanced security, management, scalable compute, and visualization features.
TileDB-VCF is built on top of the TileDB array engine. Specifically, it models population VCF data as 3-dimensional sparse arrays. The Data Model section describes in detail how TileDB-VCF models variant data as multi-dimensional arrays. TileDB-VCF is a first-class, open-source library written in C++ with APIs in Python and Java.
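To give an intuition for what a 3-dimensional sparse array means here, the following is a minimal sketch in plain Python. It assumes the three dimensions are contig, start position, and sample. The `Cell` type, the `records` dictionary, and `slice_cells` are hypothetical and do not reflect the actual on-disk schema, which the Data Model section covers.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Coordinates of one record in a hypothetical 3D sparse layout."""
    contig: str
    start_pos: int
    sample: str


# Sparse storage: only coordinates that actually hold a record exist.
# Empty (contig, position, sample) combinations take up no space.
records = {
    Cell("chr1", 1000, "s1"): {"ref": "A", "alt": "G", "gt": (0, 1)},
    Cell("chr1", 1000, "s3"): {"ref": "A", "alt": "G", "gt": (1, 1)},
    Cell("chr1", 5200, "s2"): {"ref": "T", "alt": "TC", "gt": (0, 1)},
    Cell("chr2", 300, "s1"): {"ref": "C", "alt": "T", "gt": (0, 1)},
}


def slice_cells(records, contig, lo, hi, samples):
    """Return the cells inside a region-by-sample slice of the array."""
    wanted = set(samples)
    return sorted(
        cell
        for cell in records
        if cell.contig == contig
        and lo <= cell.start_pos <= hi
        and cell.sample in wanted
    )


hits = slice_cells(records, "chr1", 1, 2000, ["s1", "s2"])
print(hits)
```

A real array engine indexes the dimensions instead of scanning every cell, but the shape of the query is the same: a genomic range on two dimensions and a set of samples on the third.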
TileDB-VCF offers a broad spectrum of benefits:
- Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. Additional features for ingesting VCF files and performing genomic interval intersections are all implemented in C++ for maximum speed.
- Compressibility: TileDB-VCF can efficiently store any number of samples in a compressed and lossless manner. Being a columnar format, TileDB-VCF can compress different VCF fields with different compressors, choosing the most appropriate one depending on the data types and nature of the data per field.
- Optimized for cloud object stores: Built on the TileDB core array engine, TileDB-VCF inherits all its features out-of-the-box, such as the speed and optimizations on numerous storage backends, including cloud object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
- Solves the N+1 problem: TileDB-VCF allows you to rapidly add new samples to existing datasets, eliminating the so-called N+1 problem and scaling both storage and update time linearly with the number of new samples. Regardless of how large your cohort is, it is strongly recommended that you ingest genomic VCFs (gVCFs) rather than vanilla single-sample VCFs. gVCFs contain spans or blocks, which distinguish regions that can be called homozygous reference from those with no coverage. These reference/no-call blocks enable N+1-compliant calls to be preserved as new samples are introduced into the TileDB-VCF store.
- Cohort level variant stats and allele count data: TileDB-VCF provides cohort-level statistics upon ingestion, a running count of the alleles observed at each position, and zygosity, which are used to calculate internal allele frequency. These convenience tables can also be transformed into useful summaries such as genotype or dosage matrices.
- Separation of genomic data and annotation: TileDB offers a growing selection of external annotation tables - gene models, gene annotation, variant annotation, and support for external models of variant interpretation and pathogenicity (including Fabric Genomics). These are optimized to work in concert with TileDB-VCF stores to accelerate queries and reports. With the aid of TileDB-VCF’s variant stats array, TileDB can support the independent annotation of existing variant stores. This means that annotation through Ensembl Variant Effect Predictor (VEP) or SnpEff can be updated as references improve without revisiting the original VCF files.
- Multiple APIs: In addition to a command-line interface, TileDB-VCF provides C++, Java, and Python APIs.
- Integration with other omic data: Groups that use genomic data often want to link transcriptomes from the same subjects to perform gene by environment (GxE) or genotype-to-phenotype experiments. TileDB-VCF can be used in concert with TileDB SOMA for single-cell or bulk RNA-Seq multiomic experiments, including eQTL analysis.
- Artificial Intelligence (AI) & Machine Learning (ML) support: TileDB has full-fledged AI & ML support. It provides implementations for saving TensorFlow Keras, PyTorch, and scikit-learn models as TileDB arrays. A common workflow is to generate a genotype or dosage matrix from a TileDB-VCF dataset and feed it in as features for predicting various phenotype-derived labels. In addition, TileDB has powerful native vector search capabilities and integrations with popular LLMs.
- Open-source: TileDB Open-source and TileDB-VCF are both open source and developed under the permissive MIT license. We are passionate about supporting the scientific community with open-source software and welcome feedback and contributions to further improve our projects.
- Notebooks for reproducible reports: Within TileDB Cloud, Jupyter notebooks provide a reproducible means of distributing variant analyses to end users. Notebooks are actually stored as TileDB arrays, and are governed and versioned using the same principles.
- Variant browsers and exploratory dashboards: TileDB Cloud enables building custom hosted dashboards in RShiny or ipywidgets. Self-hosted, fully-featured web applications are also possible using TileDB Cloud APIs.
- Security, governance, and compliance: TileDB offers user-configurable encryption for all data assets, in addition to any encryption policies defined at the storage level (e.g., for S3 buckets). TileDB Cloud allows you to manage your variant data with configurable access policies and enables secure sharing and collaboration within your organization or across different organizations. It also supports logging of all activity for auditing purposes. Finally, TileDB Cloud is SOC 2 Type 2 and HIPAA compliant.
- Battle-tested: TileDB-VCF has been used in production by very large customers to manage immense amounts of variant data, on the order of many hundreds of thousands of samples.
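As a rough illustration of the cohort-level statistics and dosage matrices mentioned in the list above, the sketch below derives allele counts, allele frequencies, and a dosage matrix from a handful of genotypes. It is plain Python over made-up data and says nothing about how TileDB-VCF computes or stores these values internally. The `genotypes` table and both helper functions are hypothetical.

```python
# Hypothetical genotypes: site -> sample -> allele indices (None = no call).
samples = ["s1", "s2", "s3"]
genotypes = {
    ("chr1", 1000): {"s1": (0, 1), "s2": (0, 0), "s3": (1, 1)},
    ("chr1", 5200): {"s1": (0, 0), "s2": (0, 1), "s3": None},
}


def allele_counts(calls):
    """Return (alt allele count, total called alleles) for one site."""
    called = [gt for gt in calls.values() if gt is not None]
    ac = sum(allele == 1 for gt in called for allele in gt)
    an = sum(len(gt) for gt in called)
    return ac, an


def dosage_row(calls):
    """Return the number of alt alleles per sample, None where uncalled."""
    row = []
    for sample in samples:
        gt = calls[sample]
        row.append(None if gt is None else sum(allele == 1 for allele in gt))
    return row


# Allele frequency = alt allele count / number of called alleles.
allele_frequency = {}
for site, calls in genotypes.items():
    ac, an = allele_counts(calls)
    allele_frequency[site] = ac / an

# Dosage matrix: one row per site, one column per sample.
dosage = {site: dosage_row(calls) for site, calls in genotypes.items()}

print(allele_frequency)
print(dosage)
```

A matrix like `dosage` is the kind of feature table that the AI & ML bullet refers to. Missing calls would need to be imputed or dropped before training.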
TileDB vs. other solutions
Three categories of solutions exist that you can adopt instead of TileDB-VCF and TileDB Cloud:
- VCF-file-based: These solutions include BCFtools, Hail, and OpenCGA, and they suffer from the limitations described above. Most importantly, they are not database systems. This means that you need to build functionality such as catalogs, secure governance, and compliance on top of them yourself, which is an enormous undertaking.
- Databases: You can instead try to model the VCF data with tables and use any analytical database or data warehouse in the market (e.g., Snowflake, Redshift, Databricks, etc.). The organizations that tried this approach and eventually migrated to TileDB faced the following problems:
  - Although the VCF data may seem like tabular data, they come with peculiarities when querying them. First, the queries are effectively range intersection queries rather than simple selections. Traditional databases do not have first-class support for range storage and intersection queries. This type of typical VCF query and its challenges are more thoroughly explained in the Data Model section. Second, most VCF queries are imposing conditions on multiple fields, which calls for superb support of multi-dimensional indexing. It is debatable how well traditional databases implement multi-dimensional indexing.
  - The genomic variant data are typically gigantic. Most traditional databases in the market charge based on the data storage and compute consumption, which makes them prohibitively expensive for organizations working with variant data.
  - You might incur a significant data engineering cost by attempting to model VCF data as tables, as well as risk for misrepresentation. More often than not, organizations represent VCF data in tables lossily (for the sake of achieving higher performance), which hinders the analysis conducted by the scientists.
- Build it yourself: Finally, you can choose to build an alternative to TileDB-VCF and TileDB Cloud from scratch. Rest assured, this is going to be more time-consuming, costly, and intellectually challenging than anything you have encountered before in your organization when it comes to data infrastructure.
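The difference between a range intersection query and a simple selection, raised under the Databases option above, can be sketched in a few lines. This is an illustrative example in plain Python with hypothetical records and function names. It assumes each record covers an inclusive range from `start` to `end`, as a deletion or a gVCF reference block would.

```python
# Hypothetical records, each covering an inclusive [start, end] range.
records = [
    {"id": "snv", "start": 150, "end": 150},
    {"id": "deletion", "start": 95, "end": 120},
    {"id": "ref_block", "start": 1, "end": 99},
    {"id": "far", "start": 500, "end": 500},
]


def select_by_start(records, q_start, q_end):
    """Simple selection: keep records whose start lies inside the region."""
    return [r["id"] for r in records if q_start <= r["start"] <= q_end]


def select_by_overlap(records, q_start, q_end):
    """Range intersection: keep records that overlap the region at all."""
    return [
        r["id"]
        for r in records
        if r["start"] <= q_end and r["end"] >= q_start
    ]


by_start = select_by_start(records, 100, 200)
by_overlap = select_by_overlap(records, 100, 200)
print(by_start)    # misses the deletion that starts before the region
print(by_overlap)  # includes every record touching the region
```

The selection on `start` alone silently drops the deletion that begins at position 95 but extends into the queried region. Answering the overlap form efficiently is what calls for range-aware storage and indexing.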
TileDB-VCF and TileDB Cloud are uniquely positioned to store in a lossless manner, and securely manage and analyze at scale, population variant data, with an unmatched performance-cost ratio.
Section organization
The rest of the Population Genomics section is organized as follows:
Quickstart: This is the best way to get started with TileDB-VCF. You will learn how to install TileDB-VCF in your preferred language and run basic examples.
Foundation: This contains all the background information and internal mechanics of TileDB-VCF. Learning these will provide a very deep understanding of the TileDB-VCF technology and power, and help maximize the value users get from TileDB-VCF.
Tutorials: This is a series of tutorials covering all aspects of TileDB-VCF, from basic ingestion to massively scalable computations. Running those tutorials can help users start without any prior knowledge of TileDB-VCF and become power users.
API Reference: This lists all the TileDB-VCF functionality across the various programming languages it supports, and enables fast lookups on API usage.
You can run each of the tutorials in this section in one of two ways, as specified at the beginning of each tutorial:
- Locally on your machine.
- On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.