VCF Data Ingestion

How TileDB is used to efficiently load VCF data into a TileDB-VCF dataset with optimized read performance.

Ingestion is the process of extracting variant data from VCF files, transforming the data into an analysis-ready format, and loading it into a TileDB-VCF dataset. This section describes the basic ingestion process provided by the TileDB-VCF open-source code, best practices for optimizing read performance on large datasets, and the automation provided by scalable ingestion on TileDB Cloud.

Basic ingestion

The TileDB-VCF open-source code provides basic ingestion functionality for VCF files that meet the following prerequisites:

  1. Each VCF file contains a single sample.
  2. The records in each VCF file are sorted.
  3. Each VCF file is compressed with bgzip.
  4. Each VCF file is indexed with bcftools.

VCF files are parsed with htslib. Therefore, any warnings or errors detected by htslib will be reported by TileDB-VCF. If htslib reports any errors in the VCF file, the errors must be fixed before running ingestion. When htslib reads a remote VCF file, it copies the index file to the local system to improve performance. These index files can be removed after ingestion is complete.

The basic TileDB-VCF ingestion code supports ingesting multiple batches of VCF files in parallel and appending additional batches of VCF files over time. These capabilities allow TileDB-VCF datasets to grow in a scalable fashion as described in the data model. As a dataset grows, the arrays in the dataset accumulate fragments. A large number of fragments can impact read performance if the array is not consolidated and vacuumed.

The following sections describe recommendations for maintaining optimal read performance in a large TileDB-VCF dataset, including consolidation and vacuuming of fragments.

Recommendations

This section contains recommendations for maintaining optimal read performance as a TileDB-VCF dataset grows in size. To provide maximum value to TileDB customers, these recommendations are fully automated in the TileDB-Cloud-based scalable ingestion.

Avoid fragment overlap

When reading from a TileDB array, a powerful method to reduce read query time is to prune fragments that are guaranteed not to contain results for the query. Fragment pruning is enabled by the dimensions of an array and the indexing provided by TileDB. TileDB-VCF data array dimensions, in order of precedence, are contig, start_pos, and sample, which correspond to CHROM, POS, and Sample ID in the VCF specification.

To illustrate fragment pruning, consider a read query targeting chromosome chr1, position 10,000,000, and sample HG00096. TileDB-VCF ingestion stores data for different chromosomes in different fragments. Therefore, only fragments that contain data for chr1 are considered, and all other fragments are pruned. Next, the position value 10,000,000 is considered, and any fragments that do not contain data for this position are pruned. Finally, the sample value HG00096 is considered, and any fragments that cannot contain data for this sample are pruned. Visit the Arrays Key Concepts: Reads section for more details about the TileDB read algorithm.

Fragment overlap occurs when the non-empty domains (explained in the Arrays Key Concepts: Reads section) of two or more fragments overlap in a way that prevents the fragments from being pruned. Returning to the example above, consider the sample dimension and the HG00096 query.

An example of overlapping fragments:

  • Fragment 1: sample non-empty domain = (AAA, ZZZ)
  • Fragment 2: sample non-empty domain = (AAA, ZZZ)

The sample value HG00096 could be found in either fragment, so neither fragment can be pruned.

An example of non-overlapping fragments:

  • Fragment 1: sample non-empty domain = (AAA, MMM)
  • Fragment 2: sample non-empty domain = (NNN, ZZZ)

With this more optimal fragment layout, the sample value HG00096 can only be found in Fragment 1. Therefore, Fragment 2 is pruned from the query.

As mentioned above, TileDB-VCF ingestion avoids fragment overlap in the contig dimension by limiting the data in each fragment to one chromosome. Fragment overlap in the start_pos dimension is less of a concern, since each fragment contains data for an entire chromosome and overlap is expected. However, overlap in the sample dimension has a greater impact on read query performance. Expanding on the example above, if overlap in the sample dimension prevents a read query from pruning fragments that do not contain data for sample HG00096, then all overlapping fragments must be inspected to find the data for HG00096.

To prevent fragments from overlapping in the sample dimension, create batches of samples with non-overlapping sample names: generate a list of the sample names to be ingested, sort the names lexicographically, and split the sorted list into sample batches. These steps are fully automated in the TileDB-Cloud-based scalable ingestion.
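The batching steps just described can be sketched in plain Python. The `make_sample_batches` helper is hypothetical, not part of the TileDB-VCF API:

```python
def make_sample_batches(sample_names, batch_size=100):
    """Sort sample names lexicographically and split them into batches,
    so that the sample-name ranges of different batches never overlap."""
    ordered = sorted(sample_names)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

names = ["HG00188", "HG00096", "NA12878", "HG00103", "NA12891", "HG00099"]
batches = make_sample_batches(names, batch_size=3)
print(batches)
# [['HG00096', 'HG00099', 'HG00103'], ['HG00188', 'NA12878', 'NA12891']]

# Consecutive batches never overlap: the max of one is below the min of the next.
assert all(max(a) < min(b) for a, b in zip(batches, batches[1:]))
```

Because each batch is ingested as its own set of fragments, sorting before batching guarantees the fragments' sample non-empty domains do not overlap.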

Anchor gap

As described in the data model, TileDB-VCF injects anchors into the data to break up long ranges and support rapid retrieval of interval intersections. The anchor gap defines the distance between the anchors inserted into variants with long ranges. Consider the following tradeoffs between storage size and query performance when setting the size of the anchor gap:

  • A larger anchor gap inserts fewer anchors, which reduces the additional storage size. However, each query is expanded by a larger amount, potentially reading more unneeded data that must be filtered out, which increases query time.
  • A smaller anchor gap inserts more anchors, which increases the additional storage size. However, each query is expanded by a smaller amount, reducing reads of unneeded data and the impact on query time.
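As a rough sketch of this tradeoff, the code below counts hypothetical anchor positions for a single long record at two gap settings. The exact anchor layout is an internal detail of TileDB-VCF; this only illustrates how the gap controls the anchor count (and therefore the extra storage):

```python
def anchor_positions(start, end, anchor_gap):
    """Hypothetical positions of anchors injected into a record spanning
    [start, end): one anchor every `anchor_gap` bases after the start."""
    return list(range(start + anchor_gap, end, anchor_gap))

# A 10 Mbp CNV-like record:
long_range = (1_000_000, 11_000_000)

# Smaller gap: more anchors (more storage, tighter query expansion).
print(len(anchor_positions(*long_range, anchor_gap=100_000)))    # 99 anchors

# Larger gap: fewer anchors (less storage, larger query expansion).
print(len(anchor_positions(*long_range, anchor_gap=1_000_000)))  # 9 anchors
```

Every query's start range is expanded by the anchor gap, which is why a larger gap can pull in more unneeded data per query.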

The default anchor gap value works well with ranges in typical gVCF files. For variant data with much longer ranges, like CNV data, the anchor gap should be increased to 1,000,000.

Consolidation requirements

Maintaining optimal read performance for large datasets with a large number of fragments in each array requires consolidation and vacuuming of commits, fragment metadata, and array fragments. Consolidation of array fragments is a complex operation that is best performed in a distributed manner.

When ingesting a large TileDB-VCF dataset, it is important to periodically consolidate the arrays in the dataset. This periodic consolidation maintains a low array open time, which is required by the resume ingestion feature.

A final requirement is that ingestion (writing new fragments to the arrays) must be paused while the arrays are being consolidated.

All consolidation requirements are fully automated in the TileDB-Cloud-based scalable ingestion.

Ingestion on TileDB Cloud

Scalable ingestion on TileDB Cloud fully automates preparing VCF files to meet most prerequisites described in the Basic ingestion section, specifically:

  1. If a VCF file is not sorted or bgzipped, the VCF should be provided as a *.vcf file (not a gzipped *.vcf.gz file). The VCF file will be sorted and bgzipped on the fly using bcftools.
  2. If a VCF file is not indexed, it will be indexed automatically using bcftools.

For the remaining prerequisite, single-sample VCF files, TileDB Cloud provides a solution for efficiently splitting multi-sample VCF files into single-sample VCF files in a distributed fashion.

To avoid fragment overlap, TileDB-VCF scalable ingestion runs the following steps:

  1. Given a URI and search pattern or a URI with a list of VCF URIs, create a list of VCF URIs to ingest.
  2. For each VCF URI, read the sample name and check for a valid index file. Save this information in the manifest array.
  3. Read the list of sample names from the manifest and sort the list lexicographically.
  4. Split the sorted list of sample names into batches (default size of 100 samples) to ensure no sample names overlap between the fragments created by each batch.
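The manifest step (step 2 above) can be sketched as follows. The `build_manifest` helper and its index-detection logic are hypothetical stand-ins; real ingestion reads the sample name from the VCF header via htslib and stores the results in the manifest array:

```python
import pathlib
import tempfile

def build_manifest(vcf_uris):
    """Hypothetical sketch of the manifest step: for each VCF URI, record a
    sample name and whether a .tbi or .csi index file sits next to the VCF."""
    manifest = []
    for uri in vcf_uris:
        has_index = any(pathlib.Path(uri + ext).exists() for ext in (".tbi", ".csi"))
        # Stand-in for reading the sample name from the VCF header.
        sample = pathlib.Path(uri).name.split(".")[0]
        manifest.append({"uri": uri, "sample": sample, "indexed": has_index})
    return manifest

with tempfile.TemporaryDirectory() as tmp:
    indexed = str(pathlib.Path(tmp, "HG00096.vcf.gz"))
    pathlib.Path(indexed).touch()
    pathlib.Path(indexed + ".tbi").touch()          # index present
    unindexed = str(pathlib.Path(tmp, "NA12878.vcf.gz"))
    pathlib.Path(unindexed).touch()                 # no index
    for entry in build_manifest([indexed, unindexed]):
        print(entry["sample"], entry["indexed"])
```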

To meet the requirement of periodic consolidation, scalable ingestion creates a task graph with the following pattern:

  1. Ingest N batches of samples in parallel (default 40).

Note

The number of parallel batches is limited by the bandwidth of the object store backend. For example, Amazon S3 is limited to 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix.

  2. Consolidate and vacuum the commits and fragment metadata in all arrays.
  3. While more sample batches remain to be ingested, return to step 1.
  4. Consolidate and vacuum the fragments in the variant statistics arrays.
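The looping pattern of this task graph can be sketched as a toy Python driver. The ingest and consolidation actions are placeholders, and `run_ingestion` is a hypothetical helper; on TileDB Cloud these run as distributed task graph nodes, not threads:

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_batch(batch):
    # Placeholder for ingesting one batch of samples into the dataset.
    return f"ingest {len(batch)}"

def run_ingestion(sample_batches, parallel=40):
    """Toy sketch of the looping pattern: ingest up to `parallel` batches
    at a time, consolidate the lightweight metadata after each wave, and
    finish with the variant statistics arrays."""
    log = []
    for i in range(0, len(sample_batches), parallel):
        wave = sample_batches[i:i + parallel]
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            log.extend(pool.map(ingest_batch, wave))
        log.append("consolidate commits + fragment metadata")
    log.append("consolidate variant statistics fragments")
    return log

for step in run_ingestion([["s1", "s2"], ["s3"], ["s4", "s5"]], parallel=2):
    print(step)
```

Interleaving the cheap commit and fragment-metadata consolidation with ingestion waves is what keeps the array open time low as the dataset grows.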

Consolidation on TileDB Cloud

Consolidation is a key ingredient to providing a scalable TileDB-VCF dataset with optimized read performance. As described above, consolidation is fully integrated into the scalable ingestion solution. The consolidation of commits and fragment metadata is straightforward, since these are baseline consolidation operations. The consolidation of fragments in the variant statistics arrays is more complex and is described in variant statistics consolidation.

Note

In general, consolidation of TileDB arrays is important to optimize read performance. Other TileDB arrays, like annotations, must be consolidated separately since they are not included in TileDB-VCF dataset consolidation.

A large majority of the data in a TileDB-VCF dataset is stored in the data array. Consolidation of the data array addresses the following goals:

  1. Optimize the array fragment sizes while maintaining the desired pruning characteristics of the array dimensions. This goal is achieved by creating groups of fragments with the same contig value and consolidating each group of fragments with a target size of 1 GiB per fragment.
  2. Materialize deletions with the option of consolidating deleted cells into a new fragment to preserve time traveling or permanently purging the deletions to optimize storage and read performance.
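The first goal can be sketched as a simple planning pass. The fragment sizes and the `plan_consolidation` helper are hypothetical; the real planner runs inside TileDB and considers more than raw size:

```python
from itertools import groupby

GIB = 1024 ** 3

def plan_consolidation(fragments, target_bytes=GIB):
    """Hypothetical sketch: group fragments by contig, then pack each group
    into consolidation jobs of roughly `target_bytes` each.
    `fragments` is a list of (contig, size_in_bytes) tuples."""
    jobs = []
    ordered = sorted(fragments, key=lambda f: f[0])
    for contig, group in groupby(ordered, key=lambda f: f[0]):
        job, job_size = [], 0
        for _, size in group:
            if job and job_size + size > target_bytes:
                jobs.append((contig, job))
                job, job_size = [], 0
            job.append(size)
            job_size += size
        if job:
            jobs.append((contig, job))
    return jobs

# Five 400 MiB chr1 fragments and one 100 MiB chr2 fragment:
frags = [("chr1", 400 * 1024**2)] * 5 + [("chr2", 100 * 1024**2)]
for contig, sizes in plan_consolidation(frags):
    print(contig, len(sizes))  # jobs of at most ~1 GiB, never mixing contigs
```

Keeping each job within a single contig preserves the fragment pruning behavior described earlier, while the ~1 GiB target keeps the fragment count (and open time) low.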

Similar to the variant statistics arrays, consolidation of the data array is fully automated and distributed on TileDB Cloud to reduce the latency of consolidation.
