TileDB-VCF Storage Format Spec

The storage format specification of TileDB-VCF.

A TileDB-VCF dataset is composed of a group of two or more separate TileDB arrays:

  • Data array: A 3D sparse array for the actual genomic variants and associated fields/attributes
  • VCF headers array: A 1D sparse array for the metadata stored in each single-sample VCF header
  • Manifest array: A 1D sparse array holding information about the VCF files ingested into the dataset
  • Log array: A 1D sparse array holding log information from the ingestion tasks
  • Variant stats array: A 2D sparse array holding data used to compute internal allele frequency
  • Allele count array: A 2D sparse array holding counts of unique chrom-pos-ref-alt variants in the dataset
  • Sample stats array: A 1D sparse array holding variant summary statistics for each sample

Data array

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 3D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| contig | TILEDB_STRING_ASCII | CHROM |
| start_pos | uint32_t | VCF POS plus TileDB anchors |
| sample | TILEDB_STRING_ASCII | Sample name |

The coordinates of the 3D array are the contig along the first dimension, the chromosomal start position of the variant along the second dimension, and the sample name along the third dimension.

Attributes

Each field in a single-sample VCF record has a corresponding attribute in the schema.

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| end_pos | uint32_t | VCF END position of VCF records |
| qual | float | VCF QUAL field |
| alleles | var<char> | CSV list of REF and ALT VCF fields |
| id | var<char> | VCF ID field |
| filter_ids | var<int32_t> | Vector of integer IDs of entries in the FILTER VCF field |
| real_start_pos | uint32_t | VCF POS (no anchors) |
| info | var<uint8_t> | Byte blob containing any INFO fields that are not stored as explicit attributes |
| fmt | var<uint8_t> | Byte blob containing any FMT fields that are not stored as explicit attributes |
| info_* | var<uint8_t> | One or more attributes storing specific VCF INFO fields (e.g. info_DP, info_MQ) |
| fmt_* | var<uint8_t> | One or more attributes storing specific VCF FORMAT fields (e.g. fmt_GT, fmt_MIN_DP) |

The info_* and fmt_* attributes allow individual INFO or FMT VCF fields to be extracted into explicit array attributes. This can be beneficial if your queries frequently access only a subset of the INFO or FMT fields, as no unrelated data then needs to be fetched from storage.

Tip

During array creation, you can choose which fields to extract as explicit array attributes.

Any extra info or format fields not extracted as explicit array attributes are stored in the byte blob attributes, info and fmt.
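
To make the split between explicit attributes and the byte blobs concrete, here is a minimal sketch in plain Python. It does not use the TileDB-VCF library: `split_fields` is a hypothetical helper, and the JSON encoding of the leftover fields is only a stand-in for the real binary blob layout, which this page does not describe.

```python
import json


def split_fields(fields, extra_attributes, prefix):
    """Split VCF INFO or FMT fields into explicit attributes and a leftover blob.

    fields: mapping of field name to value, e.g. {"DP": 35}.
    extra_attributes: attribute names chosen at array creation, e.g. {"info_DP"}.
    prefix: "info" or "fmt".
    """
    explicit = {}
    leftover = {}
    for name, value in fields.items():
        attribute = f"{prefix}_{name}"
        if attribute in extra_attributes:
            explicit[attribute] = value
        else:
            leftover[name] = value
    # Assumption: JSON is a placeholder; the real blob uses a binary layout.
    blob = json.dumps(leftover, sort_keys=True).encode("utf-8")
    return explicit, blob


info = {"DP": 35, "MQ": 60.0, "AF": 0.5}
explicit, blob = split_fields(info, {"info_DP", "info_MQ"}, "info")
print(explicit)  # fields a query can read on their own
print(blob)      # everything else, packed together
```

A query that only needs DP would read the `info_DP` attribute and never touch the blob, which is the benefit described above.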

Metadata

The following metadata values are updated during array creation, and are used during the export phase:

  • anchor_gap - Anchor gap value
  • extra_attributes - List of INFO or FMT field names that are stored as explicit array attributes
  • version - Array schema version

The metadata is stored as “array metadata” in the sparse data array.

Warning

When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header.
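
A pre-ingestion check for this requirement could look like the following sketch. It is not part of TileDB-VCF: `contig_ids` and `mismatched_samples` are hypothetical helpers, and treating contig order as significant is an assumption that is stricter than comparing sets.

```python
import re

# Matches header lines such as "##contig=<ID=chr1,length=248956422>".
CONTIG_LINE = re.compile(r"^##contig=<ID=([^,>]+)")


def contig_ids(header_text):
    """Return the contig IDs declared in a VCF header, in order of appearance."""
    ids = []
    for line in header_text.splitlines():
        match = CONTIG_LINE.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def mismatched_samples(headers):
    """Return sample names whose contig list differs from the first sample's.

    headers: mapping of sample name to the full text of its VCF header.
    Assumption: order matters, which is stricter than a set comparison.
    """
    items = list(headers.items())
    if not items:
        return []
    reference = contig_ids(items[0][1])
    return [name for name, text in items[1:] if contig_ids(text) != reference]


headers = {
    "s1": "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=100>\n##contig=<ID=chr2>\n",
    "s2": "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=100>\n##contig=<ID=chr2>\n",
    "s3": "##fileformat=VCFv4.2\n##contig=<ID=chr1,length=100>\n",
}
bad = mismatched_samples(headers)
print(bad)
```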

VCF headers array

The vcf_headers array stores the original text of every ingested VCF header in order to:

  1. Ensure the original VCF file can be fully recovered for any given sample.
  2. Reconstruct an htslib header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string.

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

| Dimension Name | TileDB Datatype | Description |
| --- | --- | --- |
| sample | TILEDB_STRING_ASCII | Sample name |

Attributes

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| header | var<char> | Original text of the VCF header |

Manifest array

The manifest array is an optional array added by scalable ingestion on TileDB Cloud. The array is used to build a list of VCF URIs sorted by sample name and to keep track of the VCF files ingested in the dataset.

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| sample_name | TILEDB_STRING_ASCII | Sample name |

Attributes

For each sample, the following attributes are stored:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| status | TILEDB_STRING_ASCII | Status of VCF file check |
| vcf_uri | TILEDB_STRING_ASCII | VCF file URI |
| vcf_bytes | uint64 | Size of the original VCF file in bytes |
| index_uri | TILEDB_STRING_ASCII | VCF index file URI |
| index_bytes | uint64 | Size of the original VCF index file in bytes |
| records | uint64 | Number of records in the VCF file |

The status attribute stores the result of the VCF file check, which looks for missing sample names, multiple samples in one VCF file, duplicate sample names in a batch, and missing or bad index files.
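
The checks listed above can be sketched as a small function over a batch of files. This is an illustration only: `check_batch`, the input layout, and every status string are hypothetical, and the actual values written to the status attribute are not documented on this page. The sketch also covers only a missing index, not a bad one.

```python
from collections import Counter


def check_batch(files):
    """Assign a status to each VCF file in a batch.

    files: list of dicts with keys "vcf_uri", "sample_names" (list of str),
    and "index_uri" (str or None). Status strings are hypothetical.
    """
    # Count how often each sample name appears across the whole batch.
    counts = Counter(name for f in files for name in f["sample_names"])
    statuses = {}
    for f in files:
        names = f["sample_names"]
        if not names:
            status = "missing sample name"
        elif len(names) > 1:
            status = "multiple samples"
        elif counts[names[0]] > 1:
            status = "duplicate sample name"
        elif not f["index_uri"]:
            status = "missing index"
        else:
            status = "ok"
        statuses[f["vcf_uri"]] = status
    return statuses


batch = [
    {"vcf_uri": "a.vcf.gz", "sample_names": ["s1"], "index_uri": "a.vcf.gz.tbi"},
    {"vcf_uri": "b.vcf.gz", "sample_names": ["s2", "s3"], "index_uri": "b.vcf.gz.tbi"},
    {"vcf_uri": "c.vcf.gz", "sample_names": ["s4"], "index_uri": None},
    {"vcf_uri": "d.vcf.gz", "sample_names": [], "index_uri": "d.vcf.gz.tbi"},
]
statuses = check_batch(batch)
print(statuses)
```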

Log array

The log array is an optional array added by scalable ingestion on TileDB Cloud. It provides a flexible time-series array for storing application-specific events. In the case of TileDB-VCF scalable ingestion, the log array stores information about the ingestion process, which can be used for debugging, monitoring, and reporting.

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Description |
| --- | --- | --- |
| time_ms | uint64 | Timestamp for the log event |

Attributes

For each log event, the following attributes are stored:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| id | TILEDB_STRING_ASCII | Log event ID |
| op | TILEDB_STRING_ASCII | Log event operation |
| data | TILEDB_STRING_ASCII | Log event data |
| extra | TILEDB_STRING_ASCII | Log event extra data |

Variant stats array

The variant_stats array holds data used to compute internal allele frequency (IAF) as described in the data model section.

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 2D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| contig | TILEDB_STRING_ASCII | CHROM from the VCF file |
| pos | uint32 | POS from the VCF file, 0-indexed |

Attributes

For each contig-pos location, the following attributes are stored:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| allele | TILEDB_STRING_ASCII | Normalized allele value |
| ac | int32 | Allele count |
| n_hom | int32 | Number of homozygous calls |
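
As a rough illustration of how these attributes can feed an IAF calculation, the sketch below assumes that IAF for an allele is its ac divided by the sum of ac over all alleles at the same contig-pos location. That formula is an assumption here (the data model section is the authoritative definition), and `internal_allele_frequency` and the allele labels are hypothetical.

```python
from collections import defaultdict


def internal_allele_frequency(rows):
    """Compute IAF per (contig, pos, allele) from variant stats rows.

    rows: iterable of (contig, pos, allele, ac) tuples. A sparse array may
    hold several cells for the same allele, so counts are summed first.
    Assumption: IAF = ac / (sum of ac over all alleles at the position).
    """
    totals = defaultdict(int)
    per_allele = defaultdict(int)
    for contig, pos, allele, ac in rows:
        totals[(contig, pos)] += ac
        per_allele[(contig, pos, allele)] += ac
    return {
        key: ac / totals[key[:2]]
        for key, ac in per_allele.items()
        if totals[key[:2]] > 0
    }


rows = [
    ("chr1", 100, "ref", 4),  # allele labels are illustrative only
    ("chr1", 100, "ref", 2),
    ("chr1", 100, "T", 2),
]
iaf = internal_allele_frequency(rows)
print(iaf)
```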

Allele count array

The allele_count array holds counts of unique chrom-pos-ref-alt variants as described in the data model section.

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 2D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| contig | TILEDB_STRING_ASCII | CHROM from the VCF file |
| pos | uint32 | POS from the VCF file, 0-indexed |

Attributes

For each contig-pos location, the following attributes are stored:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| ref | TILEDB_STRING_ASCII | REF from the VCF file |
| alt | TILEDB_STRING_ASCII | ALT from the VCF file |
| filter | TILEDB_STRING_ASCII | FILTER from the VCF file |
| gt | TILEDB_STRING_ASCII | Normalized FORMAT/GT from the VCF file |
| count | int32 | Number of records with the same attribute values |

The filter and gt attributes allow further filtering of the allele count data. The value of gt is normalized to one of the following values:

  • 1 - Homozygous alternate (haploid)
  • .,1 - Heterozygous with one missing allele
  • 0,1 - Heterozygous
  • 1,1 - Homozygous alternate
  • 1,2 - Multiallelic heterozygous alternate

Sample stats array

The sample_stats array holds variant summary statistics for each sample as described in the data model section.

Basic schema parameters

| Parameter | Value |
| --- | --- |
| Array type | Sparse |
| Rank | 1D |
| Cell order | Row-major |
| Tile order | Row-major |

Dimensions

The dimensions in the schema are:

| Dimension Name | TileDB Datatype | Corresponding VCF Field |
| --- | --- | --- |
| sample | TILEDB_STRING_ASCII | Sample name |

Attributes

For each sample, the following attributes are stored:

| Attribute Name | TileDB Datatype | Description |
| --- | --- | --- |
| dp_sum | uint64 | Read depth sum |
| dp_sum2 | uint64 | Sum of squared read depths (for stddev aggregation) |
| dp_count | uint64 | Read depth counts |
| dp_min | uint64 | Read depth minimum value |
| dp_max | uint64 | Read depth maximum value |
| gq_sum | uint64 | Genotype quality sum |
| gq_sum2 | uint64 | Sum of squared genotype qualities (for stddev aggregation) |
| gq_count | uint64 | Genotype quality counts |
| gq_min | uint64 | Genotype quality minimum value |
| gq_max | uint64 | Genotype quality maximum value |
| n_records | uint64 | Number of records |
| n_called | uint64 | Number of calls |
| n_not_called | uint64 | Number of missing calls |
| n_hom_ref | uint64 | Number of homozygous reference calls |
| n_het | uint64 | Number of heterozygous calls |
| n_singleton | uint64 | Number of singletons |
| n_snp | uint64 | Number of SNPs |
| n_insertion | uint64 | Number of insertions |
| n_deletion | uint64 | Number of deletions |
| n_transition | uint64 | Number of transitions |
| n_transversion | uint64 | Number of transversions |
| n_star | uint64 | Number of star alleles |
| n_multiallelic | uint64 | Number of multiallelic records |
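
Storing a sum, a sum of squares, and a count makes it possible to merge partial results by simple addition and still recover the mean and standard deviation afterwards. The sketch below shows this for read depth. It assumes dp_sum2 is the sum of squared depths and computes the population standard deviation; `merge` and `mean_and_stddev` are hypothetical helpers, not TileDB-VCF functions.

```python
import math


def merge(a, b):
    """Merge two partial aggregates given as (sum, sum_of_squares, count)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_stddev(total, total_sq, count):
    """Return (mean, population stddev) from the three stored aggregates."""
    if count == 0:
        return None
    mean = total / count
    # Var(X) = E[X^2] - E[X]^2; clamp to zero to absorb rounding error.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Depths 10 and 20 aggregated in one partial result, depth 30 in another.
part_a = (10 + 20, 10**2 + 20**2, 2)
part_b = (30, 30**2, 1)
dp_sum, dp_sum2, dp_count = merge(part_a, part_b)
mean, stddev = mean_and_stddev(dp_sum, dp_sum2, dp_count)
print(mean, stddev)
```

The same arithmetic applies to gq_sum, gq_sum2, and gq_count.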

Configurable parameters

During array creation, you can specify different array-related parameters, including the following:

  • Array data tile capacity (default 10,000).
  • The “anchor gap” size (default 1,000).
  • The list of INFO and FMT fields to store as explicit array attributes (default is none).

Once chosen, these parameters cannot be changed.

During sample ingestion, you can specify the sample batch size (the default value is 10).

The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, you should perform adequate testing to determine good values for these parameters before ingesting a large amount of data into an array.

Putting it all together

To summarize, three main entities exist in this data model:

  • The variant data array (3D sparse)
  • The general metadata, stored in the variant data array as metadata
  • The VCF header array (1D sparse)

Three arrays hold variant statistics:

  • The variant stats array (2D sparse)
  • The allele count array (2D sparse)
  • The sample stats array (1D sparse)

Two optional arrays are added by scalable ingestion on TileDB Cloud:

  • The manifest array (1D sparse)
  • The log array (1D sparse)

These components form the “TileDB-VCF dataset.” Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:

<dataset_uri>/
  |_ allele_count/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ data/
      |_ __schema
      |_ __meta/
            |_ <general-metadata-here>
      ... <other array directories/fragments and files>
  |_ log/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ manifest/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ sample_stats/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ variant_stats/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ vcf_headers/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ __tiledb_group.tdb

The root of the dataset, <dataset_uri>, is a TileDB group, and all of the arrays described above are members of the dataset group.
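
For a dataset on a local filesystem, the hierarchy above can be inspected with a few lines of standard-library Python. This sketch only checks for the directory names shown in the listing and does not open anything through TileDB. `present_members` is a hypothetical helper, and the temporary directory stands in for a real <dataset_uri>.

```python
import tempfile
from pathlib import Path

# Member array names taken from the directory listing above.
MEMBER_ARRAYS = (
    "data",
    "vcf_headers",
    "variant_stats",
    "allele_count",
    "sample_stats",
    "manifest",
    "log",
)


def present_members(dataset_dir):
    """Report which member arrays exist under a local dataset directory."""
    root = Path(dataset_dir)
    return {name: (root / name / "__schema").exists() for name in MEMBER_ARRAYS}


# Build a toy layout containing only the data and vcf_headers arrays.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data", "vcf_headers"):
        (Path(tmp) / name / "__schema").mkdir(parents=True)
    found = present_members(tmp)

print(found)
```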
