TileDB-VCF Storage Format Spec

The storage format specification of TileDB-VCF.

A TileDB-VCF dataset is composed of a group of two or more separate TileDB arrays:

Data array: A 3D sparse array for the actual genomic variants and associated fields/attributes
VCF header array: A 1D sparse array for the metadata stored in each single-sample VCF header
Manifest: A 1D sparse array holding information about the VCF files ingested into the dataset
Log: A 1D sparse array holding log information from the ingestion tasks
Variant stats array: A 2D sparse array holding data used to compute internal allele frequency
Allele count array: A 2D sparse array holding counts of unique chrom-pos-ref-alt variants in the dataset
Sample stats array: A 1D sparse array holding variant summary statistics for each sample

Data array

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	3D
Cell order	Row-major
Tile order	Row-major

Dimensions

The dimensions in the schema are:

Dimension Name	TileDB Datatype	Corresponding VCF Field
`contig`	`TILEDB_STRING_ASCII`	`CHROM`
`start_pos`	`uint32_t`	VCF `POS` plus TileDB anchors
`sample`	`TILEDB_STRING_ASCII`	Sample name

The coordinates of the 3D array are the contig along the first dimension, the chromosomal position of each variant's start along the second dimension, and the sample name along the third dimension.
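The `start_pos` dimension holds both real record starts and anchors. As a rough illustration only, the sketch below assumes that a record spanning more than the anchor gap is stored again at extra coordinates spaced one gap apart, with each copy carrying the record's true start in `real_start_pos`. The function name `anchor_positions` and the exact spacing rule are assumptions made for this example, not the library's implementation.

```python
def anchor_positions(start, end, anchor_gap=1000):
    """Return the start_pos coordinates for one record: its real start
    plus hypothetical anchors every `anchor_gap` bases up to `end`.

    The spacing rule is an assumption for illustration only.
    """
    if anchor_gap <= 0:
        raise ValueError("anchor_gap must be positive")
    positions = [start]
    pos = start + anchor_gap
    while pos <= end:
        positions.append(pos)
        pos += anchor_gap
    return positions


# A long record (for example a gVCF reference block) from 100 to 2600.
cells = [
    {"start_pos": p, "real_start_pos": 100, "end_pos": 2600}
    for p in anchor_positions(100, 2600)
]
print([c["start_pos"] for c in cells])  # [100, 1100, 2100]

# A short record needs no anchors.
print(anchor_positions(100, 150))  # [100]
```

Under this assumption, a range query only has to look back at most one anchor gap from its start to find every record that overlaps it.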

Attributes

Each field in a single-sample VCF record has a corresponding attribute in the schema.

Attribute Name	TileDB Datatype	Description
`end_pos`	`uint32_t`	VCF `END` position of VCF records
`qual`	`float`	VCF `QUAL` field
`alleles`	`var<char>`	CSV list of `REF` and `ALT` VCF fields
`id`	`var<char>`	VCF `ID` field
`filter_ids`	`var<int32_t>`	Vector of integer IDs of entries in the `FILTER` VCF field
`real_start_pos`	`uint32_t`	VCF `POS` (no anchors)
`info`	`var<uint8_t>`	Byte blob containing any `INFO` fields that are not stored as explicit attributes
`fmt`	`var<uint8_t>`	Byte blob containing any `FORMAT` fields that are not stored as explicit attributes
`info_*`	`var<uint8_t>`	One or more attributes storing specific VCF `INFO` fields (e.g. `info_DP`, `info_MQ`, etc. )
`fmt_*`	`var<uint8_t>`	One or more attributes storing specific VCF `FORMAT` fields (e.g. `fmt_GT`, `fmt_MIN_DP`, etc.)

The `info_*` and `fmt_*` attributes allow individual `INFO` or `FORMAT` VCF fields to be extracted into explicit array attributes. This is beneficial if your queries frequently access only a subset of the `INFO` or `FORMAT` fields, because no unrelated data then needs to be fetched from storage.

Tip

During array creation, you can choose which fields to extract as explicit array attributes.

Any extra `INFO` or `FORMAT` fields not extracted as explicit array attributes are stored in the byte blob attributes, `info` and `fmt`.
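The routing between explicit attributes and the catch-all blobs can be pictured with a short sketch. It only shows which attribute each field would land in; the binary encoding of the `info` and `fmt` blobs is not described here and is not modeled. The helper name `split_fields` and the sample values are hypothetical.

```python
def split_fields(fields, extracted, prefix):
    """Route VCF fields either to explicit attributes (prefix_NAME)
    or to the set of fields destined for the catch-all blob attribute."""
    explicit = {}
    leftover = {}
    for name, value in fields.items():
        if name in extracted:
            explicit[f"{prefix}_{name}"] = value
        else:
            leftover[name] = value
    return explicit, leftover


# INFO fields of one record; DP and MQ were chosen at array creation.
info = {"DP": 35, "MQ": 60.0, "AC": 2}
explicit, blob_fields = split_fields(info, extracted={"DP", "MQ"}, prefix="info")

print(explicit)     # {'info_DP': 35, 'info_MQ': 60.0}
print(blob_fields)  # {'AC': 2}
```

A query that reads only `info_DP` would then touch a single attribute instead of decoding the whole blob.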

Metadata

The following metadata values are updated during array creation, and are used during the export phase:

`anchor_gap` - Anchor gap value
`extra_attributes` - List of `INFO` or `FORMAT` field names that are stored as explicit array attributes
`version` - Array schema version

The metadata is stored as “array metadata” in the sparse data array.

Warning

When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header.

VCF headers array

The vcf_headers array stores the original text of every ingested VCF header in order to:

Ensure the original VCF file can be fully recovered for any given sample.
Reconstruct an htslib header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string, etc.

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	1D
Cell order	Row-major
Tile order	Row-major

Dimensions

Dimension Name	TileDB Datatype	Description
`sample`	`TILEDB_STRING_ASCII`	Sample name

Attributes

Attribute Name	TileDB Datatype	Description
`header`	`var<char>`	Original text of the VCF header

Manifest array

The manifest array is an optional array added by scalable ingestion on TileDB Cloud. The array is used to build a list of VCF URIs sorted by sample name and to keep track of the VCF files ingested in the dataset.

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	1D
Cell order	Row-major
Tile order	Row-major

Dimensions

The dimensions in the schema are:

Dimension Name	TileDB Datatype	Corresponding VCF Field
`sample_name`	`TILEDB_STRING_ASCII`	Sample name

Attributes

For each sample, the following attributes are stored:

Attribute Name	TileDB Datatype	Description
`status`	`TILEDB_STRING_ASCII`	Status of VCF file check
`vcf_uri`	`TILEDB_STRING_ASCII`	VCF file URI
`vcf_bytes`	`uint64`	Size of the original VCF file in bytes
`index_uri`	`TILEDB_STRING_ASCII`	VCF index file URI
`index_bytes`	`uint64`	Size of the original VCF index file in bytes
`records`	`uint64`	Number of records in the VCF file

The `status` attribute stores the result of the VCF file check, which looks for missing sample names, multiple samples in one VCF file, duplicate sample names in a batch, and missing or bad index files.
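A minimal sketch of such a check is shown below. The input layout (`sample_names`, `index_ok`) and the status strings are hypothetical, chosen only to mirror the four conditions listed above; the values actually written to `status` are not specified here.

```python
from collections import Counter


def check_batch(entries):
    """Classify each VCF in a batch.

    entries: list of dicts with hypothetical keys 'vcf_uri',
    'sample_names' (names found in the VCF header) and 'index_ok'.
    Returns {vcf_uri: status}; status strings are illustrative.
    """
    # Count names only for single-sample files so duplicates can be flagged.
    name_counts = Counter(
        e["sample_names"][0] for e in entries if len(e["sample_names"]) == 1
    )
    statuses = {}
    for e in entries:
        names = e["sample_names"]
        if not names:
            status = "missing sample name"
        elif len(names) > 1:
            status = "multiple samples"
        elif name_counts[names[0]] > 1:
            status = "duplicate sample name"
        elif not e["index_ok"]:
            status = "bad index"
        else:
            status = "ok"
        statuses[e["vcf_uri"]] = status
    return statuses


batch = [
    {"vcf_uri": "a.vcf.gz", "sample_names": ["S1"], "index_ok": True},
    {"vcf_uri": "b.vcf.gz", "sample_names": [], "index_ok": True},
    {"vcf_uri": "c.vcf.gz", "sample_names": ["S2", "S3"], "index_ok": True},
    {"vcf_uri": "d.vcf.gz", "sample_names": ["S4"], "index_ok": False},
    {"vcf_uri": "e.vcf.gz", "sample_names": ["S5"], "index_ok": True},
    {"vcf_uri": "f.vcf.gz", "sample_names": ["S5"], "index_ok": True},
]
result = check_batch(batch)
for uri, status in result.items():
    print(uri, status)
```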

Log array

The log array is an optional array added by scalable ingestion on TileDB Cloud. The log array provides a flexible, time-series array to store application specific events. In the case of TileDB-VCF scalable ingestion, the log array is used to store information about the ingestion process, which can be used for debugging, monitoring, and reporting.

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	1D
Cell order	Row-major
Tile order	Row-major

Dimensions

The dimensions in the schema are:

Dimension Name	TileDB Datatype	Description
`time_ms`	`uint64`	Timestamp for the log event

Attributes

For each log event, the following attributes are stored:

Attribute Name	TileDB Datatype	Description
`id`	`TILEDB_STRING_ASCII`	Log event ID
`op`	`TILEDB_STRING_ASCII`	Log event operation
`data`	`TILEDB_STRING_ASCII`	Log event data
`extra`	`TILEDB_STRING_ASCII`	Log event extra data
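To make the shape of a log cell concrete, the sketch below models one event as a plain Python record with the dimension and attributes from the tables above. It assumes `time_ms` is milliseconds since the Unix epoch and that `data` carries a JSON string; both are assumptions, as are the helper name `make_event` and the `ingest_batch` operation name.

```python
import json
import time
import uuid
from dataclasses import dataclass


@dataclass
class LogEvent:
    time_ms: int     # dimension: event timestamp (assumed epoch milliseconds)
    id: str          # attribute: log event ID
    op: str          # attribute: log event operation
    data: str        # attribute: log event data
    extra: str = ""  # attribute: log event extra data


def make_event(op, data, extra=""):
    """Build an event stamped with the current time in milliseconds."""
    return LogEvent(
        time_ms=time.time_ns() // 1_000_000,
        id=str(uuid.uuid4()),
        op=op,
        data=data,
        extra=extra,
    )


event = make_event("ingest_batch", json.dumps({"samples": 10}))
print(event.op, event.time_ms, event.data)
```

Because the timestamp is the only dimension, events sort naturally by time, which suits the debugging and monitoring uses described above.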

Variant stats array

The variant_stats array holds data used to compute internal allele frequency (IAF) as described in the data model section.

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	2D
Cell order	Row-major
Tile order	Row-major

Dimensions

The dimensions in the schema are:

Dimension Name	TileDB Datatype	Corresponding VCF Field
`contig`	`TILEDB_STRING_ASCII`	`CHROM` from the VCF file
`pos`	`uint32`	`POS` from the VCF file, 0-indexed

Attributes

For each contig-pos location, the following attributes are stored:

Attribute Name	TileDB Datatype	Description
`allele`	`TILEDB_STRING_ASCII`	Normalized allele value
`ac`	`int32`	Allele count
`n_hom`	`int32`	Number of homozygous calls
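As an illustration of how these cells can feed an IAF calculation, the sketch below assumes that the frequency of an allele is its `ac` divided by the sum of `ac` over all alleles stored at the same contig and position. That definition, the function name, and the placeholder allele strings are assumptions for this example; the data model section is the authority on the actual computation.

```python
from collections import defaultdict


def internal_allele_frequency(rows):
    """rows: list of (contig, pos, allele, ac) tuples.

    Returns {(contig, pos, allele): ac / total ac at that position}.
    Assumes the per-position total of `ac` acts as the allele number.
    """
    totals = defaultdict(int)
    for contig, pos, _allele, ac in rows:
        totals[(contig, pos)] += ac
    return {
        (contig, pos, allele): ac / totals[(contig, pos)]
        for contig, pos, allele, ac in rows
        if totals[(contig, pos)] > 0
    }


# Placeholder allele labels; the real normalized encoding is not shown here.
rows = [
    ("chr1", 99, "allele_a", 6),
    ("chr1", 99, "allele_b", 2),
    ("chr2", 10, "allele_c", 4),
]
iaf = internal_allele_frequency(rows)
print(iaf[("chr1", 99, "allele_a")])  # 0.75
print(iaf[("chr1", 99, "allele_b")])  # 0.25
```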

Allele count array

The allele_count array holds counts of unique chrom-pos-ref-alt variants as described in the data model section.

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	2D
Cell order	Row-major
Tile order	Row-major

Dimensions

The dimensions in the schema are:

Dimension Name	TileDB Datatype	Corresponding VCF Field
`contig`	`TILEDB_STRING_ASCII`	`CHROM` from the VCF file
`pos`	`uint32`	`POS` from the VCF file, 0-indexed

Attributes

For each contig-pos location, the following attributes are stored:

Attribute Name	TileDB Datatype	Description
`ref`	`TILEDB_STRING_ASCII`	`REF` from the VCF file
`alt`	`TILEDB_STRING_ASCII`	`ALT` from the VCF file
`filter`	`TILEDB_STRING_ASCII`	`FILTER` from the VCF file
`gt`	`TILEDB_STRING_ASCII`	Normalized `FORMAT/GT` from the VCF file
`count`	`int32`	Number of records with the same attribute values

The `filter` and `gt` attributes allow further filtering of the allele count data. The value of `gt` is normalized to one of the following values:

1 - Alternate allele (haploid call)
.,1 - Heterozygous with one missing allele
0,1 - Heterozygous
1,1 - Homozygous alternate
1,2 - Multiallelic heterozygous alternate
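The following sketch shows the kind of filtering these two attributes make possible, using in-memory rows shaped like the table above. It assumes `filter` holds the filter name as text (for example `PASS`); the function name and sample rows are hypothetical.

```python
def count_variants(rows, filters=("PASS",), genotypes=("0,1", "1,1")):
    """Sum `count` per (contig, pos, ref, alt) over rows whose
    `filter` and normalized `gt` values are in the allowed sets."""
    totals = {}
    for row in rows:
        if row["filter"] in filters and row["gt"] in genotypes:
            key = (row["contig"], row["pos"], row["ref"], row["alt"])
            totals[key] = totals.get(key, 0) + row["count"]
    return totals


rows = [
    {"contig": "chr1", "pos": 99, "ref": "A", "alt": "G",
     "filter": "PASS", "gt": "0,1", "count": 12},
    {"contig": "chr1", "pos": 99, "ref": "A", "alt": "G",
     "filter": "PASS", "gt": "1,1", "count": 3},
    {"contig": "chr1", "pos": 99, "ref": "A", "alt": "G",
     "filter": "LowQual", "gt": "0,1", "count": 5},
]

passing = count_variants(rows)
hom_alt_only = count_variants(rows, genotypes=("1,1",))
print(passing)       # {('chr1', 99, 'A', 'G'): 15}
print(hom_alt_only)  # {('chr1', 99, 'A', 'G'): 3}
```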

Sample stats array

The sample_stats array holds variant summary statistics for each sample as described in the data model section.

Basic schema parameters

Parameter	Value
Array type	Sparse
Rank	1D
Cell order	Row-major
Tile order	Row-major

Dimensions

The dimensions in the schema are:

Dimension Name	TileDB Datatype	Corresponding VCF Field
`sample`	`TILEDB_STRING_ASCII`	Sample name

Attributes

For each sample, the following attributes are stored:

Attribute Name	TileDB Datatype	Description
`dp_sum`	`uint64`	Read depth sum
`dp_sum2`	`uint64`	Sum of squared read depths (for stddev aggregation)
`dp_count`	`uint64`	Read depth counts
`dp_min`	`uint64`	Read depth minimum value
`dp_max`	`uint64`	Read depth maximum value
`gq_sum`	`uint64`	Genotype quality sum
`gq_sum2`	`uint64`	Sum of squared genotype qualities (for stddev aggregation)
`gq_count`	`uint64`	Genotype quality counts
`gq_min`	`uint64`	Genotype quality minimum value
`gq_max`	`uint64`	Genotype quality maximum value
`n_records`	`uint64`	Number of records
`n_called`	`uint64`	Number of calls
`n_not_called`	`uint64`	Number of missing calls
`n_hom_ref`	`uint64`	Number of homozygous reference calls
`n_het`	`uint64`	Number of heterozygous calls
`n_singleton`	`uint64`	Number of singletons
`n_snp`	`uint64`	Number of SNPs
`n_insertion`	`uint64`	Number of insertions
`n_deletion`	`uint64`	Number of deletions
`n_transition`	`uint64`	Number of transitions
`n_transversion`	`uint64`	Number of transversions
`n_star`	`uint64`	Number of star alleles
`n_multiallelic`	`uint64`	Number of multiallelic records
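Storing a sum, a sum of squares, and a count is enough to recover the mean and standard deviation without revisiting the records. The sketch below assumes that `dp_sum2` is the sum of the squared read depths and computes a population standard deviation; the function name is hypothetical.

```python
import math


def mean_and_stddev(total, total_sq, count):
    """Population mean and standard deviation from running sums.

    total    -> e.g. dp_sum
    total_sq -> e.g. dp_sum2 (assumed to be the sum of squared values)
    count    -> e.g. dp_count
    """
    if count == 0:
        return None, None
    mean = total / count
    # Clamp at zero to absorb tiny negative values from rounding.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Read depths 10, 20, 30 give dp_sum=60, dp_sum2=1400, dp_count=3.
mean, std = mean_and_stddev(60, 1400, 3)
print(mean)           # 20.0
print(round(std, 3))  # 8.165
```

The same three values can be added across samples before calling the function, which is what makes them convenient for aggregation.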

Configurable parameters

During array creation, you can specify different array-related parameters, including the following:

Array data tile capacity (default 10,000).
The “anchor gap” size (default 1,000).
The list of INFO and FMT fields to store as explicit array attributes (default is none).

Once chosen, these parameters cannot be changed.

During sample ingestion, you can specify the sample batch size (default 10).

The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, you should perform adequate testing to determine good values for these parameters before ingesting a large amount of data into an array.
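One way to keep track of the creation-time choices while testing is a small immutable record, sketched below. The class and field names are hypothetical and are not part of any TileDB-VCF API; the defaults are the ones listed above, and the frozen dataclass mirrors the rule that these values cannot change after creation.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class CreationParams:
    tile_capacity: int = 10_000    # array data tile capacity
    anchor_gap: int = 1_000        # anchor gap size
    extra_attributes: tuple = ()   # INFO/FORMAT fields stored explicitly


# Candidate configuration to benchmark before a large ingestion.
params = CreationParams(extra_attributes=("info_DP", "fmt_GT"))
print(params)

# Attempting to change a value after creation is rejected.
try:
    params.anchor_gap = 500
    rejected = False
except FrozenInstanceError:
    rejected = True
print(rejected)  # True
```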

Putting it all together

To summarize, three main entities exist in this data model:

The variant data array (3D sparse)
The general metadata, stored in the variant data array as metadata
The VCF header array (1D sparse)

Three arrays for variant statistics:

The variant stats array (2D sparse)
The allele count array (2D sparse)
The sample stats array (1D sparse)

Two optional arrays added by scalable ingestion on TileDB Cloud:

The manifest array (1D sparse)
The log array (1D sparse)

These components form the “TileDB-VCF dataset.” Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:

<dataset_uri>/
  |_ allele_count/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ data/
      |_ __schema
      |_ __meta/
            |_ <general-metadata-here>
      ... <other array directories/fragments and files>
  |_ log/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ manifest/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ sample_stats/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ variant_stats/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ vcf_headers/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ __tiledb_group.tdb

The root of the dataset, <dataset_uri>, is a TileDB group, and all of the arrays described above are members of that group.
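For a dataset stored on a local filesystem, the layout above can be inspected with a few lines of standard-library Python. This is a sketch only: it treats `data` and `vcf_headers` as required and the remaining arrays as optional, which is a reading of the "two or more arrays" wording rather than a documented rule, and it does not apply to object-store URIs. The function name `describe_dataset` is hypothetical.

```python
import tempfile
from pathlib import Path

REQUIRED = ("data", "vcf_headers")  # assumed to always be present
OPTIONAL = ("allele_count", "variant_stats", "sample_stats", "manifest", "log")


def describe_dataset(root):
    """Report which member arrays exist under a local dataset directory."""
    root = Path(root)
    present = {
        name for name in REQUIRED + OPTIONAL
        if (root / name / "__schema").exists()
    }
    return {
        "is_group": (root / "__tiledb_group.tdb").exists(),
        "present": sorted(present),
        "missing_required": [n for n in REQUIRED if n not in present],
    }


# Demo on a mock directory tree that imitates the structure shown above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data", "vcf_headers", "variant_stats"):
        (Path(tmp) / name / "__schema").mkdir(parents=True)
    (Path(tmp) / "__tiledb_group.tdb").touch()
    report = describe_dataset(tmp)

print(report)
```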