Storage Format Spec
A TileDB-VCF dataset is composed of a group of two or more separate TileDB arrays:
- Data array: A 3D sparse array for the actual genomic variants and associated fields/attributes
- VCF header array: A 1D sparse array for the metadata stored in each single-sample VCF header
- Manifest: A 1D sparse array holding information about the VCF files ingested into the dataset
- Log: A 1D sparse array holding log information from the ingestion tasks
- Variant stats array: A 2D sparse array holding data used to compute internal allele frequency
- Allele count array: A 2D sparse array holding counts of unique chrom-pos-ref-alt variants in the dataset
- Sample stats array: A 1D sparse array holding variant summary statistics for each sample
Data array
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 3D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
The dimensions in the schema are:
Dimension Name | TileDB Datatype | Corresponding VCF Field |
---|---|---|
contig | TILEDB_STRING_ASCII | CHROM |
start_pos | uint32_t | VCF POS plus TileDB anchors |
sample | TILEDB_STRING_ASCII | Sample name |
The coordinates of the 3D array are the contig along the first dimension, the chromosomal start position of the variant along the second dimension, and the sample name along the third dimension.
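The start_pos dimension holds the VCF POS values plus anchors, and this section does not spell out where anchors are placed. As a rough illustration only, the sketch below assumes that an anchor is an extra cell placed every anchor_gap bases inside a record that spans more than anchor_gap bases. The helper name `anchor_positions` and the placement rule are assumptions, not the library's documented behavior.

```python
from typing import List


def anchor_positions(real_start_pos: int, end_pos: int, anchor_gap: int = 1000) -> List[int]:
    """Hypothetical helper: start_pos values of the anchor cells for one record.

    Assumes one anchor every `anchor_gap` bases after the real start position,
    up to and including the record's end position.
    """
    if anchor_gap <= 0:
        raise ValueError("anchor_gap must be positive")
    return list(range(real_start_pos + anchor_gap, end_pos + 1, anchor_gap))


# A record spanning 100..3500 with the default gap of 1,000
print(anchor_positions(100, 3500))
```

Under this assumption, a short record gets no anchors at all, and each anchor cell would carry the original position in the real_start_pos attribute so duplicates can be recognized on read.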
Attributes
Each field in a single-sample VCF record has a corresponding attribute in the schema.
Attribute Name | TileDB Datatype | Description |
---|---|---|
end_pos | uint32_t | VCF END position of VCF records |
qual | float | VCF QUAL field |
alleles | var<char> | CSV list of REF and ALT VCF fields |
id | var<char> | VCF ID field |
filter_ids | var<int32_t> | Vector of integer IDs of entries in the FILTER VCF field |
real_start_pos | uint32_t | VCF POS (no anchors) |
info | var<uint8_t> | Byte blob containing any INFO fields that are not stored as explicit attributes |
fmt | var<uint8_t> | Byte blob containing any FMT fields that are not stored as explicit attributes |
info_* | var<uint8_t> | One or more attributes storing specific VCF INFO fields (e.g. info_DP, info_MQ, etc.) |
fmt_* | var<uint8_t> | One or more attributes storing specific VCF FORMAT fields (e.g. fmt_GT, fmt_MIN_DP, etc.) |
The info_* and fmt_* attributes allow individual INFO or FMT VCF fields to be extracted into explicit array attributes. This can be beneficial if your queries frequently access only a subset of the INFO or FMT fields, as no unrelated data then needs to be fetched from storage.
During array creation, you can choose which fields to extract as explicit array attributes.
Any extra INFO or FMT fields not extracted as explicit array attributes are stored in the byte blob attributes, info and fmt.
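The split between explicit attributes and the byte blob can be pictured with a small sketch. It only models the routing decision for INFO fields; the function name `split_info_fields` and the use of plain strings for values are illustrative assumptions, and the real byte-level encoding of the blob is not shown here.

```python
from typing import Dict, List, Tuple


def split_info_fields(
    info: Dict[str, str], extra_attributes: List[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hypothetical helper: route INFO fields to explicit attributes or the blob."""
    explicit: Dict[str, str] = {}
    blob: Dict[str, str] = {}
    for name, value in info.items():
        attr = f"info_{name}"  # explicit attributes use the info_ prefix
        if attr in extra_attributes:
            explicit[attr] = value
        else:
            blob[name] = value  # everything else stays in the generic info blob
    return explicit, blob


explicit, blob = split_info_fields(
    {"DP": "35", "MQ": "60", "AF": "0.5"},
    extra_attributes=["info_DP", "fmt_GT"],
)
print(explicit)  # fields readable without touching the blob
print(blob)      # fields that require reading the info attribute
```

A query that only needs DP would then read the info_DP attribute and skip the blob entirely, which is the benefit described above.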
Metadata
The following metadata values are updated during array creation, and are used during the export phase:
- anchor_gap - Anchor gap value
- extra_attributes - List of INFO or FMT field names that are stored as explicit array attributes
- version - Array schema version
The metadata is stored as “array metadata” in the sparse data array.
When ingesting samples, the sample header must be identical for all samples with respect to the contig mappings. That means all samples must have the exact same set of contigs listed in the VCF header.
VCF headers array
The vcf_headers array stores the original text of every ingested VCF header in order to:
- Ensure the original VCF file can be fully recovered for any given sample.
- Reconstruct an htslib header instance when reading from the dataset, which is used for operations such as mapping a filter ID back to the filter string.
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 1D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
Dimension Name | TileDB Datatype | Description |
---|---|---|
sample | TILEDB_STRING_ASCII | Sample name |
Attributes
Attribute Name | TileDB Datatype | Description |
---|---|---|
header | var<char> | Original text of the VCF header |
Manifest array
The manifest array is an optional array added by scalable ingestion on TileDB Cloud. The array is used to build a list of VCF URIs sorted by sample name and to keep track of the VCF files ingested in the dataset.
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 1D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
The dimensions in the schema are:
Dimension Name | TileDB Datatype | Corresponding VCF Field |
---|---|---|
sample_name | TILEDB_STRING_ASCII | Sample name |
Attributes
For each sample, the following attributes are stored:
Attribute Name | TileDB Datatype | Description |
---|---|---|
status | TILEDB_STRING_ASCII | Status of VCF file check |
vcf_uri | TILEDB_STRING_ASCII | VCF file URI |
vcf_bytes | uint64 | Size of the original VCF file in bytes |
index_uri | TILEDB_STRING_ASCII | VCF index file URI |
index_bytes | uint64 | Size of the original VCF index file in bytes |
records | uint64 | Number of records in the VCF file |
The status attribute stores the result of the VCF file check, which looks for missing sample names, multiple samples in one VCF file, duplicate sample names in a batch, and missing or bad index files.
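To make the manifest's role concrete, the sketch below builds a list of VCF URIs sorted by sample name from rows shaped like the attributes above. The rows are plain dictionaries standing in for array cells, and the status value "ok" is an assumption, since this section does not list the possible status strings.

```python
from typing import Dict, List


def sorted_vcf_uris(manifest_rows: List[Dict], ok_status: str = "ok") -> List[str]:
    """Hypothetical helper: URIs of files that passed the check, sorted by sample name."""
    passed = [row for row in manifest_rows if row["status"] == ok_status]
    passed.sort(key=lambda row: row["sample_name"])
    return [row["vcf_uri"] for row in passed]


rows = [
    {"sample_name": "HG002", "status": "ok", "vcf_uri": "s3://bucket/b.vcf.gz"},
    {"sample_name": "HG001", "status": "ok", "vcf_uri": "s3://bucket/a.vcf.gz"},
    {"sample_name": "HG003", "status": "missing index", "vcf_uri": "s3://bucket/c.vcf.gz"},
]
print(sorted_vcf_uris(rows))
```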
Log array
The log array is an optional array added by scalable ingestion on TileDB Cloud. It provides a flexible time-series array for storing application-specific events. In the case of TileDB-VCF scalable ingestion, the log array stores information about the ingestion process, which can be used for debugging, monitoring, and reporting.
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 1D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
The dimensions in the schema are:
Dimension Name | TileDB Datatype | Description |
---|---|---|
time_ms | uint64 | Timestamp for the log event |
Attributes
For each log event, the following attributes are stored:
Attribute Name | TileDB Datatype | Description |
---|---|---|
id | TILEDB_STRING_ASCII | Log event ID |
op | TILEDB_STRING_ASCII | Log event operation |
data | TILEDB_STRING_ASCII | Log event data |
extra | TILEDB_STRING_ASCII | Log event extra data |
Variant stats array
The variant_stats array holds data used to compute internal allele frequency (IAF) as described in the data model section.
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 2D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
The dimensions in the schema are:
Dimension Name | TileDB Datatype | Corresponding VCF Field |
---|---|---|
contig | TILEDB_STRING_ASCII | CHROM from the VCF file |
pos | uint32 | POS from the VCF file, 0-indexed |
Attributes
For each contig-pos location, the following attributes are stored:
Attribute Name | TileDB Datatype | Description |
---|---|---|
allele | TILEDB_STRING_ASCII | Normalized allele value |
ac | int32 | Allele count |
n_hom | int32 | Number of homozygous calls |
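This section does not give the IAF formula, so the following is only a sketch under a stated assumption: the frequency of an allele is its ac divided by the sum of ac over all alleles stored at the same contig-pos location. The function name and the dictionary-shaped rows are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def internal_allele_frequency(rows: List[Dict]) -> Dict[Tuple[str, int, str], float]:
    """Hypothetical helper: ac / (sum of ac at the same contig-pos), per allele."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for row in rows:
        totals[(row["contig"], row["pos"])] += row["ac"]

    result: Dict[Tuple[str, int, str], float] = {}
    for row in rows:
        total = totals[(row["contig"], row["pos"])]
        key = (row["contig"], row["pos"], row["allele"])
        result[key] = row["ac"] / total if total else 0.0
    return result


rows = [
    {"contig": "chr1", "pos": 99, "allele": "ref", "ac": 6, "n_hom": 2},
    {"contig": "chr1", "pos": 99, "allele": "A,T", "ac": 2, "n_hom": 0},
]
iaf = internal_allele_frequency(rows)
print(iaf)
```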
Allele count array
The allele_count array holds counts of unique chrom-pos-ref-alt variants as described in the data model section.
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 2D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
The dimensions in the schema are:
Dimension Name | TileDB Datatype | Corresponding VCF Field |
---|---|---|
contig | TILEDB_STRING_ASCII | CHROM from the VCF file |
pos | uint32 | POS from the VCF file, 0-indexed |
Attributes
For each contig-pos location, the following attributes are stored:
Attribute Name | TileDB Datatype | Description |
---|---|---|
ref | TILEDB_STRING_ASCII | REF from the VCF file |
alt | TILEDB_STRING_ASCII | ALT from the VCF file |
filter | TILEDB_STRING_ASCII | FILTER from the VCF file |
gt | TILEDB_STRING_ASCII | Normalized FORMAT/GT from the VCF file |
count | int32 | Number of records with the same attribute values |
The filter and gt attributes allow further filtering of the allele count data. The value of gt is normalized to one of the following values:
- 1 - Homozygous alternate (diploid)
- .,1 - Heterozygous with one missing allele
- 0,1 - Heterozygous
- 1,1 - Homozygous alternate
- 1,2 - Multiallelic heterozygous alternate
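The list above gives the normalized values but not the rule that produces them. A minimal sketch that is consistent with those values, assuming normalization means dropping the phasing separator, sorting the allele indexes with a missing allele first, and joining with commas, might look like this. The function `normalize_gt` is hypothetical, and the actual implementation may differ, for example in how allele indexes are renumbered.

```python
import re


def normalize_gt(gt: str) -> str:
    """Hypothetical helper: '1|0' -> '0,1', './1' -> '.,1', '2/1' -> '1,2'."""
    alleles = re.split(r"[/|]", gt.strip())  # accept phased and unphased separators
    # Missing alleles (".") sort before any numeric allele index.
    alleles.sort(key=lambda a: -1 if a == "." else int(a))
    return ",".join(alleles)


for raw in ("1|0", "1/1", "2/1", "1/.", "1"):
    print(raw, "->", normalize_gt(raw))
```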
Sample stats array
The sample_stats array holds variant summary statistics for each sample as described in the data model section.
Basic schema parameters
Parameter | Value |
---|---|
Array type | Sparse |
Rank | 1D |
Cell order | Row-major |
Tile order | Row-major |
Dimensions
The dimensions in the schema are:
Dimension Name | TileDB Datatype | Corresponding VCF Field |
---|---|---|
sample | TILEDB_STRING_ASCII | Sample name |
Attributes
For each sample, the following attributes are stored:
Attribute Name | TileDB Datatype | Description |
---|---|---|
dp_sum | uint64 | Read depth sum |
dp_sum2 | uint64 | Read depth sum squared (for stddev aggregation) |
dp_count | uint64 | Read depth counts |
dp_min | uint64 | Read depth minimum value |
dp_max | uint64 | Read depth maximum value |
gq_sum | uint64 | Genotype quality sum |
gq_sum2 | uint64 | Genotype quality sum squared (for stddev aggregation) |
gq_count | uint64 | Genotype quality counts |
gq_min | uint64 | Genotype quality minimum value |
gq_max | uint64 | Genotype quality maximum value |
n_records | uint64 | Number of records |
n_called | uint64 | Number of calls |
n_not_called | uint64 | Number of missing calls |
n_hom_ref | uint64 | Number of homozygous reference calls |
n_het | uint64 | Number of heterozygous calls |
n_singleton | uint64 | Number of singletons |
n_snp | uint64 | Number of SNPs |
n_insertion | uint64 | Number of insertions |
n_deletion | uint64 | Number of deletions |
n_transition | uint64 | Number of transitions |
n_transversion | uint64 | Number of transversions |
n_star | uint64 | Number of star alleles |
n_multiallelic | uint64 | Number of multiallelic records |
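The sum, sum2, and count attributes are the kind of values from which a mean and standard deviation can be derived without revisiting individual records. The sketch below assumes that dp_sum2 and gq_sum2 hold the sum of squared values, which is the usual reading of "for stddev aggregation" but is not stated explicitly here, and it computes the population standard deviation. Both helper names are hypothetical.

```python
import math
from typing import Optional, Tuple


def merge_stats(a: Tuple[int, int, int], b: Tuple[int, int, int]) -> Tuple[int, ...]:
    """Combine two partial (sum, sum2, count) triples by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


def mean_and_stddev(total: int, total_sq: int, count: int) -> Optional[Tuple[float, float]]:
    """Mean and population stddev from a sum, a sum of squares, and a count."""
    if count == 0:
        return None
    mean = total / count
    # Var(X) = E[X^2] - E[X]^2; clamp at zero to absorb floating-point error.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Depths 10 and 20 from one batch, depth 30 from another
merged = merge_stats((30, 500, 2), (30, 900, 1))
print(mean_and_stddev(*merged))
```

Because the three values combine by simple addition, partial results from separate ingestion batches can be merged in any order before the final statistic is computed.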
Configurable parameters
During array creation, you can specify different array-related parameters, including the following:
- Array data tile capacity (default 10,000).
- The “anchor gap” size (default 1,000).
- The list of INFO and FMT fields to store as explicit array attributes (default is none).
Once chosen, these parameters cannot be changed.
During sample ingestion, the user can specify the sample batch size (the default value is 10).
The above parameters may impact read and write performance, as well as the size of the persisted array. Therefore, you should perform adequate testing to determine good values for these parameters before ingesting a large amount of data into an array.
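One way to keep these settings straight while testing is to model them in code. The sketch below is a plain-Python illustration, not the TileDB-VCF API: the class and field names are hypothetical, and only the default values come from the text above. The creation-time parameters are frozen to mirror the rule that they cannot be changed once chosen.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import List


@dataclass(frozen=True)
class CreateParams:
    """Hypothetical container for parameters fixed at array creation."""
    tile_capacity: int = 10_000
    anchor_gap: int = 1_000
    extra_attributes: List[str] = field(default_factory=list)


@dataclass
class IngestParams:
    """Hypothetical container for parameters chosen per ingestion run."""
    sample_batch_size: int = 10


params = CreateParams(extra_attributes=["info_DP", "fmt_GT"])
try:
    params.anchor_gap = 500  # rejected: the dataclass is frozen
except FrozenInstanceError:
    print("creation parameters are fixed once chosen")
```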
Putting it all together
To summarize, three main entities exist in this data model:
- The variant data array (3D sparse)
- The general metadata, stored in the variant data array as metadata
- The VCF header array (1D sparse)
Three arrays for variant statistics:
- The variant stats array (2D sparse)
- The allele count array (2D sparse)
- The sample stats array (1D sparse)
Two optional entries added by scalable ingestion on TileDB Cloud:
- The manifest array (1D sparse)
- The log array (1D sparse)
These components form the “TileDB-VCF dataset.” Expressed as a directory hierarchy, a TileDB-VCF dataset has the following structure:
<dataset_uri>/
  |_ allele_count/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ data/
      |_ __schema
      |_ __meta/
          |_ <general-metadata-here>
      ... <other array directories/fragments and files>
  |_ log/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ manifest/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ sample_stats/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ variant_stats/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ vcf_headers/
      |_ __schema
      ... <other array directories/fragments and files>
  |_ __tiledb_group.tdb
The root of the dataset, <dataset_uri>, is a TileDB group, and all of the arrays described above are members of the dataset group.
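For a dataset stored on a local filesystem, the layout above can be checked with a few lines of standard-library Python. This is a sketch only: it looks at directory names rather than asking TileDB for the group members, it does not apply to object-store URIs, and the helper name `missing_arrays` is hypothetical. The manifest and log arrays are treated as optional, as described above.

```python
from pathlib import Path
from typing import List

# Member arrays in the order shown in the directory listing
EXPECTED_ARRAYS = (
    "allele_count",
    "data",
    "log",
    "manifest",
    "sample_stats",
    "variant_stats",
    "vcf_headers",
)
OPTIONAL_ARRAYS = {"log", "manifest"}  # added only by scalable ingestion


def missing_arrays(dataset_dir: str) -> List[str]:
    """Hypothetical helper: non-optional member arrays with no directory on disk."""
    root = Path(dataset_dir)
    return [
        name
        for name in EXPECTED_ARRAYS
        if name not in OPTIONAL_ARRAYS and not (root / name).is_dir()
    ]
```

Calling `missing_arrays("/path/to/dataset")` returns an empty list when every non-optional array directory is present.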