Basic Ingestion

Learn how to ingest a small batch of gVCFs using TileDB-VCF.

How to run this tutorial

You can run this tutorial in two ways:

Locally on your machine.
On TileDB Cloud.

Because TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, since that requires no installation or deployment.

This tutorial guides you through a short example of ingesting a small batch of gVCFs using TileDB-VCF. This approach is appropriate for small datasets (under 10 samples).

Setup

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.

Python

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/basic_ingestion")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2
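
The cleanup step above is repeated later in this tutorial, so it can be convenient to wrap it in a small helper. The following is a minimal sketch that assumes the dataset lives on local storage. The name reset_local_dataset is hypothetical and is not part of TileDB-VCF.

```python
import os
import shutil
import tempfile


def reset_local_dataset(uri):
    """Delete a local dataset directory if it exists.

    Returns True when a directory was removed, False otherwise.
    Hypothetical helper: it handles local paths only, not remote URIs.
    """
    path = os.path.expanduser(uri)
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a throwaway directory rather than a real dataset
demo_dir = tempfile.mkdtemp(prefix="basic_ingestion_demo_")
first_call = reset_local_dataset(demo_dir)   # directory exists, so it is removed
second_call = reset_local_dataset(demo_dir)  # nothing left to remove
print(first_call, second_call)
```

Returning a boolean makes it easy to log whether a stale dataset was actually present before a fresh ingestion.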

Default ingestion

Specify the samples to be ingested, which are readily available on a TileDB-owned public S3 bucket.

Python

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris

['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']
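
Before ingesting, it can help to check the URI list for accidental duplicates and to note which sample each file is expected to hold. The sketch below assumes that the part of each file name before the first underscore matches the sample name recorded in the gVCF header. That assumption is consistent with the sample names queried later in this tutorial, but it is not guaranteed for other data. The helper expected_sample_name is hypothetical.

```python
import posixpath

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]


def expected_sample_name(uri):
    """Guess the sample name from the file name (assumption, see above)."""
    file_name = posixpath.basename(uri)
    return file_name.split("_", 1)[0]


expected_names = [expected_sample_name(u) for u in sample_uris]

# Names that occur more than once would indicate a repeated input file
duplicates = {n for n in expected_names if expected_names.count(n) > 1}

print(expected_names)
print("duplicates:", duplicates)
```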

Next, create a TileDB-VCF dataset and ingest the samples in it.

Python

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

The above creates a folder at the path stored in vcf_uri (in this tutorial, ~/basic_ingestion) on your local storage. You can inspect its contents by running !tree {vcf_uri} in a notebook cell, and visit the Storage Format Specification section for more details on what each subfolder and file represents.

Tip

You can adjust the threads and total_memory_budget_mb parameters of ingest_samples to fine-tune ingestion performance.
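
As a rough illustration of the tip above, the sketch below builds the two keyword arguments and passes them to ingest_samples. The helper names, the 2048 MB default, and the CPU-count fallback are illustrative assumptions, not recommended values. The sketch also assumes the dataset was already created with create_dataset.

```python
import os


def ingestion_settings(total_memory_budget_mb=2048, threads=None):
    """Return keyword arguments for Dataset.ingest_samples.

    The defaults here are placeholders chosen for illustration only.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return {"threads": threads, "total_memory_budget_mb": total_memory_budget_mb}


def ingest_with_settings(vcf_uri, sample_uris, **settings):
    """Ingest samples into an existing dataset using the settings above."""
    import tiledbvcf  # imported here so the settings helper works without it

    ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")
    ds.ingest_samples(sample_uris=sample_uris, **ingestion_settings(**settings))


settings = ingestion_settings(total_memory_budget_mb=1024, threads=4)
print(settings)
```

A call such as ingest_with_settings(vcf_uri, sample_uris, threads=4) would then run the ingestion with the chosen values. Suitable values depend on the machine and the size of the input files.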

Read some data to confirm that the samples were properly ingested in the dataset.

Python

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

	sample_name	contig	pos_start	pos_end	alleles	fmt_GT
0	HG00096	chr21	8220186	8220206	[TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...	[0, 1]
1	HG00097	chr21	8220186	8220194	[TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]	[1, 2]
2	HG00096	chr21	8220187	8220208	[C, <NON_REF>]	[-1, -1]
3	HG00097	chr21	8220187	8220198	[C, <NON_REF>]	[-1, -1]
4	HG00097	chr21	8220199	8220199	[C, <NON_REF>]	[0, 0]
...	...	...	...	...	...	...
7337	HG00097	chr21	8405412	8405523	[T, <NON_REF>]	[0, 0]
7338	HG00096	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
7339	HG00097	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
7340	HG00096	chr21	8405573	8405579	[ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]	[0, 1]
7341	HG00097	chr21	8405573	8405579	[ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...	[0, 1]

7342 rows × 6 columns
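
The regions argument takes strings of the form contig:start-end. The last two rows of the output start at position 8405573 and end at 8405579, which is past the end of the requested region. This suggests that the query returns records that overlap the region, not only records fully contained in it. The sketch below parses a region string and applies such an overlap test. Both parse_region and overlaps are hypothetical helpers, and treating both region ends as inclusive is an assumption.

```python
def parse_region(region):
    """Split 'contig:start-end' into (contig, start, end)."""
    contig, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end)


def overlaps(region, contig, pos_start, pos_end):
    """Return True if the record [pos_start, pos_end] intersects the region."""
    r_contig, r_start, r_end = parse_region(region)
    return contig == r_contig and pos_start <= r_end and pos_end >= r_start


region = "chr21:8220186-8405573"
print(parse_region(region))

# Last record in the output above: it starts on the final base of the region
print(overlaps(region, "chr21", 8405573, 8405579))

# A record that starts one base after the region ends
print(overlaps(region, "chr21", 8405574, 8405579))
```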

TileDB-VCF allows you to see all the queryable fields (i.e., attributes) in the dataset.

Python

# Show all the attributes of the dataset
ds.attributes()  # same as ds.attributes("all")

['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_AD',
 'fmt_AF',
 'fmt_DP',
 'fmt_F1R2',
 'fmt_F2R1',
 'fmt_GP',
 'fmt_GQ',
 'fmt_GT',
 'fmt_ICNT',
 'fmt_MB',
 'fmt_MIN_DP',
 'fmt_PL',
 'fmt_PRI',
 'fmt_PS',
 'fmt_SB',
 'fmt_SPL',
 'fmt_SQ',
 'id',
 'info',
 'info_DB',
 'info_DP',
 'info_END',
 'info_FS',
 'info_FractionInformativeReads',
 'info_LOD',
 'info_MQ',
 'info_MQRankSum',
 'info_QD',
 'info_R2_5P_bias',
 'info_ReadPosRankSum',
 'info_SOR',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']

You can also list only the attributes that come from the INFO or FMT VCF fields.

Python

# Show the info attributes of the dataset
ds.attributes("info")

['info_DB',
 'info_DP',
 'info_END',
 'info_FS',
 'info_FractionInformativeReads',
 'info_LOD',
 'info_MQ',
 'info_MQRankSum',
 'info_QD',
 'info_R2_5P_bias',
 'info_ReadPosRankSum',
 'info_SOR']

# Show the fmt attributes of the dataset
ds.attributes("fmt")

['fmt_AD',
 'fmt_AF',
 'fmt_DP',
 'fmt_F1R2',
 'fmt_F2R1',
 'fmt_GP',
 'fmt_GQ',
 'fmt_GT',
 'fmt_ICNT',
 'fmt_MB',
 'fmt_MIN_DP',
 'fmt_PL',
 'fmt_PRI',
 'fmt_PS',
 'fmt_SB',
 'fmt_SPL',
 'fmt_SQ']

Note

Although .attributes("all") shows all the queryable fields, TileDB-VCF does not materialize every INFO and FMT field as a separate array attribute by default. This can affect query performance. As a rule of thumb, queries on a VCF field are faster when that field is materialized as an array attribute in TileDB-VCF.

To see the materialized VCF fields, you can run .attributes("builtin").

Python

# Show the materialized attributes of the dataset
ds.attributes("builtin")

['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_GT',
 'id',
 'info',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']
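
Comparing the two lists shows which queryable fields are not materialized. The sketch below uses abbreviated copies of the lists printed above so that it runs on its own. With an open dataset, the same comparison could use the results of ds.attributes("all") and ds.attributes("builtin").

```python
# Abbreviated copies of the attribute lists shown above
all_attrs = [
    "alleles", "contig", "fmt", "fmt_AD", "fmt_DP", "fmt_GT",
    "info", "info_DB", "info_DP", "pos_start",
]
builtin_attrs = ["alleles", "contig", "fmt", "fmt_GT", "info", "pos_start"]

# Keep the original order while filtering out the materialized attributes
builtin = set(builtin_attrs)
not_materialized = [a for a in all_attrs if a not in builtin]

print(not_materialized)
```

Fields in this remainder are candidates for explicit materialization if you expect to query them often.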

The next subsection explains how you can control which INFO and FMT fields to materialize as separate attributes.

Make sure to clean up the created dataset before proceeding.

Python

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

Materializing INFO and FMT fields

To speed up queries on certain INFO or FMT fields, you may want to explicitly materialize them as separate attributes in the underlying arrays that TileDB-VCF creates. This can improve both query time and compression, because TileDB adopts a “columnar” format that stores data more efficiently when the values of a field are kept together in a separate location (e.g., a separate file).

You can explicitly materialize INFO or FMT fields as follows.

Python

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset(extra_attrs=["fmt_AD", "info_DB"])

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)
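
The names passed to extra_attrs in this example follow the info_ and fmt_ prefixes used in the attribute lists earlier. A quick check of the names before creating the dataset can catch typos early. The sketch below assumes that prefix convention, and split_extra_attrs is a hypothetical helper, not a TileDB-VCF function.

```python
def split_extra_attrs(extra_attrs):
    """Group requested attribute names by their VCF field prefix."""
    groups = {"info": [], "fmt": [], "unrecognized": []}
    for name in extra_attrs:
        prefix = name.split("_", 1)[0]
        # A valid name has an info or fmt prefix followed by a field name
        if prefix in ("info", "fmt") and "_" in name:
            groups[prefix].append(name)
        else:
            groups["unrecognized"].append(name)
    return groups


groups = split_extra_attrs(["fmt_AD", "info_DB", "DP"])
print(groups)
```

Here the bare name DP lands in the unrecognized group, which signals that it probably needs an info_ or fmt_ prefix.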

You can confirm that fmt_AD and info_DB are both materialized by opening the dataset in read mode and listing the builtin attributes.

Python

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Show the materialized attributes of the dataset
ds.attributes("builtin")

['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_AD',
 'id',
 'info',
 'info_DB',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']

Clean up the created TileDB-VCF dataset.

Python

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)
