Population Genomics Quickstart

This tutorial covers the basics of working with VCF data using TileDB-VCF.
How to run this tutorial

You can run this tutorial in two ways:

  1. Locally on your machine.
  2. On TileDB Cloud.

However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installation or deployment.

This quickstart provides a rapid introduction to TileDB-VCF and its capabilities. It covers how to:

  • Create a VCF dataset and add new human VCF samples to it.
  • Run region and sample queries.
  • Export data back to VCF.

Installation

You should familiarize yourself with Jupyter notebooks to run data exploration and analysis efficiently. You can review Jupyter’s documentation on installing and running notebooks.

The following libraries and programs need to be installed:

  • TileDB-VCF, which provides methods for importing, exporting, and querying variant data
  • TileDB-Py, the Python wrapper of TileDB Embedded (to work with TileDB arrays)
  • NumPy (to handle data with Python)
  • pandas (to view and manipulate dataframes)
  • Apache Arrow (to boost performance via zero-copy data transfer to pandas)

Conda and mamba are the preferred mechanisms for installing TileDB-VCF.

# Enter the following two lines if you are on an M1 Mac (forces Intel osx-64 builds)
export CONDA_SUBDIR=osx-64
conda config --env --set subdir osx-64

# create the conda environment
conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf

# mamba is a faster and more reliable alternative to conda
conda install -c conda-forge mamba

# Install TileDB-Py and TileDB-VCF, along with other useful libraries
mamba install -y -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy
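
After the installation completes, you can verify that the packages import correctly. This check reuses tiledb.version() and tiledbvcf.version, the same version attributes printed later in this tutorial:

# Confirm the environment works by printing the library versions
python -c "import tiledb, tiledbvcf; print(tiledb.version(), tiledbvcf.version)"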

Setup

Start by importing the libraries used in this tutorial, setting the URI of the local VCF dataset into which you will ingest the VCF samples, and cleaning up any older datasets with the same name.

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/my_vcf_dataset")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Clean up combined VCF
combined_uri = os.path.expanduser("~/combined.vcf")
if os.path.exists(combined_uri):
    os.remove(combined_uri)

# Clean up single VCFs
HG00097_uri = os.path.expanduser("~/HG00097.vcf")
if os.path.exists(HG00097_uri):
    os.remove(HG00097_uri)
HG00101_uri = os.path.expanduser("~/HG00101.vcf")
if os.path.exists(HG00101_uri):
    os.remove(HG00101_uri)
TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2

Ingestion

This process ingests the VCF data directly from a public S3 bucket into a local VCF dataset, without needing to download the source VCF files beforehand. The ingestion should take about a minute on a laptop. Most of that time is spent on S3 throughput and on the parsing performed by htslib, the library TileDB-VCF uses internally to read the VCF files.

You will use samples from the latest DRAGEN 3.5.7b re-analysis of the 1000 Genomes dataset. Specify the samples to ingest as follows.

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris
['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']

To ingest these samples locally, run the following:

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

The created VCF dataset is materialized as a directory on your local storage and is modeled as a TileDB group that contains various TileDB arrays. You can run !tree {vcf_uri} to see the file hierarchy inside the VCF dataset directory. For more details on the meaning of these TileDB objects, visit the Data Model section.
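
Alternatively, you can inspect the group programmatically with TileDB-Py. This is a minimal sketch, assuming the tiledb.Group API of recent TileDB-Py releases:

# List the members of the TileDB group that backs the VCF dataset
with tiledb.Group(vcf_uri) as grp:
    for member in grp:
        # Each member reports its object type (array or group) and URI
        print(member.type, member.uri)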

Reading

To access any information from the VCF dataset, you first need to open it in read (r) mode:

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

You can now see what samples the dataset contains.

# Show which samples were ingested
ds.samples()
['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

These are the same as the ones you ingested.

Next, print what “attributes” (i.e., VCF fields) you can access in later queries.

# Show the attributes of the dataset
ds.attributes()
['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_AD',
 'fmt_AF',
 'fmt_DP',
 'fmt_F1R2',
 'fmt_F2R1',
 'fmt_GP',
 'fmt_GQ',
 'fmt_GT',
 'fmt_ICNT',
 'fmt_MB',
 'fmt_MIN_DP',
 'fmt_PL',
 'fmt_PRI',
 'fmt_PS',
 'fmt_SB',
 'fmt_SPL',
 'fmt_SQ',
 'id',
 'info',
 'info_DB',
 'info_DP',
 'info_END',
 'info_FS',
 'info_FractionInformativeReads',
 'info_LOD',
 'info_MQ',
 'info_MQRankSum',
 'info_QD',
 'info_R2_5P_bias',
 'info_ReadPosRankSum',
 'info_SOR',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']
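
The list mixes built-in columns (such as alleles, pos_start, and sample_name) with INFO and FORMAT fields, prefixed with info_ and fmt_ respectively. If you want only one of these categories, you can filter the list; this sketch assumes your TileDB-VCF version supports the attr_type argument of attributes():

# List only the INFO fields (assumes attributes() accepts attr_type)
print(ds.attributes(attr_type="info"))

# List only the FORMAT fields
print(ds.attributes(attr_type="fmt"))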

You can read data with the .read() method, which allows you to select samples, attributes, and genomic regions on which to slice:

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df
sample_name contig pos_start pos_end alleles fmt_GT
0 HG00096 chr21 8220186 8220206 [TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T... [0, 1]
1 HG00097 chr21 8220186 8220194 [TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>] [1, 2]
2 HG00096 chr21 8220187 8220208 [C, <NON_REF>] [-1, -1]
3 HG00097 chr21 8220187 8220198 [C, <NON_REF>] [-1, -1]
4 HG00097 chr21 8220199 8220199 [C, <NON_REF>] [0, 0]
... ... ... ... ... ... ...
7337 HG00097 chr21 8405412 8405523 [T, <NON_REF>] [0, 0]
7338 HG00096 chr21 8405524 8405572 [C, <NON_REF>] [0, 0]
7339 HG00097 chr21 8405524 8405572 [C, <NON_REF>] [0, 0]
7340 HG00096 chr21 8405573 8405579 [ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>] [0, 1]
7341 HG00097 chr21 8405573 8405579 [ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT... [0, 1]

7342 rows × 6 columns
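
The .read() method returns a pandas DataFrame. Because you installed Apache Arrow, you can also fetch the results as an Arrow table, which converts to pandas without a copy; this sketch assumes the read_arrow() method, which mirrors the arguments of read():

# Same query, returned as a pyarrow.Table instead of a pandas DataFrame
table = ds.read_arrow(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
# Zero-copy conversion to pandas
df = table.to_pandas()

Also note that if a result is larger than the configured memory budget, read() returns only the first batch; the TileDB-VCF Python API exposes ds.read_completed() to check whether the read finished and ds.continue_read() to fetch the remaining batches.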

Once you have queried the data of interest, you can manipulate it with pandas in a variety of ways. For example:

# Pivot so that each sample is a column and it displays the GT
df["fmt_GT"] = df["fmt_GT"].apply(lambda x: "/".join(map(str, x)))
df["fmt_GT"] = df["fmt_GT"].apply(lambda x: "./." if x == "-1/-1" else x)
df.pivot(index="pos_start", columns="sample_name", values="fmt_GT")
sample_name HG00096 HG00097
pos_start
8220186 0/1 1/2
8220187 ./. ./.
8220199 NaN 0/0
8220200 NaN 0/0
8220201 NaN 0/0
... ... ...
8405370 0/0 0/0
8405409 0/0 NaN
8405412 NaN 0/0
8405524 0/0 0/0
8405573 0/1 0/1

5824 rows × 2 columns

VCF export

TileDB-VCF ingests VCF samples losslessly, which allows you to export the data back to the original single-sample VCF format or to a combined VCF format.

To export samples into their original (single-sample) VCF files, you can run the following:

# Export two VCF samples
ds.export(
    regions=["chr21:8220186-8405573"],
    samples=["HG00101", "HG00097"],
    output_format="v",
    output_dir=os.path.expanduser("~"),
)

Two single-sample VCF files were created in your home directory. You can use bcftools to confirm that the files were exported correctly.

!bcftools view --no-header {HG00101_uri} | head -10
chr21   8220186 .   TCTCCCTCCCTCCCTCC   T,TCTCCCTCC,CCTCCCTCCCTCCCTCC,<NON_REF> 44.45   PASS    END=8220202;DP=162;MQ=50.71;MQRankSum=-3.588;ReadPosRankSum=-2.546;FractionInformativeReads=0.636;R2_5P_bias=0  GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB   0/1:27,47,24,5,0:0.456,0.233,0.049,0:103:11,21,11,1,0:16,26,13,4,0:40:46,0,39,486,446,494,1263,758,1099,806,743,529,903,1339,1125:255,0,255:40,29:44.453,0.0004173,42.201,450,448.53,450,450,450,450,450,450,450,450,450,450:0,2,5,2,4,5,34.77,36.77,36.77,37.77,34.77,36.77,36.77,69.54,37.77:0,27,4,72:16,11,41,35
chr21   8220187 .   C   <NON_REF>   .   LowGQ   END=8220202 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:318,65:383:0:370:0,0,0:37,0,255:40,30
chr21   8220203 .   C   <NON_REF>   .   PASS    END=8220203 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:334,56:390:46:390:0,46,8757:0,46,255:40,2
chr21   8220204 .   T   <NON_REF>   .   LowGQ   END=8220204 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:228,162:390:0:390:0,0,4267:255,0,255:40,2
chr21   8220205 .   C   <NON_REF>   .   PASS    END=8220205 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:349,37:386:99:386:0,120,1800:0,255,255:40,2
chr21   8220206 .   C   <NON_REF>   .   LowGQ   END=8220206 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:255,133:388:0:388:0,0,3549:255,0,255:40,2
chr21   8220207 .   C   <NON_REF>   .   PASS    END=8220207 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:348,44:392:99:392:0,120,1800:0,255,255:40,0
chr21   8220208 .   T   <NON_REF>   .   LowGQ   END=8220208 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:235,162:397:0:397:0,0,4234:255,0,255:40,0
chr21   8220209 .   C   <NON_REF>   .   PASS    END=8220209 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:360,34:394:99:394:0,120,1800:0,255,255:40,0
chr21   8220210 .   C   <NON_REF>   .   LowGQ   END=8220210 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:320,65:385:0:385:0,0,7583:22,0,255:40,0

Note that, for convenience, you can export entire samples or any set of genomic regions.
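
For example, omitting the regions argument exports all records for the selected samples. A minimal sketch, assuming your TileDB-VCF version treats a missing regions argument as "all regions":

# Export one full sample, with no region restriction
ds.export(
    samples=["HG00097"],
    output_format="v",  # "v" = plain VCF; "z" (bgzipped VCF) is also assumed to work, per bcftools conventions
    output_dir=os.path.expanduser("~"),
)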

To export multiple samples into a single combined VCF file, run:

# Export to combined VCF
ds.export(
    regions=["chr21:8220186-8405573"],
    samples=ds.samples()[0:2],
    merge=True,  # this will create a combined VCF file
    output_format="v",
    output_path=combined_uri,
)

You can confirm that the data has been exported correctly using bcftools:

!bcftools view --no-header {combined_uri} | head -10
chr21   8220186 .   TCTCCCTCCCTCCCTCCCTCC   T,TCTCC,TCTCCCTCC,TCTCCCTCCCTCC,TCTCCCTCCCTCCCTCC,<NON_REF>,CCTCCCTCCCTCCCTCCCTCC   228.26  PASS    R2_5P_bias=0;FractionInformativeReads=0.754;ReadPosRankSum=-0.751;MQ=51.65;DP=461;MQRankSum=0.09;END=8220206    GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB   0/1:45,97,10,33,24,6,0,.:0.451,0.047,0.153,0.112,0.028,0,.:215:18,59,8,18,12,3,0,.:27,38,2,15,12,3,0,.:26:48,0,23,3973,1791,1840,1536,349,2890,398,2171,323,3253,2104,372,3480,977,5679,2614,3195,1025,2130,569,3076,2115,2436,3079,2592,.,.,.,.,.,.,.,.:255,0,255:40,107:46.261,0.010693,26.134,448.53,450,450,450,351.46,450,400.71,450,325.43,448.53,450,374.68,448.53,450,448.53,450,450,450,450,450,450,450,450,450,450,.,.,.,.,.,.,.,.:0,2,5,2,4,5,2,4,4,5,2,4,4,4,5,2,4,4,4,4,5,34.77,36.77,36.77,36.77,36.77,36.77,37.77,.,.,.,.,.,.,.,.:11,34,7,163:24,21,75,95    4/5:5,.,.,.,82,16,0,1:.,.,.,0.788,0.154,0,0.01:104:1,.,.,.,44,6,0,1:4,.,.,.,38,10,0,0:7:231,.,.,.,.,.,.,.,.,.,182,.,.,.,5,868,.,.,.,0,48,1091,.,.,.,212,792,1002,1804,.,.,.,482,1202,1425,531:255,0,255:40,41:228.26,.,.,.,.,.,.,.,.,.,180.71,.,.,.,7.0947,450,.,.,.,0.94329,49.937,450,.,.,.,245.54,450,450,450,.,.,.,450,450,450,450:0,.,.,.,.,.,.,.,.,.,2,.,.,.,5,2,.,.,.,4,5,34.77,.,.,.,36.77,36.77,37.77,34.77,.,.,.,36.77,36.77,69.54,37.77:0,5,0,99:4,1,48,51
chr21   8220187 .   C   <NON_REF>   .   LowGQ   END=8220208 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:289,140:429:0:398:0,0,0:255,0,255:40,108    ./.:247,121:368:0:353:0,0,0:255,0,255:40,42
chr21   8220199 .   C   <NON_REF>   .   PASS    END=8220199 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:328,36:364:99:364:0,120,1800:0,255,255:40,7
chr21   8220200 .   T   <NON_REF>   .   LowGQ   END=8220200 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:203,158:361:0:361:0,0,2749:255,0,255:40,7
chr21   8220201 .   C   <NON_REF>   .   PASS    END=8220201 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:317,48:365:87:365:0,87,8403:0,87,255:40,7
chr21   8220202 .   C   <NON_REF>   .   LowGQ   END=8220202 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   ./.:156,201:357:0:357:0,0,0:255,0,255:40,7
chr21   8220203 .   C   <NON_REF>   .   PASS    END=8220203 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:331,33:364:99:364:0,120,1800:0,255,255:40,6
chr21   8220204 .   T   <NON_REF>   .   LowGQ   END=8220204 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:219,137:356:0:356:0,0,4292:255,0,255:40,6
chr21   8220205 .   C   <NON_REF>   .   PASS    END=8220205 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:316,46:362:99:362:0,120,1800:0,251,255:40,6
chr21   8220206 .   C   <NON_REF>   .   LowGQ   END=8220206 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:207,155:362:0:362:0,0,1420:255,0,255:40,6

Clean up

Clean up at the end by deleting the VCF dataset and the generated VCF files.

# Delete the VCF dataset
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Clean up combined VCF
if os.path.exists(combined_uri):
    os.remove(combined_uri)

# Clean up single VCFs
if os.path.exists(HG00097_uri):
    os.remove(HG00097_uri)
if os.path.exists(HG00101_uri):
    os.remove(HG00101_uri)