You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier , we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial demonstrates how to ingest a small TileDB-VCF dataset and run some simple queries against it.
Setup
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
import tiledb
import tiledbvcf
import shutil
import os.path
# Print library versions
print ("TileDB core version: {} " .format (tiledb.libtiledb.version()))
print ("TileDB-Py version: {} " .format (tiledb.version()))
print ("TileDB-VCF version: {} " .format (tiledbvcf.version))
# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/basic_queries" )
# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
shutil.rmtree(vcf_uri)
TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2
Ingestion
Specify the samples to be ingested, which are readily available on a TileDB-owned public S3 bucket.
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
"HG00096_chr21.gvcf.gz" ,
"HG00097_chr21.gvcf.gz" ,
"HG00099_chr21.gvcf.gz" ,
"HG00100_chr21.gvcf.gz" ,
"HG00101_chr21.gvcf.gz" ,
]
sample_uris = [f" { vcf_bucket} / { s} " for s in samples_to_ingest]
sample_uris
['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']
Next, create a TileDB-VCF dataset and ingest the samples in it.
# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri= vcf_uri, mode= "w" )
# Create empty VCF dataset
ds.create_dataset()
# Ingest samples
ds.ingest_samples(sample_uris= sample_uris)
Reading
Whenever you want to read from a dataset, you need to open it in read mode.
# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri= vcf_uri, mode= "r" )
To get the ingested samples, run the following:
samples = ds.samples()
print (f"Number of samples = { len (samples)} " )
You can see which VCF fields (i.e., attributes) exist in the dataset as follows.
['alleles',
'contig',
'filters',
'fmt',
'fmt_AD',
'fmt_AF',
'fmt_DP',
'fmt_F1R2',
'fmt_F2R1',
'fmt_GP',
'fmt_GQ',
'fmt_GT',
'fmt_ICNT',
'fmt_MB',
'fmt_MIN_DP',
'fmt_PL',
'fmt_PRI',
'fmt_PS',
'fmt_SB',
'fmt_SPL',
'fmt_SQ',
'id',
'info',
'info_DB',
'info_DP',
'info_END',
'info_FS',
'info_FractionInformativeReads',
'info_LOD',
'info_MQ',
'info_MQRankSum',
'info_QD',
'info_R2_5P_bias',
'info_ReadPosRankSum',
'info_SOR',
'pos_end',
'pos_start',
'qual',
'query_bed_end',
'query_bed_line',
'query_bed_start',
'sample_name']
You can read from the dataset specifying a subset of attributes, a subset of samples a specific genomic range.
# Read a chromosome region, and subset on samples and attributes
df = ds.read(
regions= ["chr21:8220186-8405573" ],
samples= ["HG00096" , "HG00097" ],
attrs= ["sample_name" , "contig" , "pos_start" , "pos_end" , "alleles" , "fmt_GT" ],
)
df
0
HG00096
chr21
8220186
8220206
[TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...
[0, 1]
1
HG00097
chr21
8220186
8220194
[TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]
[1, 2]
2
HG00096
chr21
8220187
8220208
[C, <NON_REF>]
[-1, -1]
3
HG00097
chr21
8220187
8220198
[C, <NON_REF>]
[-1, -1]
4
HG00097
chr21
8220199
8220199
[C, <NON_REF>]
[0, 0]
...
...
...
...
...
...
...
7337
HG00097
chr21
8405412
8405523
[T, <NON_REF>]
[0, 0]
7338
HG00096
chr21
8405524
8405572
[C, <NON_REF>]
[0, 0]
7339
HG00097
chr21
8405524
8405572
[C, <NON_REF>]
[0, 0]
7340
HG00096
chr21
8405573
8405579
[ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]
[0, 1]
7341
HG00097
chr21
8405573
8405579
[ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...
[0, 1]
7342 rows × 6 columns
Clean up the created TileDB-VCF dataset.
# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
shutil.rmtree(vcf_uri)