Demonstration of basic usage of TileDB-VCF on TileDB Cloud.
How to run this tutorial
You can run this tutorial only on TileDB Cloud, which has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
This tutorial shows basic usage of TileDB-VCF on TileDB Cloud. It assumes you have already completed the Get Started section.
Start by setting some environment variables and loading them into a TileDB configuration. The setup can look something like this:
import tiledb
import os

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_account = os.environ["TILEDB_ACCOUNT"]
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Get the S3 bucket from the environment variables
s3_bucket = os.environ["S3_BUCKET"]

# Set the TileDB Cloud credentials in the config of the default context.
# This context initialization can be performed only once.
cfg = tiledb.Config(
    {
        "vfs.s3.no_sign_request": "true",  # boosts performance when accessing public S3 buckets
        "rest.token": tiledb_token,
        # or use your username and password (not recommended)
        # "rest.username": tiledb_username,
        # "rest.password": tiledb_password,
    }
)
ctx = tiledb.Ctx(cfg)
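Because the setup above reads its credentials with os.environ[...], a missing variable surfaces as a bare KeyError. The following is a minimal sketch of a pre-flight check that reports every missing variable at once. The helper name missing_env_vars is hypothetical and not part of TileDB; the variable names are the ones used in the setup code above.

```python
import os

# Variable names taken from the setup code above.
REQUIRED_VARS = ("TILEDB_ACCOUNT", "TILEDB_REST_TOKEN", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Running this before building the configuration makes it easier to see which values still need to be exported in your shell or notebook.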
The rest of the tutorial closely follows the Tutorials: Basic Ingestion section. Once your TileDB credentials are set up as shown above, you can create, write, and read any VCF dataset in the same manner.
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on TileDB Cloud), and delete any previously created datasets with the same name. Then specify the URIs of the sample files to ingest.
# Specify the sample URIs
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris
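When you later filter reads by sample, you need sample names rather than file names. As a rough sketch, the names can be guessed from the file names listed above, assuming each file follows the pattern <sample>_<contig>.gvcf.gz. This is only a naming-convention assumption: the authoritative sample name is the one recorded in each VCF header. The helper sample_name_from_file is hypothetical.

```python
def sample_name_from_file(uri_or_file_name):
    """Guess the sample name as the text before the first underscore.

    Assumes file names look like 'HG00096_chr21.gvcf.gz'; the real sample
    name is whatever the VCF header declares.
    """
    file_name = uri_or_file_name.rsplit("/", 1)[-1]  # drop any URI prefix
    return file_name.split("_", 1)[0]


file_names = ["HG00096_chr21.gvcf.gz", "HG00097_chr21.gvcf.gz"]
guessed_names = [sample_name_from_file(f) for f in file_names]
print(guessed_names)
```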
Next, create the TileDB-VCF dataset and ingest the specified samples. The only difference between TileDB Cloud and open-source TileDB-VCF when creating datasets is that the TileDB Cloud URI should be of the form tiledb://<account>/s3://<bucket>/<vcf_dataset>. TileDB Cloud understands that you are trying to create a VCF dataset at the S3 URI s3://<bucket>/<vcf_dataset> and register it under <account>. After you create the dataset, you can access it as tiledb://<account>/<vcf_dataset>.
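The two URI forms described above can be sketched with plain string handling. The helper names registration_uri and access_uri, and the example account, bucket, and dataset names, are hypothetical and used only for illustration; they are not part of the TileDB API.

```python
def registration_uri(account, s3_location, dataset_name):
    """Build tiledb://<account>/s3://<bucket>/<vcf_dataset> (used at creation)."""
    # Accept the bucket with or without the s3:// scheme (an assumption of this sketch).
    if not s3_location.startswith("s3://"):
        s3_location = "s3://" + s3_location
    return f"tiledb://{account}/{s3_location.rstrip('/')}/{dataset_name}"


def access_uri(account, dataset_name):
    """Build tiledb://<account>/<vcf_dataset> (used after registration)."""
    return f"tiledb://{account}/{dataset_name}"


# Hypothetical example values.
print(registration_uri("my_account", "my-bucket", "my_vcf"))
print(access_uri("my_account", "my_vcf"))
```

Keeping the two forms in separate helpers makes it harder to pass the long registration URI to a read call by mistake.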
Warning
The following block may take a long time if you run it from your local machine over a poor internet connection. For best performance, we highly recommend that you run it from a TileDB Cloud notebook.
import tiledbvcf

# Hypothetical placeholder for the dataset name; choose your own.
vcf_name = "my_vcf_dataset"

# NOTE: This is the only special thing about TileDB Cloud when
# creating and registering VCF datasets: the URI should be of the form:
# tiledb://<account>/s3://<bucket>/<vcf_name>
# TileDB Cloud understands that you are trying to create a VCF dataset in
# s3://<bucket>/<vcf_name> and register it under <account>.
# After the VCF dataset is created and registered, it will be accessible
# simply as tiledb://<account>/<vcf_name>
# (The concatenation below assumes S3_BUCKET already includes the s3:// prefix.)
vcf_uri_reg = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + vcf_name

# Open a VCF dataset in write mode.
# Notice you need to pass the TileDB Cloud config for authentication.
ds = tiledbvcf.Dataset(uri=vcf_uri_reg, mode="w", tiledb_config=cfg)

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)
Next, read some data using TileDB-VCF. Again, you access the dataset with the URI tiledb://<account>/<vcf_dataset>.
# After registration, the dataset is accessible as tiledb://<account>/<vcf_name>.
vcf_uri = "tiledb://" + tiledb_account + "/" + vcf_name

# Open the Dataset in read mode.
# Notice you need to pass the TileDB Cloud config for authentication.
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r", tiledb_config=cfg)

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df
      sample_name  contig  pos_start  pos_end  alleles                                            fmt_GT
0     HG00096      chr21   8220186    8220206  [TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...  [0, 1]
1     HG00097      chr21   8220186    8220194  [TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]        [1, 2]
2     HG00096      chr21   8220187    8220208  [C, <NON_REF>]                                     [-1, -1]
3     HG00097      chr21   8220187    8220198  [C, <NON_REF>]                                     [-1, -1]
4     HG00097      chr21   8220199    8220199  [C, <NON_REF>]                                     [0, 0]
...   ...          ...     ...        ...      ...                                                ...
7337  HG00097      chr21   8405412    8405523  [T, <NON_REF>]                                     [0, 0]
7338  HG00096      chr21   8405524    8405572  [C, <NON_REF>]                                     [0, 0]
7339  HG00097      chr21   8405524    8405572  [C, <NON_REF>]                                     [0, 0]
7340  HG00096      chr21   8405573    8405579  [ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]     [0, 1]
7341  HG00097      chr21   8405573    8405579  [ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...  [0, 1]

7342 rows × 6 columns
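The last rows of the output start at the final position of the queried region and extend past it, which suggests that a record is returned when its interval overlaps the region rather than when it is fully contained in it. The sketch below illustrates that overlap test, assuming 1-based, inclusive coordinates as in VCF. It is an illustration of the idea only, not TileDB-VCF's implementation, and the helpers parse_region and overlaps are hypothetical.

```python
def parse_region(region):
    """Split a 'contig:start-end' string into (contig, start, end)."""
    contig, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end)


def overlaps(region, contig, pos_start, pos_end):
    """Return True if [pos_start, pos_end] on contig intersects the region.

    Both intervals are treated as closed (inclusive on both ends).
    """
    r_contig, r_start, r_end = parse_region(region)
    return contig == r_contig and pos_start <= r_end and pos_end >= r_start


region = "chr21:8220186-8405573"
# A record that starts on the last base of the region still overlaps it.
print(overlaps(region, "chr21", 8405573, 8405579))
# A record that starts one base later does not.
print(overlaps(region, "chr21", 8405574, 8405600))
```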
Finally, clean up by deleting the dataset. Deleting it unregisters the VCF dataset from TileDB Cloud and physically deletes all groups and arrays that make up the dataset from storage.