Basic TileDB Cloud

Demonstration of basic usage of TileDB-VCF on TileDB Cloud.

How to run this tutorial

This tutorial runs only on TileDB Cloud, which has a free tier. We strongly recommend that you sign up and run everything there, since that requires no installation or deployment.

This tutorial shows basic usage of TileDB-VCF on TileDB Cloud. It assumes you have already completed the Get Started section.

Start by setting the environment variables that the rest of the tutorial reads. For example:

Python

import os

os.environ["S3_BUCKET"] = "s3://<your_bucket_name>"
os.environ["TILEDB_ACCOUNT"] = "<your_tiledb_account_username>"
# The next step also reads TILEDB_REST_TOKEN, so make sure it is set as well.

Next, fetch the environment variables you created and define the global context:

Python

import os

import tiledb

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_account = os.environ["TILEDB_ACCOUNT"]
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]


# Get the bucket from the environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Build the TileDB config with the S3 and TileDB Cloud settings,
# then create a context from it. Create this context only once and reuse it below.
cfg = tiledb.Config(
    {
        "vfs.s3.no_sign_request": "true",  # boosts performance when accessing public S3 buckets
        "rest.token": tiledb_token,
        # or use your username and password (not recommended)
        # "rest.username": tiledb_username,
        # "rest.password": tiledb_password,
    }
)
ctx = tiledb.Ctx(cfg)
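
The setup code above reads TILEDB_ACCOUNT, TILEDB_REST_TOKEN, and S3_BUCKET directly from os.environ, so a missing variable surfaces as a bare KeyError. The following is a minimal sketch of a fail-fast check you could run first. Note that missing_env_vars is a hypothetical helper written for this illustration, not part of any TileDB library.

```python
import os

# Variables that the setup code above reads from the environment.
REQUIRED_VARS = ("TILEDB_ACCOUNT", "TILEDB_REST_TOKEN", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Hypothetical pre-flight check: report everything that is missing at once.
missing = missing_env_vars()
if missing:
    print("Set these environment variables first: " + ", ".join(missing))
```

Reporting all missing names together avoids fixing one KeyError only to hit the next.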

The rest of the tutorial closely follows the Tutorials: Basic Ingestion section. Once you have set up your TileDB credentials as shown above, you can create, write, and read any VCF dataset in the same manner.

First, import the necessary libraries, set the TileDB-VCF dataset URI (in this tutorial, a tiledb:// URI on TileDB Cloud), and delete any previously created dataset with the same name.

Python

import os.path

import tiledb.cloud
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))
print("TileDB-Cloud-Py version: {}".format(tiledb.cloud.version.version))

# Set array URI
vcf_name = "basic_tiledb_cloud"
vcf_uri = "tiledb://" + tiledb_account + "/" + vcf_name

# Clean up VCF dataset if it already exists
if tiledb.object_type(vcf_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(vcf_uri, recursive=True)

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.3
TileDB-Cloud-Py version: 0.12.17

Specify the VCF samples you would like to ingest.

Python

# Specify the sample URIs
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris

['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']

Next, create the TileDB-VCF dataset and ingest the specified samples. The only difference between TileDB Cloud and open-source TileDB-VCF when creating datasets is that the TileDB Cloud URI should be of the form tiledb://<account>/s3://<bucket>/<vcf_dataset>. TileDB Cloud understands that you are trying to create a VCF dataset at the S3 URI s3://<bucket>/<vcf_dataset> and register it under <account>. After you create the dataset, you can access it as tiledb://<account>/<vcf_dataset>.
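
The two URI forms described above can be sketched as a pair of small string helpers. The names registration_uri and access_uri are hypothetical and exist only for this illustration, and the account and bucket values are placeholders. Stripping a trailing slash from the bucket is an added precaution, not something the tutorial requires.

```python
def registration_uri(account, s3_bucket, dataset_name):
    """Build tiledb://<account>/s3://<bucket>/<vcf_dataset> for creation."""
    return f"tiledb://{account}/{s3_bucket.rstrip('/')}/{dataset_name}"


def access_uri(account, dataset_name):
    """Build tiledb://<account>/<vcf_dataset> for use after registration."""
    return f"tiledb://{account}/{dataset_name}"


# Placeholder values for illustration only.
print(registration_uri("my_account", "s3://my-bucket/", "basic_tiledb_cloud"))
# prints tiledb://my_account/s3://my-bucket/basic_tiledb_cloud
print(access_uri("my_account", "basic_tiledb_cloud"))
# prints tiledb://my_account/basic_tiledb_cloud
```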

Warning

The following block may take a long time if you run it from your local machine over a poor internet connection. For best performance, run it from a TileDB Cloud notebook.

Python

# NOTE: This is the only special thing about TileDB Cloud when
# creating and registering VCF datasets: the URI should be of the form:
# tiledb://<account>/s3://<bucket>/<vcf_name>
# TileDB Cloud understands that you are trying to create a VCF dataset in
# s3://<bucket>/<vcf_name> and register it under <account>.
# After the VCF dataset is created and registered, it will be accessible
# simply as tiledb://<account>/<vcf_name>
vcf_uri_reg = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + vcf_name

# Open a VCF dataset in write mode.
# Notice you need to pass the TileDB Cloud config for authentication.
ds = tiledbvcf.Dataset(uri=vcf_uri_reg, mode="w", tiledb_config=cfg)

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

Next, read some data using TileDB-VCF. Again, you access the dataset with the URI tiledb://<account>/<vcf_dataset>.

Python

# Open the Dataset in read mode.
# Notice you need to pass the TileDB Cloud config for authentication.
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r", tiledb_config=cfg)

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

	sample_name	contig	pos_start	pos_end	alleles	fmt_GT
0	HG00096	chr21	8220186	8220206	[TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...	[0, 1]
1	HG00097	chr21	8220186	8220194	[TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]	[1, 2]
2	HG00096	chr21	8220187	8220208	[C, <NON_REF>]	[-1, -1]
3	HG00097	chr21	8220187	8220198	[C, <NON_REF>]	[-1, -1]
4	HG00097	chr21	8220199	8220199	[C, <NON_REF>]	[0, 0]
...	...	...	...	...	...	...
7337	HG00097	chr21	8405412	8405523	[T, <NON_REF>]	[0, 0]
7338	HG00096	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
7339	HG00097	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
7340	HG00096	chr21	8405573	8405579	[ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]	[0, 1]
7341	HG00097	chr21	8405573	8405579	[ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...	[0, 1]

7342 rows × 6 columns
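
The fmt_GT column above holds allele indexes following the VCF GT convention, where 0 refers to the reference allele and higher numbers index the alternate alleles listed in alleles. As a rough sketch of how you might label these values, the hypothetical helper below assumes that -1 marks a missing call, which is consistent with the output shown but is an assumption rather than documented behavior.

```python
def classify_genotype(gt):
    """Label a genotype given as a sequence of allele indexes, e.g. [0, 1]."""
    alleles = [int(a) for a in gt]
    if not alleles or any(a < 0 for a in alleles):
        return "missing"  # assumption: a negative index means no call
    if all(a == 0 for a in alleles):
        return "hom_ref"  # every allele is the reference
    if len(set(alleles)) == 1:
        return "hom_alt"  # the same alternate allele on every copy
    return "het"  # two or more different alleles


# Genotypes copied from the output above.
for gt in ([0, 1], [1, 2], [-1, -1], [0, 0]):
    print(gt, classify_genotype(gt))
```

Applied to the DataFrame, this could look like df["fmt_GT"].apply(classify_genotype).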

Finally, clean up by deleting the dataset. This command unregisters the VCF dataset from TileDB Cloud and deletes all groups and arrays that comprise it from physical storage.

Python

# Clean up VCF dataset
if tiledb.object_type(vcf_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(vcf_uri, recursive=True)