Add New Samples

Learn how to add new samples to a TileDB-VCF dataset, solving the N+1 problem.

How to run this tutorial

You can run this tutorial in two ways:

Locally on your machine.
On TileDB Cloud.

Because TileDB Cloud has a free tier, we strongly recommend signing up and running everything there, since that requires no installation or deployment.

This tutorial shows how to efficiently add new samples to an existing TileDB-VCF dataset. For details on why TileDB-VCF can add new samples rapidly, see the Key Concepts: N+1 Problem section.

Setup

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.

Python

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print(f"TileDB core version: {tiledb.libtiledb.version()}")
print(f"TileDB-Py version: {tiledb.version()}")
print(f"TileDB-VCF version: {tiledbvcf.version}")

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/adding_new_samples")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2

Initial ingestion

Specify the samples to ingest. They are available in a public S3 bucket owned by TileDB.

Python

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris

['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']

Next, create a TileDB-VCF dataset and ingest the samples into it.

Python

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

Show which samples were ingested.

Python

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Show which samples were ingested
ds.samples()

['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

Add new samples

Adding new samples to an existing TileDB-VCF dataset is almost identical to the initial ingestion. The only difference is that you skip the dataset creation step: open the existing dataset in write mode and ingest the new samples.

First, specify some new samples to ingest:

Python

samples_to_add = [
    "HG00102_chr21.gvcf.gz",
    "HG00103_chr21.gvcf.gz",
    "HG00105_chr21.gvcf.gz",
    "HG00106_chr21.gvcf.gz",
    "HG00107_chr21.gvcf.gz",
]
new_sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_add]
new_sample_uris

['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00102_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00103_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00105_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00106_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00107_chr21.gvcf.gz']
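Before ingesting, it can be useful to confirm that none of the new files correspond to samples already in the dataset. The following is a minimal sketch, not part of the TileDB-VCF API. It assumes the sample ID is the part of the file name before the first underscore, which holds for the file names in this tutorial but is not guaranteed in general, since the sample name actually comes from the VCF header. The helper names `sample_id_from_uri` and `split_new_and_existing` are hypothetical.

```python
from typing import Iterable, List, Tuple


def sample_id_from_uri(uri: str) -> str:
    """Guess the sample ID from a URI such as '.../HG00102_chr21.gvcf.gz'.

    Assumption: the file name starts with the sample ID followed by '_'.
    """
    file_name = uri.rstrip("/").rsplit("/", 1)[-1]
    return file_name.split("_", 1)[0]


def split_new_and_existing(
    candidate_uris: Iterable[str], existing_samples: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split candidate URIs into (not yet ingested, already ingested)."""
    existing = set(existing_samples)
    new_uris, skipped_uris = [], []
    for uri in candidate_uris:
        if sample_id_from_uri(uri) in existing:
            skipped_uris.append(uri)
        else:
            new_uris.append(uri)
    return new_uris, skipped_uris
```

In this tutorial, `existing_samples` would be the list returned by `ds.samples()` on the dataset opened in read mode, and `candidate_uris` would be `new_sample_uris`.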

Next, open the existing TileDB-VCF dataset in write mode and ingest the new samples.

Python

# Open the existing VCF dataset in write mode (no create_dataset() call this time)
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Ingest the new samples
ds.ingest_samples(sample_uris=new_sample_uris)
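When many samples arrive at once, you may prefer to add them in smaller groups rather than in a single call. The sketch below shows one way to split a list of URIs into batches. The `batched` helper is hypothetical and written in plain Python. The usage shown in the trailing comment assumes that calling `ingest_samples` repeatedly on the same write-mode dataset is acceptable for your TileDB-VCF version.

```python
from typing import Iterator, List, Sequence


def batched(uris: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `uris`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(uris), batch_size):
        yield list(uris[start:start + batch_size])


# Hypothetical usage with this tutorial's objects (an assumption, not run here):
#
# for batch in batched(new_sample_uris, batch_size=2):
#     ds.ingest_samples(sample_uris=batch)
```

Smaller batches make it easier to resume after a failure, at the cost of more ingestion calls.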

Check the samples that exist in the TileDB-VCF dataset after the new samples are ingested.

Python

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Show all samples now in the dataset
ds.samples()

['HG00096',
 'HG00097',
 'HG00099',
 'HG00100',
 'HG00101',
 'HG00102',
 'HG00103',
 'HG00105',
 'HG00106',
 'HG00107']
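Rather than checking the list by eye, you can compare it against the samples you expected to find. The following is a minimal sketch in plain Python. `compare_samples` is a hypothetical helper, not a TileDB-VCF function.

```python
from typing import Iterable, List, Tuple


def compare_samples(
    expected: Iterable[str], actual: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) sample names, each sorted alphabetically."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)
    unexpected = sorted(actual_set - expected_set)
    return missing, unexpected
```

Here, `expected` would be the sample IDs from both the initial ingestion and the added batch, and `actual` would be the result of `ds.samples()`. Two empty lists indicate that the dataset contains exactly the expected samples.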

Clean up

Delete the TileDB-VCF dataset created in this tutorial.

Python

# Delete the VCF dataset if it exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)