Deleting Samples

Learn how to delete samples from a TileDB-VCF dataset.

How to run this tutorial

You can run this tutorial in two ways:

Locally on your machine.
On TileDB Cloud.

Because TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there; doing so requires no installation or deployment.

This tutorial guides you through a short example of deleting samples from a TileDB-VCF dataset. Since deleting samples requires a dataset containing samples, some steps are repeated from the Basic Ingestion tutorial.

Setup

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.

Python

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/deleting_samples")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2
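
The clean-up step above relies on os.path.exists and shutil.rmtree, which operate on local paths only. As a minimal sketch, the same logic can be wrapped in a small helper and tried on a throwaway directory. The name reset_local_dataset is hypothetical and not part of TileDB-VCF, and the demo path merely stands in for a real dataset.

```python
import os
import shutil
import tempfile


def reset_local_dataset(path):
    """Delete a local directory tree if it exists.

    Hypothetical helper (not a TileDB-VCF function). Returns True if
    something was removed and False if the path did not exist.
    """
    if os.path.exists(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a temporary directory instead of a real dataset
demo_root = tempfile.mkdtemp()
demo_uri = os.path.join(demo_root, "deleting_samples")
os.makedirs(os.path.join(demo_uri, "data"))

removed_first = reset_local_dataset(demo_uri)   # the directory existed
removed_second = reset_local_dataset(demo_uri)  # nothing left to remove

# Remove the temporary parent directory as well
shutil.rmtree(demo_root)
```

Returning a boolean makes it easy to log whether an earlier run left a dataset behind. For a dataset stored on object storage, this local-only approach would not apply.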

Ingest Samples

Next, create a TileDB-VCF dataset and ingest some samples into it.

Python

# Specify the samples to be ingested
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = ["HG00096_chr21.gvcf.gz", "HG00097_chr21.gvcf.gz"]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

Print a list of samples in the dataset and read some data to show the state of the dataset before deleting samples.

Python

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Print a list of samples in the dataset
print("Samples in the dataset:", ds.samples())

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

Samples in the dataset: ['HG00096', 'HG00097']

	sample_name	contig	pos_start	pos_end	alleles	fmt_GT
0	HG00096	chr21	8220186	8220206	[TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...	[0, 1]
1	HG00097	chr21	8220186	8220194	[TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]	[1, 2]
2	HG00096	chr21	8220187	8220208	[C, <NON_REF>]	[-1, -1]
3	HG00097	chr21	8220187	8220198	[C, <NON_REF>]	[-1, -1]
4	HG00097	chr21	8220199	8220199	[C, <NON_REF>]	[0, 0]
...	...	...	...	...	...	...
7337	HG00097	chr21	8405412	8405523	[T, <NON_REF>]	[0, 0]
7338	HG00096	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
7339	HG00097	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
7340	HG00096	chr21	8405573	8405579	[ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]	[0, 1]
7341	HG00097	chr21	8405573	8405579	[ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...	[0, 1]

7342 rows × 6 columns

Delete Samples

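To delete a sample from the VCF dataset, pass the dataset URI and the sample name to the delete command of the TileDB-VCF CLI. The leading ! in the cell below is the notebook shell escape, so the command runs in a shell rather than in Python.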

Python

sample_to_delete = "HG00096"

!tiledbvcf delete --uri {vcf_uri} --sample-names {sample_to_delete}
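
Outside a notebook, the same CLI call can be launched from a plain Python script. The following is a rough sketch that assumes the tiledbvcf executable is on the PATH. It uses only the flags shown in the command above; the helper names build_delete_command and delete_sample and the path /tmp/deleting_samples are hypothetical, chosen for illustration.

```python
import shlex
import subprocess


def build_delete_command(uri, sample_name):
    """Return the argument list for deleting one sample with the CLI.

    Hypothetical helper; it only mirrors the flags used in this tutorial.
    """
    return ["tiledbvcf", "delete", "--uri", uri, "--sample-names", sample_name]


def delete_sample(uri, sample_name):
    """Run the delete command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_delete_command(uri, sample_name), check=True)


# Build and display the command without running it
# (/tmp/deleting_samples is a placeholder path, not a real dataset)
cmd = build_delete_command("/tmp/deleting_samples", "HG00096")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the dataset path contains spaces. Calling delete_sample(vcf_uri, sample_to_delete) would then be the script equivalent of the notebook cell, assuming the CLI is installed.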

Print the list of samples in the dataset and repeat the earlier read to show the state of the dataset after the deletion. The read still requests both samples, but only rows for HG00097 come back.

Python

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Print a list of samples in the dataset
print("Samples in the dataset:", ds.samples())

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

Samples in the dataset: ['HG00097']

	sample_name	contig	pos_start	pos_end	alleles	fmt_GT
0	HG00097	chr21	8220186	8220194	[TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]	[1, 2]
1	HG00097	chr21	8220187	8220198	[C, <NON_REF>]	[-1, -1]
2	HG00097	chr21	8220199	8220199	[C, <NON_REF>]	[0, 0]
3	HG00097	chr21	8220200	8220200	[T, <NON_REF>]	[0, 0]
4	HG00097	chr21	8220201	8220201	[C, <NON_REF>]	[0, 0]
...	...	...	...	...	...	...
3873	HG00097	chr21	8405369	8405369	[C, <NON_REF>]	[0, 0]
3874	HG00097	chr21	8405370	8405411	[T, <NON_REF>]	[0, 0]
3875	HG00097	chr21	8405412	8405523	[T, <NON_REF>]	[0, 0]
3876	HG00097	chr21	8405524	8405572	[C, <NON_REF>]	[0, 0]
3877	HG00097	chr21	8405573	8405579	[ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...	[0, 1]

3878 rows × 6 columns
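
The row count drops from 7342 to 3878, which suggests that 3464 of the original rows belonged to HG00096, assuming the rows for HG00097 are unchanged. A check like this can also be automated. The sketch below counts rows per sample from any iterable of sample names, so the sample_name column of the DataFrame (df["sample_name"]) could be passed in directly. The function names are hypothetical and the short lists are toy data, not real query results.

```python
from collections import Counter


def rows_per_sample(sample_names):
    """Count how many rows each sample contributes."""
    return Counter(sample_names)


def check_sample_deleted(sample_names, deleted_sample):
    """Return True if no row belongs to the deleted sample."""
    return rows_per_sample(sample_names)[deleted_sample] == 0


# Toy stand-ins for df["sample_name"] before and after the deletion
before = ["HG00096", "HG00097", "HG00096", "HG00097", "HG00097"]
after = [name for name in before if name != "HG00096"]

counts_before = rows_per_sample(before)
counts_after = rows_per_sample(after)

deleted_before = check_sample_deleted(before, "HG00096")  # False
deleted_after = check_sample_deleted(after, "HG00096")    # True

# Rows that belonged to the deleted sample in the toy data
rows_removed = len(before) - len(after)
```

Counter returns zero for missing keys, so the check works even when the deleted sample does not appear in the result at all.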

Clean up

Clean up the created TileDB-VCF dataset.

Python

# Remove the VCF dataset created in this tutorial
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)