Learn how to delete samples from a TileDB-VCF dataset.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
Because TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
This tutorial guides you through a short example of deleting samples from a TileDB-VCF dataset. Since deleting samples requires a dataset containing samples, some steps are repeated from the Basic Ingestion tutorial.
Setup
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
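The setup step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the dataset path `vcf_uri` (placed in the system temporary directory) and the helper `remove_local_dataset` are hypothetical names chosen for illustration, and the cleanup only handles datasets on local storage. The `tiledbvcf` import is wrapped in a `try` block only so that the path handling can be run on its own.

```python
import os
import shutil
import tempfile

try:
    import tiledbvcf  # used by the ingestion and read steps that follow
except ImportError:
    tiledbvcf = None  # the path handling below still runs without the package

# Hypothetical local path for the dataset; pick any location you prefer.
vcf_uri = os.path.join(tempfile.gettempdir(), "vcf_delete_samples_demo")


def remove_local_dataset(uri: str) -> bool:
    """Delete a local dataset directory if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    if os.path.isdir(uri):
        shutil.rmtree(uri)
        return True
    return False


# Start from a clean slate so that dataset creation does not fail
# because of a leftover dataset from a previous run.
remove_local_dataset(vcf_uri)
```

Removing the directory up front keeps the tutorial repeatable: creating a dataset at a path that already holds one would otherwise fail.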
# Specify the samples to be ingested
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = ["HG00096_chr21.gvcf.gz", "HG00097_chr21.gvcf.gz"]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)
Print a list of samples in the dataset and read some data to show the state of the dataset before deleting samples.
# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Print a list of samples in the dataset
print("Samples in the dataset:", ds.samples())

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df
Samples in the dataset: ['HG00096', 'HG00097']
      sample_name  contig  pos_start  pos_end  alleles                                            fmt_GT
0     HG00096      chr21   8220186    8220206  [TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...  [0, 1]
1     HG00097      chr21   8220186    8220194  [TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]        [1, 2]
2     HG00096      chr21   8220187    8220208  [C, <NON_REF>]                                     [-1, -1]
3     HG00097      chr21   8220187    8220198  [C, <NON_REF>]                                     [-1, -1]
4     HG00097      chr21   8220199    8220199  [C, <NON_REF>]                                     [0, 0]
...   ...          ...     ...        ...      ...                                                ...
7337  HG00097      chr21   8405412    8405523  [T, <NON_REF>]                                     [0, 0]
7338  HG00096      chr21   8405524    8405572  [C, <NON_REF>]                                     [0, 0]
7339  HG00097      chr21   8405524    8405572  [C, <NON_REF>]                                     [0, 0]
7340  HG00096      chr21   8405573    8405579  [ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]     [0, 1]
7341  HG00097      chr21   8405573    8405579  [ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...  [0, 1]

7342 rows × 6 columns
Delete Samples
To delete a sample from the VCF dataset, provide the dataset URI and sample name to the delete command of the TileDB-VCF CLI.
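One way to invoke the CLI from the same Python session is sketched below. This is an illustration under assumptions, not the tutorial's own code: the helper names `build_delete_command` and `delete_samples_via_cli` are hypothetical, and the flags `--uri` and `--sample-names` (with a comma-separated list of names) are assumed; confirm them with `tiledbvcf delete --help` for your installed version.

```python
import shutil
import subprocess
from typing import List


def build_delete_command(vcf_uri: str, sample_names: List[str]) -> List[str]:
    """Build the argument list for the `tiledbvcf delete` command."""
    if not sample_names:
        raise ValueError("provide at least one sample name to delete")
    # Assumed flags: --uri for the dataset, --sample-names for a
    # comma-separated list of samples. Verify with `tiledbvcf delete --help`.
    return [
        "tiledbvcf",
        "delete",
        "--uri",
        vcf_uri,
        "--sample-names",
        ",".join(sample_names),
    ]


def delete_samples_via_cli(vcf_uri: str, sample_names: List[str]) -> None:
    """Run the delete command, failing loudly if the CLI is unavailable."""
    cmd = build_delete_command(vcf_uri, sample_names)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("the tiledbvcf CLI was not found on PATH")
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(cmd, check=True)
```

For example, `delete_samples_via_cli(vcf_uri, ["HG00096"])` would request deletion of the sample HG00096 from the dataset at `vcf_uri`. Passing the arguments as a list rather than a shell string avoids quoting problems in paths and sample names.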
# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Print a list of samples in the dataset
print("Samples in the dataset:", ds.samples())

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df