Learn how to add new samples to a TileDB-VCF dataset, solving the N+1 problem.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
This tutorial shows how to efficiently add new samples to an existing TileDB-VCF dataset. For details on why TileDB-VCF can add new samples rapidly, see the Key Concepts: N+1 Problem section.
Setup
First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.
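The setup step described above might look like the following minimal sketch. The helper name `reset_dataset_path` and the dataset location under the system temporary directory are assumptions made for illustration; any writable local path works. The only other library the tutorial needs is the `tiledbvcf` Python package (`import tiledbvcf`).

```python
import os
import shutil
import tempfile


def reset_dataset_path(path):
    """Delete a leftover dataset directory at `path`, if any, and return the path."""
    if os.path.exists(path):
        shutil.rmtree(path)
    return path


# Hypothetical local dataset URI (an assumption, not the tutorial's own path).
vcf_uri = reset_dataset_path(
    os.path.join(tempfile.gettempdir(), "vcf_add_samples_demo")
)
```

Starting from an empty path keeps repeated runs of the tutorial from failing because a dataset with the same name already exists.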
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris
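Before new samples can be added, the dataset has to exist and hold the first batch. A minimal sketch of that initial ingestion follows, assuming the `tiledbvcf` Python package is installed. The helper name `create_and_ingest` and its `dataset_uri` argument are illustrative and not part of the original tutorial. Depending on your environment, reading from the S3 bucket may require region or credential configuration.

```python
def create_and_ingest(dataset_uri, sample_uris):
    """Create an empty TileDB-VCF dataset and ingest the first batch of samples."""
    # Imported here so the helper can be defined even where tiledbvcf is missing.
    import tiledbvcf

    # Open the (not yet existing) dataset in write mode.
    ds = tiledbvcf.Dataset(uri=dataset_uri, mode="w")
    # Create the empty dataset on disk; this is needed only once.
    ds.create_dataset()
    # Ingest the initial samples.
    ds.ingest_samples(sample_uris=sample_uris)


# Example call (dataset_uri is a placeholder for your dataset path):
# create_and_ingest(dataset_uri, sample_uris)
```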
Adding new samples to an existing TileDB-VCF dataset is almost identical to the initial ingestion. You just need to open the dataset in write mode and ingest the specified samples.
samples_to_add = [
    "HG00102_chr21.gvcf.gz",
    "HG00103_chr21.gvcf.gz",
    "HG00105_chr21.gvcf.gz",
    "HG00106_chr21.gvcf.gz",
    "HG00107_chr21.gvcf.gz",
]
new_sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_add]
new_sample_uris
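With the new URIs in hand, the ingestion itself can be sketched as below, assuming the `tiledbvcf` Python package is installed and the dataset already exists. The helper name `add_samples` and its `dataset_uri` argument are illustrative and not part of the original tutorial. Unlike the initial ingestion, no dataset creation call is involved, since the dataset is already there.

```python
def add_samples(dataset_uri, new_sample_uris):
    """Ingest additional samples into an existing TileDB-VCF dataset.

    Returns the sample names stored in the dataset after the ingestion.
    """
    # Imported here so the helper can be defined even where tiledbvcf is missing.
    import tiledbvcf

    # Open the existing dataset in write mode and ingest only the new samples.
    ds = tiledbvcf.Dataset(uri=dataset_uri, mode="w")
    ds.ingest_samples(sample_uris=new_sample_uris)

    # Reopen in read mode to list every sample now stored in the dataset.
    return tiledbvcf.Dataset(uri=dataset_uri, mode="r").samples()


# Example call (dataset_uri is a placeholder for your dataset path):
# all_samples = add_samples(dataset_uri, new_sample_uris)
```

Listing the samples afterwards is a quick way to confirm that both the original and the newly added samples are present.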