Population Genomics Quickstart

This tutorial covers the basics of working with VCF data using TileDB-VCF.
How to run this tutorial

You can run this tutorial in two ways:

  1. Locally on your machine.
  2. On TileDB Cloud.

However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installation or deployment.

This quickstart provides a rapid introduction to TileDB-VCF and its capabilities. It covers how to:

  • Create a VCF dataset and add new human VCF samples to it.
  • Run region and sample queries.
  • Export data back to VCF.

Installation

You should familiarize yourself with Jupyter notebooks to run data exploration and analysis efficiently. You can review Jupyter’s documentation on installing and running notebooks.

The following libraries and programs need to be installed:

  • TileDB-VCF, which provides methods for importing, exporting, and querying variant data
  • TileDB-Py, the Python wrapper of TileDB Embedded (to work with TileDB arrays)
  • NumPy (to handle data with Python)
  • pandas (to view and manipulate dataframes)
  • Apache Arrow (to boost performance via zero-copy data transfer to pandas)

Conda and mamba are the preferred mechanisms for installing TileDB-VCF.

# Enter the following two lines if you are on an M1 Mac (forces Intel osx-64 builds)
export CONDA_SUBDIR=osx-64
conda config --env --set subdir osx-64

# create the conda environment
conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf

# mamba is a faster and more reliable alternative to conda
conda install -c conda-forge mamba

# Install TileDB-Py and TileDB-VCF, along with other useful libraries
mamba install -y -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy
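
After the installation completes, you can verify that the packages import correctly. This check reuses tiledb.version() and tiledbvcf.version, the same version attributes printed later in this tutorial:

# Confirm the environment works by printing the library versions
python -c "import tiledb, tiledbvcf; print(tiledb.version(), tiledbvcf.version)"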

Setup

Start by importing the libraries used in this tutorial, setting the URI of the local VCF dataset into which you will ingest the VCF samples, and cleaning up any older datasets with the same name.

import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/my_vcf_dataset")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Clean up combined VCF
combined_uri = os.path.expanduser("~/combined.vcf")
if os.path.exists(combined_uri):
    os.remove(combined_uri)

# Clean up single VCFs
HG00097_uri = os.path.expanduser("~/HG00097.vcf")
if os.path.exists(HG00097_uri):
    os.remove(HG00097_uri)
HG00101_uri = os.path.expanduser("~/HG00101.vcf")
if os.path.exists(HG00101_uri):
    os.remove(HG00101_uri)
TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2

Ingestion

This process ingests the VCF data directly from a public S3 bucket into a local VCF dataset, without needing to download the source VCF files beforehand. The ingestion should take about a minute on a laptop. Most of that time is spent on S3 throughput and on the parsing performed by htslib, the library TileDB-VCF uses internally to read the VCF files.

You will use samples from the latest DRAGEN 3.5.7b re-analysis of the 1000 Genomes dataset. Specify the samples to ingest as follows.

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris
['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']

To ingest these samples locally, run the following:

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

The created VCF dataset is materialized as a directory on your local storage and is modeled as a TileDB group that contains various TileDB arrays. You can run !tree {vcf_uri} to see the file hierarchy inside the VCF dataset directory. For more details on the meaning of these TileDB objects, visit the Data Model section.
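
Alternatively, you can inspect the group programmatically with TileDB-Py. This is a minimal sketch, assuming the tiledb.Group API of recent TileDB-Py releases:

# List the members of the TileDB group that backs the VCF dataset
with tiledb.Group(vcf_uri) as grp:
    for member in grp:
        # Each member reports its object type (array or group) and URI
        print(member.type, member.uri)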

Reading

To access any information from the VCF dataset, you first need to open it in read (r) mode:

# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

You can now see what samples the dataset contains.

# Show which samples were ingested
ds.samples()
['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

These are the same as the ones you ingested.

Next, print what “attributes” (i.e., VCF fields) you can access in later queries.

# Show the attributes of the dataset
ds.attributes()
['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_AD',
 'fmt_AF',
 'fmt_DP',
 'fmt_F1R2',
 'fmt_F2R1',
 'fmt_GP',
 'fmt_GQ',
 'fmt_GT',
 'fmt_ICNT',
 'fmt_MB',
 'fmt_MIN_DP',
 'fmt_PL',
 'fmt_PRI',
 'fmt_PS',
 'fmt_SB',
 'fmt_SPL',
 'fmt_SQ',
 'id',
 'info',
 'info_DB',
 'info_DP',
 'info_END',
 'info_FS',
 'info_FractionInformativeReads',
 'info_LOD',
 'info_MQ',
 'info_MQRankSum',
 'info_QD',
 'info_R2_5P_bias',
 'info_ReadPosRankSum',
 'info_SOR',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']
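
The list mixes built-in columns (such as alleles, pos_start, and sample_name) with INFO and FORMAT fields, prefixed with info_ and fmt_ respectively. If you want only one of these categories, you can filter the list; this sketch assumes your TileDB-VCF version supports the attr_type argument of attributes():

# List only the INFO fields (assumes attributes() accepts attr_type)
print(ds.attributes(attr_type="info"))

# List only the FORMAT fields
print(ds.attributes(attr_type="fmt"))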

You can read data with the .read() method, which allows you to select samples, attributes, and genomic regions on which to slice:

# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df
sample_name contig pos_start pos_end alleles fmt_GT
0 HG00096 chr21 8220186 8220206 [TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T... [0, 1]
1 HG00097 chr21 8220186 8220194 [TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>] [1, 2]
2 HG00096 chr21 8220187 8220208 [C, <NON_REF>] [-1, -1]
3 HG00097 chr21 8220187 8220198 [C, <NON_REF>] [-1, -1]
4 HG00097 chr21 8220199 8220199 [C, <NON_REF>] [0, 0]
... ... ... ... ... ... ...
7337 HG00097 chr21 8405412 8405523 [T, <NON_REF>] [0, 0]
7338 HG00096 chr21 8405524 8405572 [C, <NON_REF>] [0, 0]
7339 HG00097 chr21 8405524 8405572 [C, <NON_REF>] [0, 0]
7340 HG00096 chr21 8405573 8405579 [ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>] [0, 1]
7341 HG00097 chr21 8405573 8405579 [ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT... [0, 1]

7342 rows × 6 columns
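
The .read() method returns a pandas DataFrame. Because you installed Apache Arrow, you can also fetch the results as an Arrow table, which converts to pandas without a copy; this sketch assumes the read_arrow() method, which mirrors the arguments of read():

# Same query, returned as a pyarrow.Table instead of a pandas DataFrame
table = ds.read_arrow(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
# Zero-copy conversion to pandas
df = table.to_pandas()

Also note that if a result is larger than the configured memory budget, read() returns only the first batch; the TileDB-VCF Python API exposes ds.read_completed() to check whether the read finished and ds.continue_read() to fetch the remaining batches.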

Once you have queried the data of interest, you can manipulate it with pandas in a variety of ways. For example:

# Pivot so that each sample is a column and it displays the GT
df["fmt_GT"] = df["fmt_GT"].apply(lambda x: "/".join(map(str, x)))
df["fmt_GT"] = df["fmt_GT"].apply(lambda x: "./." if x == "-1/-1" else x)
df.pivot(index="pos_start", columns="sample_name", values="fmt_GT")
sample_name HG00096 HG00097
pos_start
8220186 0/1 1/2
8220187 ./. ./.
8220199 NaN 0/0
8220200 NaN 0/0
8220201 NaN 0/0
... ... ...
8405370 0/0 0/0
8405409 0/0 NaN
8405412 NaN 0/0
8405524 0/0 0/0
8405573 0/1 0/1

5824 rows × 2 columns

VCF export

TileDB-VCF ingests VCF samples losslessly, which allows you to export the data back to the original single-sample VCF format or to a combined VCF format.

To export samples into their original (single-sample) VCF files, you can run the following:

# Export two VCF samples
ds.export(
    regions=["chr21:8220186-8405573"],
    samples=["HG00101", "HG00097"],
    output_format="v",
    output_dir=os.path.expanduser("~"),
)

Two single-sample VCF files were created in your home directory. You can use bcftools to confirm that the files were exported correctly.

!bcftools view --no-header {HG00101_uri} | head -10
chr21   8220186 .   TCTCCCTCCCTCCCTCC   T,TCTCCCTCC,CCTCCCTCCCTCCCTCC,<NON_REF> 44.45   PASS    END=8220202;DP=162;MQ=50.71;MQRankSum=-3.588;ReadPosRankSum=-2.546;FractionInformativeReads=0.636;R2_5P_bias=0  GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB   0/1:27,47,24,5,0:0.456,0.233,0.049,0:103:11,21,11,1,0:16,26,13,4,0:40:46,0,39,486,446,494,1263,758,1099,806,743,529,903,1339,1125:255,0,255:40,29:44.453,0.0004173,42.201,450,448.53,450,450,450,450,450,450,450,450,450,450:0,2,5,2,4,5,34.77,36.77,36.77,37.77,34.77,36.77,36.77,69.54,37.77:0,27,4,72:16,11,41,35
chr21   8220187 .   C   <NON_REF>   .   LowGQ   END=8220202 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:318,65:383:0:370:0,0,0:37,0,255:40,30
chr21   8220203 .   C   <NON_REF>   .   PASS    END=8220203 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:334,56:390:46:390:0,46,8757:0,46,255:40,2
chr21   8220204 .   T   <NON_REF>   .   LowGQ   END=8220204 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:228,162:390:0:390:0,0,4267:255,0,255:40,2
chr21   8220205 .   C   <NON_REF>   .   PASS    END=8220205 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:349,37:386:99:386:0,120,1800:0,255,255:40,2
chr21   8220206 .   C   <NON_REF>   .   LowGQ   END=8220206 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:255,133:388:0:388:0,0,3549:255,0,255:40,2
chr21   8220207 .   C   <NON_REF>   .   PASS    END=8220207 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:348,44:392:99:392:0,120,1800:0,255,255:40,0
chr21   8220208 .   T   <NON_REF>   .   LowGQ   END=8220208 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:235,162:397:0:397:0,0,4234:255,0,255:40,0
chr21   8220209 .   C   <NON_REF>   .   PASS    END=8220209 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:360,34:394:99:394:0,120,1800:0,255,255:40,0
chr21   8220210 .   C   <NON_REF>   .   LowGQ   END=8220210 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0/0:320,65:385:0:385:0,0,7583:22,0,255:40,0

Note that, for convenience, you can export entire samples or any set of genomic regions.
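
For example, omitting the regions argument exports all records for the selected samples. A minimal sketch, assuming your TileDB-VCF version treats a missing regions argument as "all regions":

# Export one full sample, with no region restriction
ds.export(
    samples=["HG00097"],
    output_format="v",  # "v" = plain VCF; "z" (bgzipped VCF) is also assumed to work, per bcftools conventions
    output_dir=os.path.expanduser("~"),
)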

To export multiple samples into a single combined VCF file, run:

# Export to combined VCF
ds.export(
    regions=["chr21:8220186-8405573"],
    samples=ds.samples()[0:2],
    merge=True,  # this will create a combined VCF file
    output_format="v",
    output_path=combined_uri,
)

You can confirm that the data has been exported correctly using bcftools:

!bcftools view --no-header {combined_uri} | head -10
chr21   8220186 .   TCTCCCTCCCTCCCTCCCTCC   T,TCTCC,TCTCCCTCC,TCTCCCTCCCTCC,TCTCCCTCCCTCCCTCC,<NON_REF>,CCTCCCTCCCTCCCTCCCTCC   228.26  PASS    R2_5P_bias=0;FractionInformativeReads=0.754;ReadPosRankSum=-0.751;MQ=51.65;DP=461;MQRankSum=0.09;END=8220206    GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB   0/1:45,97,10,33,24,6,0,.:0.451,0.047,0.153,0.112,0.028,0,.:215:18,59,8,18,12,3,0,.:27,38,2,15,12,3,0,.:26:48,0,23,3973,1791,1840,1536,349,2890,398,2171,323,3253,2104,372,3480,977,5679,2614,3195,1025,2130,569,3076,2115,2436,3079,2592,.,.,.,.,.,.,.,.:255,0,255:40,107:46.261,0.010693,26.134,448.53,450,450,450,351.46,450,400.71,450,325.43,448.53,450,374.68,448.53,450,448.53,450,450,450,450,450,450,450,450,450,450,.,.,.,.,.,.,.,.:0,2,5,2,4,5,2,4,4,5,2,4,4,4,5,2,4,4,4,4,5,34.77,36.77,36.77,36.77,36.77,36.77,37.77,.,.,.,.,.,.,.,.:11,34,7,163:24,21,75,95    4/5:5,.,.,.,82,16,0,1:.,.,.,0.788,0.154,0,0.01:104:1,.,.,.,44,6,0,1:4,.,.,.,38,10,0,0:7:231,.,.,.,.,.,.,.,.,.,182,.,.,.,5,868,.,.,.,0,48,1091,.,.,.,212,792,1002,1804,.,.,.,482,1202,1425,531:255,0,255:40,41:228.26,.,.,.,.,.,.,.,.,.,180.71,.,.,.,7.0947,450,.,.,.,0.94329,49.937,450,.,.,.,245.54,450,450,450,.,.,.,450,450,450,450:0,.,.,.,.,.,.,.,.,.,2,.,.,.,5,2,.,.,.,4,5,34.77,.,.,.,36.77,36.77,37.77,34.77,.,.,.,36.77,36.77,69.54,37.77:0,5,0,99:4,1,48,51
chr21   8220187 .   C   <NON_REF>   .   LowGQ   END=8220208 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:289,140:429:0:398:0,0,0:255,0,255:40,108    ./.:247,121:368:0:353:0,0,0:255,0,255:40,42
chr21   8220199 .   C   <NON_REF>   .   PASS    END=8220199 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:328,36:364:99:364:0,120,1800:0,255,255:40,7
chr21   8220200 .   T   <NON_REF>   .   LowGQ   END=8220200 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:203,158:361:0:361:0,0,2749:255,0,255:40,7
chr21   8220201 .   C   <NON_REF>   .   PASS    END=8220201 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:317,48:365:87:365:0,87,8403:0,87,255:40,7
chr21   8220202 .   C   <NON_REF>   .   LowGQ   END=8220202 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   ./.:156,201:357:0:357:0,0,0:255,0,255:40,7
chr21   8220203 .   C   <NON_REF>   .   PASS    END=8220203 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:331,33:364:99:364:0,120,1800:0,255,255:40,6
chr21   8220204 .   T   <NON_REF>   .   LowGQ   END=8220204 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:219,137:356:0:356:0,0,4292:255,0,255:40,6
chr21   8220205 .   C   <NON_REF>   .   PASS    END=8220205 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:316,46:362:99:362:0,120,1800:0,251,255:40,6
chr21   8220206 .   C   <NON_REF>   .   LowGQ   END=8220206 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  ./.:.:.:.:.:.:.:.   0/0:207,155:362:0:362:0,0,1420:255,0,255:40,6

Clean up

Clean up at the end by deleting the VCF dataset and the generated VCF files.

# Delete the VCF dataset
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)

# Clean up combined VCF
if os.path.exists(combined_uri):
    os.remove(combined_uri)

# Clean up single VCFs
if os.path.exists(HG00097_uri):
    os.remove(HG00097_uri)
if os.path.exists(HG00101_uri):
    os.remove(HG00101_uri)