Basic Queries

Run basic read queries on a TileDB-VCF dataset.
How to run this tutorial

You can run this tutorial in two ways:

  1. Locally on your machine.
  2. On TileDB Cloud.

However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.

This tutorial demonstrates how to ingest a small TileDB-VCF dataset and run some simple queries against it.

Setup

First, import the necessary libraries, set the TileDB-VCF dataset URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created datasets with the same name.

  • Python
import os.path
import shutil

import tiledb
import tiledbvcf

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))

# Set VCF dataset URI
vcf_uri = os.path.expanduser("~/basic_queries")

# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)
TileDB core version: (2, 24, 2)
TileDB-Py version: (0, 30, 2)
TileDB-VCF version: 0.33.2
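
The clean-up step above can be wrapped in a small helper so that the tutorial stays re-runnable. The following is a minimal sketch using only the Python standard library. The name `reset_local_dataset` is hypothetical (it is not part of TileDB-VCF), and it assumes the dataset lives on local storage, as in this tutorial; it would not work for S3 or TileDB Cloud URIs.

```python
import os
import shutil
import tempfile


def reset_local_dataset(path):
    """Delete a local dataset directory if present.

    Hypothetical helper, not a TileDB-VCF API. Returns True if a
    directory was removed and False if there was nothing to delete.
    """
    expanded = os.path.expanduser(path)
    if os.path.isdir(expanded):
        shutil.rmtree(expanded)
        return True
    return False


# Demonstration on a throwaway directory rather than a real dataset
demo_dir = tempfile.mkdtemp(prefix="basic_queries_demo_")
first_call = reset_local_dataset(demo_dir)   # directory exists, so it is removed
second_call = reset_local_dataset(demo_dir)  # nothing left to remove
print(first_call, second_call)
```

Returning a boolean makes it easy to log whether a stale dataset was actually found before the tutorial recreates it.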

Ingestion

Specify the samples to be ingested, which are readily available on a TileDB-owned public S3 bucket.

  • Python
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = [
    "HG00096_chr21.gvcf.gz",
    "HG00097_chr21.gvcf.gz",
    "HG00099_chr21.gvcf.gz",
    "HG00100_chr21.gvcf.gz",
    "HG00101_chr21.gvcf.gz",
]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris
['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']
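
The file names above follow a `<sample>_<contig>.gvcf.gz` pattern, so you can guess the sample name from each URI. The sketch below is only a convenience under that assumption: the sample names stored in the dataset come from the VCF headers, not from the file names, and `sample_name_from_uri` is a hypothetical helper, not a TileDB-VCF function.

```python
import posixpath


def sample_name_from_uri(uri):
    """Return the text before the first underscore in the file name.

    Assumes the <sample>_<contig>.gvcf.gz naming used by the files in
    this tutorial; other naming schemes need a different rule.
    """
    file_name = posixpath.basename(uri)
    return file_name.split("_", 1)[0]


vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
example_uri = f"{vcf_bucket}/HG00096_chr21.gvcf.gz"
print(sample_name_from_uri(example_uri))  # HG00096
```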

Next, create a TileDB-VCF dataset and ingest the samples into it.

  • Python
# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create empty VCF dataset
ds.create_dataset()

# Ingest samples
ds.ingest_samples(sample_uris=sample_uris)

Reading

Whenever you want to read from a dataset, you need to open it in read mode.

  • Python
# Open the Dataset in read mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

To get the ingested samples, run the following:

  • Python
samples = ds.samples()
print(f"Number of samples = {len(samples)}")
Number of samples = 5
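
Beyond counting, you may want to confirm that the ingested samples are the ones you expected. The following sketch compares two lists of names with plain set arithmetic. The function `compare_samples` is hypothetical, and the `ingested` list is a stand-in for the value returned by `ds.samples()`, so the block runs without a dataset.

```python
def compare_samples(expected, ingested):
    """Return (missing, unexpected) sample names as sorted lists.

    missing:    names in `expected` that are absent from `ingested`
    unexpected: names in `ingested` that are absent from `expected`
    """
    expected_set, ingested_set = set(expected), set(ingested)
    missing = sorted(expected_set - ingested_set)
    unexpected = sorted(ingested_set - expected_set)
    return missing, unexpected


expected = ["HG00096", "HG00097", "HG00099", "HG00100", "HG00101"]
# Stand-in for ds.samples(); one sample is deliberately left out
ingested = ["HG00096", "HG00097", "HG00099", "HG00100"]

missing, unexpected = compare_samples(expected, ingested)
print(missing, unexpected)  # ['HG00101'] []
```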

You can see which VCF fields (i.e., attributes) exist in the dataset as follows.

  • Python
ds.attributes()
['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_AD',
 'fmt_AF',
 'fmt_DP',
 'fmt_F1R2',
 'fmt_F2R1',
 'fmt_GP',
 'fmt_GQ',
 'fmt_GT',
 'fmt_ICNT',
 'fmt_MB',
 'fmt_MIN_DP',
 'fmt_PL',
 'fmt_PRI',
 'fmt_PS',
 'fmt_SB',
 'fmt_SPL',
 'fmt_SQ',
 'id',
 'info',
 'info_DB',
 'info_DP',
 'info_END',
 'info_FS',
 'info_FractionInformativeReads',
 'info_LOD',
 'info_MQ',
 'info_MQRankSum',
 'info_QD',
 'info_R2_5P_bias',
 'info_ReadPosRankSum',
 'info_SOR',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']
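
The attribute names follow a visible convention: names starting with `info_` and `fmt_` correspond to individual VCF INFO and FORMAT fields, while the rest are fixed columns or query helpers. As a rough sketch, you can group a list of names by prefix to get an overview. The function `group_attributes` is hypothetical, and the short input list below is a hand-picked subset of the output above, not a call to `ds.attributes()`.

```python
def group_attributes(attributes):
    """Group attribute names by their 'info_' or 'fmt_' prefix.

    Names without either prefix, including the bare 'info' and 'fmt'
    attributes, are collected under 'other'.
    """
    groups = {"info": [], "fmt": [], "other": []}
    for name in attributes:
        if name.startswith("info_"):
            groups["info"].append(name)
        elif name.startswith("fmt_"):
            groups["fmt"].append(name)
        else:
            groups["other"].append(name)
    return groups


subset = ["alleles", "fmt", "fmt_GT", "info", "info_DP", "pos_start"]
groups = group_attributes(subset)
print(groups["info"])   # ['info_DP']
print(groups["fmt"])    # ['fmt_GT']
print(groups["other"])  # ['alleles', 'fmt', 'info', 'pos_start']
```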

You can read from the dataset by specifying a subset of attributes, a subset of samples, and a specific genomic range.

  • Python
# Read a chromosome region, and subset on samples and attributes
df = ds.read(
    regions=["chr21:8220186-8405573"],
    samples=["HG00096", "HG00097"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df
      sample_name  contig  pos_start  pos_end  alleles                                            fmt_GT
0     HG00096      chr21   8220186    8220206  [TCTCCCTCCCTCCCTCCCTCC, T, TCTCC, TCTCCCTCC, T...  [0, 1]
1     HG00097      chr21   8220186    8220194  [TCTCCCTCC, T, TCTCC, CCTCCCTCC, <NON_REF>]        [1, 2]
2     HG00096      chr21   8220187    8220208  [C, <NON_REF>]                                     [-1, -1]
3     HG00097      chr21   8220187    8220198  [C, <NON_REF>]                                     [-1, -1]
4     HG00097      chr21   8220199    8220199  [C, <NON_REF>]                                     [0, 0]
...   ...          ...     ...        ...      ...                                                ...
7337  HG00097      chr21   8405412    8405523  [T, <NON_REF>]                                     [0, 0]
7338  HG00096      chr21   8405524    8405572  [C, <NON_REF>]                                     [0, 0]
7339  HG00097      chr21   8405524    8405572  [C, <NON_REF>]                                     [0, 0]
7340  HG00096      chr21   8405573    8405579  [ATGTGTG, ATGTG, A, ATG, ATGTGTGTG, <NON_REF>]     [0, 1]
7341  HG00097      chr21   8405573    8405579  [ATGTGTG, ATG, ATGTG, A, ATGTGTGTGTGTG, ATGTAT...  [0, 1]

7342 rows × 6 columns
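
Each `fmt_GT` value in the result holds allele indexes into the matching `alleles` list, where index 0 is the reference allele. The sketch below labels a genotype from those indexes. It rests on two assumptions that the tutorial does not state: a negative index (such as the `-1` values above) marks a missing allele call, and any genotype containing a missing call is labeled as missing. The function `classify_genotype` is hypothetical.

```python
def classify_genotype(gt):
    """Label a genotype given its allele indexes (0 = reference allele).

    Assumption: negative indexes denote missing calls, and a genotype
    with any missing call is labeled 'missing'.
    """
    alleles = list(gt)
    if not alleles or any(a < 0 for a in alleles):
        return "missing"
    if all(a == 0 for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"


# Genotypes taken from the rows shown above
for gt in ([0, 1], [1, 2], [-1, -1], [0, 0]):
    print(gt, classify_genotype(gt))
```

With the DataFrame from the read above, a call such as `df["fmt_GT"].apply(classify_genotype)` would produce one label per row.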

Clean up the created TileDB-VCF dataset.

  • Python
# Clean up VCF dataset if it already exists
if os.path.exists(vcf_uri):
    shutil.rmtree(vcf_uri)