Genomics (VCF)

How to manage VCF data in TileDB.
TileDB is ideal for modeling and efficiently processing genomic variants

TileDB builds its entire data engine around the multi-dimensional array, a data structure that adapts to any data modality, no matter how complex. VCF datasets use this format: each one is made up of 3-dimensional sparse arrays. You can learn more about VCF datasets and population genomics at Structure: Population Genomics.

Add VCF

TileDB Cloud can ingest a batch of single-sample VCF files into a VCF dataset and add the dataset to the catalog, all in one step. The source VCF files are read from a cloud object store and written into TileDB arrays defined by TileDB-VCF. You can learn about the mechanics, theoretical background, and detailed tutorials in the Structure: Population Genomics section.

You can add a VCF dataset to the TileDB catalog in one of two ways:

  • Create and register in one step: TileDB offers one-click ingestion in the TileDB UI.
  • Create first, register later: Alternatively, you can first create a VCF dataset in your object store without cataloging it in TileDB. At any later point, you can register the existing VCF dataset with TileDB using the UI or an API command.

Create and register in one step

To ingest a batch of VCF files into a VCF dataset, perform the following steps:

  1. Navigate to the Assets section.

  2. Select Add asset.

  3. Select Data as the type of asset to add.

    A modal for selecting which type of asset to add.

  4. Expand the Life sciences category and select VCF.

    A modal for selecting the type of data to add.

  5. Select Ingest VCF dataset.

    A modal for selecting the input method for adding the VCF dataset.

  6. Add a VCF name of your choice, which will be the VCF dataset asset name on TileDB Cloud.

  7. Select the Cloud credentials used to access cloud storage.

  8. Add a Source path, which is the cloud storage URI that will be recursively searched.

  9. Add a Matching pattern, which is a glob pattern used to select the VCF files being ingested.

    A modal for configuring the scalable VCF ingestion job.

  10. Select Ingest.
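
Before starting an ingestion, it can help to check locally which files a Matching pattern would select. The following is a minimal sketch that uses Python's standard `fnmatch` module. It assumes the pattern is compared against each file's base name, which may differ from how TileDB evaluates the pattern. The helper `select_vcf_files` and the bucket paths are hypothetical and only serve as an illustration.

```python
import fnmatch
import posixpath


def select_vcf_files(paths, pattern):
    """Return the paths whose file name matches the glob pattern.

    Assumption: the pattern is applied to the base name only.
    """
    return [
        path
        for path in paths
        if fnmatch.fnmatchcase(posixpath.basename(path), pattern)
    ]


# Hypothetical listing of a recursively searched source path.
candidate_paths = [
    "s3://my-bucket/vcfs/sample1.vcf.gz",
    "s3://my-bucket/vcfs/batch2/sample2.vcf.gz",
    "s3://my-bucket/vcfs/batch2/sample2.vcf.gz.tbi",
    "s3://my-bucket/vcfs/README.txt",
]

# Only the two compressed VCF files match; the index and README do not.
selected = select_vcf_files(candidate_paths, "*.vcf.gz")
print(selected)
```

A pattern such as `*.vcf.gz` excludes index files (`.tbi`) that often sit next to the VCF files in the same folder.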

The ingestion task graphs can be viewed in the Logs - Task graphs monitor.

Once the ingestion completes, you can view the VCF dataset in Assets.

For steps on how to programmatically create and register a VCF dataset in TileDB, review the Population Genomics: Basic TileDB Cloud tutorial.

Create first, register later

Assuming you have already created a VCF dataset and stored it in S3, perform the following steps to register it in the UI:

  1. Navigate to the Assets section.

  2. Select Add asset.

  3. Select Data as the type of asset to add.

    A modal for selecting which type of asset to add.

  4. Expand the Life sciences category and select VCF.

    A modal for selecting the type of data to add.

  5. Select Register VCF group.

    A modal for selecting the input method for adding the VCF dataset.

  6. Choose the appropriate Cloud credentials to access the VCF.

  7. Specify the URI where the VCF dataset lives in the Register from… field. This is the full URI including the VCF dataset folder.

  8. Specify a meaningful VCF Name.

  9. Optionally specify a License for your VCF dataset. This is especially important if you make your VCF dataset public.

  10. Optionally specify Tags for your VCF dataset.

  11. Select Register.

    A modal for registering an existing VCF dataset in TileDB.

Overview

In this screen, you can find basic information about the VCF dataset:

  • VCF name - This appears at the very top of the screen, and consists of the account name and the name you gave the VCF dataset when you registered it.
  • TileDB URI - The uniform resource identifier (URI) you use to refer to the VCF dataset in code. It comprises the namespace identifier and the UUID of the VCF dataset.
  • Original URI - The location on cloud storage where the VCF dataset is physically stored.
  • UUID - The unique identifier for the VCF dataset.
  • Total number of assets - The total number of assets, which you can preview in the Contents tab.
  • Author - The author of the asset.
  • Permissions - What rights the current user has on this VCF dataset. Possible values are READ and ADMIN.
  • Region - The region in which the VCF dataset is stored on cloud storage.
  • Tags - Any tags attached to the VCF dataset for searchability purposes.
  • Description - If the user has provided a description for the VCF dataset (programmatically or in Settings), it is visible here. The description is indexed and searchable in the catalog, so it’s recommended to add a meaningful description for all your assets.

Referring to the VCF dataset programmatically

It is important to understand how to refer to your VCF dataset programmatically. You can do it in two ways:

  1. Using the TileDB URI format tiledb://<account>/<vcf_name>. This is the most user-friendly way. However, TileDB allows duplicate VCF dataset names, so if the name of your VCF dataset is not unique, this URI will throw an error.
  2. Using the TileDB URI from the asset’s Overview tab (that is, the URI with format tiledb://<account>/<UUID>). TileDB URIs that reference the asset’s UUID are unique, so this method will always work.
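
The two URI forms above can be sketched as small string helpers. This is an illustrative sketch only: `name_uri`, `uuid_uri`, and `is_uuid_uri` are hypothetical helpers (not part of any TileDB library), and the account name and UUID are made-up placeholders. The check relies on Python's standard `uuid` module, so a dataset name that happens to be 32 hexadecimal characters would also be classified as a UUID.

```python
import uuid


def name_uri(account, vcf_name):
    """Build the name-based form: tiledb://<account>/<vcf_name>."""
    return f"tiledb://{account}/{vcf_name}"


def uuid_uri(account, asset_uuid):
    """Build the UUID-based form: tiledb://<account>/<UUID>.

    Raises ValueError if asset_uuid is not a well-formed UUID.
    """
    return f"tiledb://{account}/{uuid.UUID(str(asset_uuid))}"


def is_uuid_uri(uri):
    """Return True if the last URI component parses as a UUID."""
    prefix = "tiledb://"
    if not uri.startswith(prefix):
        return False
    parts = uri[len(prefix):].split("/")
    if len(parts) != 2:
        return False
    try:
        uuid.UUID(parts[1])
    except ValueError:
        return False
    return True


# Placeholder values for illustration only.
example_uuid = "123e4567-e89b-12d3-a456-426614174000"
by_name = name_uri("my-account", "my-vcf-dataset")
by_uuid = uuid_uri("my-account", example_uuid)

print(by_name, is_uuid_uri(by_name))
print(by_uuid, is_uuid_uri(by_uuid))
```

Storing the UUID-based form in scripts and pipelines keeps them working when another dataset with the same name appears in the account.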

A screenshot of the 'Overview' tab of a VCF dataset in TileDB.

You can programmatically get overview information about the VCF dataset with the following command:

  • Python
import tiledb.cloud

# Returns the overview information about the VCF dataset.
tiledb.cloud.asset.info("tiledb://<account>/<vcf_name>")

Contents

In the Contents tab of a VCF dataset, you can see all of the assets that comprise the VCF dataset, along with their sizes, your permissions on those assets, and when the assets were last modified. You can also select one of the assets to view details about the asset in the catalog.

A screenshot of the 'Contents' tab of a VCF dataset in TileDB.

For information about the purpose of each asset in a VCF dataset, visit Population Genomics: Data Model.

Metadata

VCF datasets may be associated with metadata in the form of key-value pairs, which is visible in the Metadata tab.

A screenshot of the 'Metadata' tab of a VCF dataset in TileDB.

Sharing & activity

The Sharing screen allows you to securely share your VCF dataset with other TileDB users, whereas the Activity screen shows how the VCF dataset has been accessed by you or by any other user with whom you have shared it. Both are covered in detail in the Collaborate section.

Settings

In the VCF dataset settings, you can:

  • Add a description - Note that this is indexed and, thus, searchable in the TileDB catalog.
  • Mark the VCF as read-only - This is useful if you want to prevent any changes to the VCF dataset by you or by anyone with whom you shared it.
  • Make public - Use this if you wish to share the VCF dataset with all TileDB users. A public VCF dataset appears in the Marketplace tab in the left navigation menu. You can change a public VCF dataset back to private in the same manner.
  • Change cloud credentials - Credentials should be provided so that TileDB can securely access the VCF on the cloud store where it is physically stored.
  • Rename VCF - Take caution when renaming VCF datasets, as any URLs including the previous VCF dataset name will no longer work.
  • Delete VCF - Visit Delete VCF for more information.

The Settings tab of a VCF dataset in TileDB.

You can programmatically update some VCF dataset settings with the following command:

  • Python
import tiledb.cloud

tiledb.cloud.asset.update_info(
    uri="tiledb://<account>/<vcf_name>",
    description=None,  # Optional - A new description
    name=None,  # Optional - A new name for the VCF dataset
    tags=None,  # Optional - VCF dataset tags that will be searchable in the catalog
    credentials_name=None,  # Optional - The cloud credentials that access the VCF dataset (should already exist in your account settings)
)

To make a VCF dataset public programmatically, run the following:

  • Python
import tiledb.cloud

# Grant read access to the special "public" namespace.
tiledb.cloud.asset.share(
    "tiledb://<account>/<vcf_name>", namespace="public", permissions="read"
)

Rename VCF

A useful property of the TileDB catalog is that you can rename a VCF dataset without physically moving it. This avoids the expensive copy operations that object stores require when renaming or moving objects. You can rename VCF datasets from the Settings tab.

You can programmatically rename a VCF dataset as follows:

  • Python
import tiledb.cloud

# Rename the VCF dataset in the catalog; the data is not moved on storage.
tiledb.cloud.asset.update_info(
    "tiledb://<account>/<previous_name>", name="<new_name>"
)

Warning

Take caution when renaming VCF datasets, as any URLs including the previous VCF dataset name will no longer work.

Delete VCF

When deleting a VCF dataset, you have two options:

  • Unregister: This operation removes the VCF dataset from the TileDB catalog, but it does not physically remove it from the object store. Since the VCF dataset will persist on storage, you can register it again in the TileDB catalog in the future.
  • Delete: This operation both unregisters and physically removes the VCF dataset from storage. Note that this operation cannot be undone.

For both unregistering and deleting a VCF dataset, you have the option to apply the operation recursively. Unregistering a VCF dataset recursively removes both the VCF dataset and the arrays it contains from the TileDB catalog, but leaves all of them in your cloud object store. Deleting a VCF dataset recursively deletes both the VCF dataset and the arrays it contains. Deletion is permanent and cannot be undone.
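
The four combinations of operation and recursion can be summarized in a small conceptual model. This sketch does not call TileDB; `RemovalEffect` and `removal_effect` are hypothetical names used only to illustrate the behavior described above. It also assumes that a non-recursive operation leaves the contained arrays untouched, which the description implies but does not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemovalEffect:
    dataset_unregistered: bool          # dataset removed from the catalog
    arrays_unregistered: bool           # contained arrays removed from the catalog
    dataset_deleted_from_storage: bool  # dataset removed from the object store
    arrays_deleted_from_storage: bool   # contained arrays removed from the object store
    reversible: bool                    # can the dataset be registered again later?


def removal_effect(operation, recursive):
    """Describe what "unregister" or "delete" removes (conceptual only)."""
    if operation not in ("unregister", "delete"):
        raise ValueError(f"unknown operation: {operation!r}")
    deletes = operation == "delete"
    return RemovalEffect(
        dataset_unregistered=True,  # both operations remove the catalog entry
        arrays_unregistered=recursive,  # assumption: only when recursive
        dataset_deleted_from_storage=deletes,
        arrays_deleted_from_storage=deletes and recursive,
        reversible=not deletes,  # deleted data cannot be registered again
    )


for operation in ("unregister", "delete"):
    for recursive in (False, True):
        print(operation, recursive, removal_effect(operation, recursive))
```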

You can delete the VCF dataset from the Settings tab, which will prompt you to choose between the two operations above.

A modal displaying the options when removing a VCF dataset.

You can also programmatically delete or unregister the VCF dataset as follows:

  • Python
import tiledb.cloud

# Unregister a VCF dataset (the data stays in the object store)
tiledb.cloud.asset.deregister(uri="tiledb://<account>/<vcf_name>")

# Delete a VCF dataset (this cannot be undone)
tiledb.cloud.asset.delete(uri="tiledb://<account>/<vcf_name>")