Run User-Defined Functions with VCF Datasets

life sciences
genomics (vcf)
tutorials
r
python
user-defined functions
Learn how to use user-defined functions to perform flexible analysis on TileDB.
How to run this tutorial

This tutorial runs only on TileDB Cloud. TileDB Cloud has a free tier, so you can sign up and run everything there without installing or deploying anything.

User-defined functions (UDFs) allow users to run custom code on TileDB Cloud. Together with task graphs, these provide a means of dispatching parallel workloads across distributed workers on TileDB Cloud.

In population genomics, TileDB Cloud UDFs often access information from TileDB-VCF datasets and their associated annotations. Some important points about UDFs:

  • You can create task graphs using UDFs and define the dependencies among them.
  • UDFs can be written in Python, R, or JavaScript. However, the TileDB-VCF API is limited to Python.
  • UDFs usually return results in Apache Arrow, pandas, or JSON.
  • R code can interact with Python-based UDFs using the Arrow/Feather format.
  • UDFs can be ad-hoc or registered. The latter enables easy code reusability.
  • The code in registered UDFs is visible within the TileDB Cloud UI console.
  • Registered UDFs can be shared with others in the TileDB Cloud UI console or programmatically.
  • Registered UDFs can remember stateful variables.
Tip

Before you dive into this section, you should go through the Catalog: User-Defined Functions and Key Concepts: User-Defined Functions sections.

Import the necessary libraries and set the URIs used in this tutorial. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for how to set your TileDB Cloud credentials in a configuration object. You can skip that step inside a TileDB Cloud notebook.

import os

import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledbvcf

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Public dataset URIs used in this tutorial
gnomad_uri = "tiledb://TileDB-Inc/gnomad-4_0-include-nopass"
variant_stats_uri = "tiledb://TileDB-Inc/6e6f9723-16f4-42eb-9ead-5d2bc6fba7cb"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
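
If `TILEDB_REST_TOKEN` is not set, the `os.environ[...]` lookup above fails with a bare `KeyError`. The following is a minimal sketch of a more descriptive lookup. The helper name `read_token` is hypothetical and is not part of any TileDB package.

```python
import os


def read_token(env=os.environ, name="TILEDB_REST_TOKEN"):
    """Return the REST API token stored in `env`, or raise a descriptive error.

    `read_token` is a hypothetical helper for this tutorial, not a TileDB API.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; create a REST API token in TileDB Cloud "
            "and export it before running this tutorial."
        )
    return token
```

With this helper, the login call would read `tiledb.cloud.login(token=read_token())`.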

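The following Python function queries the gnomAD dataset and the variant_stats array to return the global allele frequency and the internal allele frequency, respectively. Note that the import statements live inside the function, so that they are also available when the function executes remotely as a UDF on TileDB Cloud.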

  • Python
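def allelestats(
    allele: str,
    gnomad_uri: str,
    variant_stats_uri: str,
    result_format="arrow",
    verbose=True,
):
    """
    Return aggregate info about an allele from gnomAD and variant stats.

    The result is a pyarrow.Table ("arrow"), a JSON string ("json"),
    or a pandas DataFrame (any other value of result_format).
    """
    import pandas
    import pyarrow
    import tiledb
    import tiledb.cloud
    import tiledbvcf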

    # parse an allele string like chr1_12345_A_T (or chr1-12345-A-T) into its components
    (contig, pos, query_ref, query_alt) = allele.replace("-", "_").split("_")
    if not contig.startswith("chr"):
        contig = "chr" + contig

    region = [f"{contig}:{pos}-{pos}"]
    contigs = [contig]
    # the variant stats array is sliced at pos - 1; one is added back after the read
    slices = [slice(int(pos) - 1, int(pos) - 1)]

    gnomad_ds = tiledbvcf.Dataset(gnomad_uri, tiledb_config=tiledb.cloud.Config())
    gnomad_df = gnomad_ds.read(
        attrs=["contig", "info_AF", "pos_start", "alleles"], regions=region
    )

    def calc_af(df):
        df = df.groupby(["contig", "pos", "allele"], sort=True).sum()
        an = df.groupby(["contig", "pos"], sort=True).ac.sum().rename("an")
        df = df.join(an, how="inner")
        df["af"] = df.ac / df.an
        return df

    # fetch internal allele frequency stats
    vstat = tiledb.open(variant_stats_uri, ctx=tiledb.cloud.Ctx())
    vsdf = vstat.query(attrs=["ac", "allele", "n_hom"], dims=["contig", "pos"]).df[
        contigs, slices
    ]
    vsdf = calc_af(vsdf).reset_index()
    vsdf["pos"] = vsdf["pos"] + 1
    vsdf = vsdf.rename(columns={"af": "iaf"})

    # fetch gnomad allele frequency stats
    if gnomad_df.empty:
        # no gnomAD records: artificially fill in 0 for every gnomAD allele frequency
        gnomad_df = vsdf[["contig", "pos", "allele"]].copy()
        gnomad_df.loc[:, "gnomad_af"] = 0
        gnomad_df = gnomad_df.rename(columns={"allele": "alleles"})
    else:
        gnomad_df = gnomad_df.rename(columns={"pos_start": "pos"})
        gnomad_df["gnomad_af"] = gnomad_df["info_AF"].apply(
            lambda x: x[0]
        )  # take the first info_AF value (drops the list brackets)
    resultList = []

    iaf = vsdf[vsdf["allele"] == ",".join([query_ref, query_alt])]
    gaf = gnomad_df[gnomad_df["alleles"].apply(lambda x: query_alt in x)]

    # join internal and global allele frequency stats
    af = pandas.merge(iaf, gaf, on=["contig", "pos"], how="left")[
        ["contig", "pos", "allele", "iaf", "gnomad_af"]
    ]
    resultList += [af]

    result = resultList[0].reset_index()
    if result_format == "arrow":
        return pyarrow.Table.from_pandas(result)
    if result_format == "json":
        return result.to_json(orient="table", index=False)
    return result
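
The parsing step at the top of `allelestats` can be looked at in isolation. The sketch below reproduces it with the standard library only and adds a length check that the original does not have. The names `AlleleQuery` and `parse_allele` are hypothetical. The function subtracts one for the variant stats slice and adds it back after the read, which suggests that positions in that array are 0-based; treat that as an inference from the code above, not a documented guarantee.

```python
from typing import NamedTuple


class AlleleQuery(NamedTuple):
    contig: str        # contig name, always prefixed with "chr"
    pos: int           # position as written in the allele string
    ref: str           # reference allele
    alt: str           # alternate allele
    region: str        # "contig:pos-pos", the form passed to `regions=`
    zero_based: slice  # pos - 1, the form used to slice the variant stats array


def parse_allele(allele: str) -> AlleleQuery:
    """Split 'chr1_12345_A_T' (or 'chr1-12345-A-T') into query components."""
    parts = allele.replace("-", "_").split("_")
    if len(parts) != 4:
        raise ValueError(f"expected contig_pos_ref_alt, got {allele!r}")
    contig, pos_text, ref, alt = parts
    if not contig.startswith("chr"):
        contig = "chr" + contig
    pos = int(pos_text)
    return AlleleQuery(
        contig, pos, ref, alt, f"{contig}:{pos}-{pos}", slice(pos - 1, pos - 1)
    )


query = parse_allele("15-28120472-A-G")
print(query.region)
```

Failing early on a malformed allele string avoids spending a UDF execution on a query that cannot succeed.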

Run this function locally on a single nucleotide polymorphism (SNP) known to affect eye color (chr15:28120472A>G).

  • Python
as_res_local_arrow = allelestats(
    allele="chr15_28120472_A_G",
    gnomad_uri=gnomad_uri,
    variant_stats_uri=variant_stats_uri,
)
as_res_local_arrow.to_pandas()
   index contig       pos allele       iaf  gnomad_af
0      0  chr15  28120472    A,G  0.690361   0.486506
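
The `iaf` column comes from the nested `calc_af` helper: allele counts (`ac`) are summed per allele, the allele number (`an`) is the sum of `ac` over all alleles at the same position, and the frequency is `ac / an`. The following plain-Python sketch shows the same arithmetic. The function name `allele_frequencies` is hypothetical, and the counts are made up for illustration; they are not values from the dataset.

```python
from collections import defaultdict


def allele_frequencies(rows):
    """Compute af = ac / an per allele.

    rows: iterable of (contig, pos, allele, ac) tuples.
    Returns {(contig, pos, allele): af}. Assumes every position has a
    positive total count.
    """
    # sum allele counts per (contig, pos, allele)
    ac = defaultdict(int)
    for contig, pos, allele, count in rows:
        ac[(contig, pos, allele)] += count

    # allele number: total count over all alleles at each (contig, pos)
    an = defaultdict(int)
    for (contig, pos, _allele), count in ac.items():
        an[(contig, pos)] += count

    return {key: count / an[key[:2]] for key, count in ac.items()}


# made-up counts, for illustration only
rows = [
    ("chr1", 100, "A,G", 3),
    ("chr1", 100, "A,G", 1),
    ("chr1", 100, "A,T", 12),
    ("chr1", 200, "C,T", 5),
]
af = allele_frequencies(rows)
print(af[("chr1", 100, "A,G")])  # 4 / 16 = 0.25
```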

Run this same function with the same parameters as an ad-hoc (unregistered) UDF on TileDB Cloud.

  • Python
as_res_udf_arrow = tiledb.cloud.udf.exec(
    func=allelestats,
    allele="chr15_28120472_A_G",
    gnomad_uri=gnomad_uri,
    variant_stats_uri=variant_stats_uri,
    result_format="arrow",
)
as_res_udf_arrow.to_pandas()
   index contig       pos allele       iaf  gnomad_af
0      0  chr15  28120472    A,G  0.690361   0.486506
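
To query several alleles in parallel, the same function could be submitted as nodes of a task graph. The following is a sketch, not a tested recipe: it assumes the `tiledb.cloud.dag` interface (`DAG`, `submit` with a `name` keyword, `compute`, `wait`, and `result` on a node) behaves as described in the TileDB Cloud task graph documentation. The helper name `submit_allele_nodes` is hypothetical.

```python
def submit_allele_nodes(alleles, func, graph=None, **common_kwargs):
    """Add one node per allele to a task graph and return (graph, nodes).

    `submit_allele_nodes` is a hypothetical helper. If `graph` is None, a
    tiledb.cloud.dag.DAG is created; the import is deferred so that the
    TileDB Cloud client is only required when a real graph is needed.
    """
    if graph is None:
        from tiledb.cloud import dag  # assumption: TileDB Cloud client installed

        graph = dag.DAG()

    nodes = {}
    for allele in alleles:
        # the nodes do not depend on each other, so they can run in parallel
        nodes[allele] = graph.submit(
            func, allele=allele, name=f"allelestats-{allele}", **common_kwargs
        )
    return graph, nodes


# Intended usage on TileDB Cloud (not executed here):
#   graph, nodes = submit_allele_nodes(
#       ["chr15_28120472_A_G", "chr16_89820111_C_T"],
#       allelestats,
#       gnomad_uri=gnomad_uri,
#       variant_stats_uri=variant_stats_uri,
#   )
#   graph.compute()
#   graph.wait()
#   tables = {allele: node.result() for allele, node in nodes.items()}
```

Passing the graph in as a parameter keeps the helper easy to exercise with a stand-in object before spending compute on TileDB Cloud.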

Register the UDF on TileDB Cloud. This makes it available for later use and shareable with others.

  • Python
# Devise a name for the UDF
user_profile = tiledb.cloud.user_profile()
username = user_profile.username
udf_name = "allelestats"
udf_full_path = f"{username}/{udf_name}"

# Register the UDF to TileDB Cloud
tiledb.cloud.udf.register_generic_udf(
    func=allelestats,
    name=udf_name,
    namespace=user_profile.username,
    include_source_lines=True,
)

You can find your registered UDF under Assets -> Code -> UDFs.

A screenshot of the Preview tab of a registered UDF in TileDB Cloud, showing how you can view the code of a UDF.

Run the registered UDF, this time on a SNP known to affect hair color (chr16:89820111C>T). Even though the UDF is written in Python, both Python and R can read the result.

  • Python
reg_res = tiledb.cloud.udf.exec(
    f"{username}/{udf_name}",
    allele="chr16_89820111_C_T",
    gnomad_uri=gnomad_uri,
    variant_stats_uri=variant_stats_uri,
    resource_class="large",
)
reg_res.to_pandas()
   index contig       pos allele       iaf  gnomad_af
0      0  chr16  89820111    C,T  0.555976   0.168875

  • R
library(tiledbcloud)

udf_name <- "allelestats"
username <- tiledbcloud::user_profile()$username
udf_full_path <- paste(username, udf_name, sep = "/")

args <- list()
args$allele <- "chr16_89820111_C_T"
args$gnomad_uri <- "tiledb://TileDB-Inc/gnomad-4_0-include-nopass"
args$variant_stats_uri <-
  "tiledb://TileDB-Inc/6e6f9723-16f4-42eb-9ead-5d2bc6fba7cb"

res <- tiledbcloud::execute_generic_udf(
  registered_udf_name = udf_full_path,
  args = args,
  namespace = username,
  resource_class = "large",
  result_format = "arrow",
  args_format = "native",
  language = "python"
)
data.frame(res)
A data.frame: 1 x 6
index contig pos      allele iaf       gnomad_af
<int> <chr>  <int>    <chr>  <dbl>     <dbl>
0     chr16  89820111 C,T    0.5559758 0.168875
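
The examples above use the Arrow format. `allelestats` also accepts `result_format="json"`, in which case it returns the string produced by pandas `to_json(orient="table", index=False)`. That layout holds a `schema` object describing the fields and a `data` array with one object per row. Below is a minimal sketch of reading such a string with the standard library, assuming the string reaches the client unchanged. The helper name `table_json_to_rows` is hypothetical, and the payload is hand-written to mirror the row shown above; it was not captured from a real run.

```python
import json

# Hand-written payload in the pandas orient="table" layout (illustrative only).
payload = json.dumps(
    {
        "schema": {
            "fields": [
                {"name": "index", "type": "integer"},
                {"name": "contig", "type": "string"},
                {"name": "pos", "type": "integer"},
                {"name": "allele", "type": "string"},
                {"name": "iaf", "type": "number"},
                {"name": "gnomad_af", "type": "number"},
            ]
        },
        "data": [
            {
                "index": 0,
                "contig": "chr16",
                "pos": 89820111,
                "allele": "C,T",
                "iaf": 0.555976,
                "gnomad_af": 0.168875,
            }
        ],
    }
)


def table_json_to_rows(text):
    """Return (column_names, rows) from an orient="table" JSON string."""
    doc = json.loads(text)
    columns = [field["name"] for field in doc["schema"]["fields"]]
    return columns, doc["data"]


columns, rows = table_json_to_rows(payload)
print(columns)
print(rows[0]["allele"], rows[0]["iaf"])
```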