Run User-Defined Functions with VCF Datasets

Learn how to use user-defined functions to perform flexible analysis on TileDB.

How to run this tutorial

This tutorial runs only on TileDB Cloud, which offers a free tier. We strongly recommend that you sign up and run everything there, because that requires no installation or deployment.

User-defined functions (UDFs) allow users to run custom code on TileDB Cloud. Together with task graphs, these provide a means of dispatching parallel workloads across distributed workers on TileDB Cloud.

In population genomics, TileDB Cloud UDFs often read from TileDB-VCF datasets and their associated annotations. Some important points about UDFs:

You can create task graphs using UDFs and define the dependencies among them.
UDFs can be written in Python, R, or JavaScript. However, the TileDB-VCF API is limited to Python.
UDFs usually return results in Apache Arrow, pandas, or JSON.
R code can interact with Python-based UDFs using the Arrow/Feather format.
UDFs can be ad-hoc or registered. The latter enables easy code reusability.
The code in registered UDFs is visible within the TileDB Cloud UI console.
Registered UDFs can be shared with others in the TileDB Cloud UI console or programmatically.
Registered UDFs can remember stateful variables.
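
To make the idea of dependencies among UDFs concrete, here is a plain-Python sketch that does not use the TileDB Cloud task graph API. It orders three hypothetical stages (the stage names are assumptions made for illustration) with the standard-library graphlib module and groups them into batches that could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical stages; each maps to the set of stages that must finish first.
dependencies = {
    "query_gnomad": set(),
    "query_variant_stats": set(),
    "merge_frequencies": {"query_gnomad", "query_variant_stats"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    # stages whose dependencies have all completed
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)
# [['query_gnomad', 'query_variant_stats'], ['merge_frequencies']]
```

The two query stages have no dependencies, so they land in the same batch; the merge stage waits for both.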

Tip

Before you dive into this section, you should go through the Catalog: User-Defined Functions and Key Concepts: User-Defined Functions sections.

Import the necessary libraries and set the URIs used in this tutorial. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for how to set your TileDB Cloud credentials in a configuration object (you can skip this step inside a TileDB Cloud notebook).

import os

import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledbvcf

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Public dataset URIs used in this tutorial
gnomad_uri = "tiledb://TileDB-Inc/gnomad-4_0-include-nopass"
variant_stats_uri = "tiledb://TileDB-Inc/6e6f9723-16f4-42eb-9ead-5d2bc6fba7cb"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
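
Reading os.environ["TILEDB_REST_TOKEN"] directly raises a KeyError when the variable is not set. As a minimal sketch, the hypothetical helper below (login_kwargs is not part of the TileDB API) collects the same environment variables used above and fails with a clearer message when none are present.

```python
import os


def login_kwargs(env=os.environ):
    """Build keyword arguments for a login call from environment variables."""
    token = env.get("TILEDB_REST_TOKEN")
    if token:
        return {"token": token}
    username = env.get("TILEDB_USERNAME")
    password = env.get("TILEDB_PASSWORD")
    if username and password:
        return {"username": username, "password": password}
    raise RuntimeError(
        "Set TILEDB_REST_TOKEN, or both TILEDB_USERNAME and TILEDB_PASSWORD."
    )


# Example with an explicit mapping instead of the real environment
print(login_kwargs({"TILEDB_REST_TOKEN": "example-token"}))
```

With credentials set in the environment, the result can be passed on as tiledb.cloud.login(**login_kwargs()).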

The following Python function queries the gnomAD dataset and the variant_stats array to return the global allele frequency and the internal allele frequency, respectively. Note that the import statements live inside the function, so the modules are imported wherever the function runs.

Python

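def allelestats(
    allele: str,
    gnomad_uri: str,
    variant_stats_uri: str,
    result_format="arrow",
    verbose=True,
):
    """
    Return aggregate info about an allele from gnomAD and variant stats.

    The result is a pyarrow.Table ("arrow"), a JSON string ("json"),
    or a pandas.DataFrame (any other result_format).
    """
    import pandas
    import pyarrow
    import tiledb
    import tiledb.cloud
    import tiledbvcf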

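    # parse an allele string like chr1_12345_A_T into its components
    # ("-" is also accepted as a separator)
    (contig, pos, query_ref, query_alt) = allele.replace("-", "_").split("_")
    if not contig.startswith("chr"):
        contig = "chr" + contig

    region = [f"{contig}:{pos}-{pos}"]
    contigs = [contig]
    # the variant stats array is queried at pos - 1; 1 is added back after the query
    slices = [slice(int(pos) - 1, int(pos) - 1)]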

    gnomad_ds = tiledbvcf.Dataset(gnomad_uri, tiledb_config=tiledb.cloud.Config())
    gnomad_df = gnomad_ds.read(
        attrs=["contig", "info_AF", "pos_start", "alleles"], regions=region
    )

    def calc_af(df):
        df = df.groupby(["contig", "pos", "allele"], sort=True).sum()
        an = df.groupby(["contig", "pos"], sort=True).ac.sum().rename("an")
        df = df.join(an, how="inner")
        df["af"] = df.ac / df.an
        return df

    # fetch internal allele frequency stats
    vstat = tiledb.open(variant_stats_uri, ctx=tiledb.cloud.Ctx())
    vsdf = vstat.query(attrs=["ac", "allele", "n_hom"], dims=["contig", "pos"]).df[
        contigs, slices
    ]
    vsdf = calc_af(vsdf).reset_index()
    vsdf["pos"] = vsdf["pos"] + 1
    vsdf = vsdf.rename(columns={"af": "iaf"})

    # fetch gnomad allele frequency stats
    if gnomad_df.empty:
        # artificially fill this in as all 0 gnomad allele frequencies
        gnomad_df = vsdf[["contig", "pos", "allele"]].copy()
        gnomad_df.loc[:, "gnomad_af"] = 0
        gnomad_df = gnomad_df.rename(columns={"allele": "alleles"})
    else:
        gnomad_df = gnomad_df.rename(columns={"pos_start": "pos"})
        gnomad_df["gnomad_af"] = gnomad_df["info_AF"].apply(
            lambda x: x[0]
        )  # remove brackets
    # keep only the queried allele in each table
    iaf = vsdf[vsdf["allele"] == ",".join([query_ref, query_alt])]
    gaf = gnomad_df[gnomad_df["alleles"].apply(lambda x: query_alt in x)]

    # join internal and global allele frequency stats
    af = pandas.merge(iaf, gaf, on=["contig", "pos"], how="left")[
        ["contig", "pos", "allele", "iaf", "gnomad_af"]
    ]

    result = af.reset_index()
    if result_format == "arrow":
        return pyarrow.Table.from_pandas(result)
    if result_format == "json":
        return result.to_json(orient="table", index=False)
    return result
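
The allele-parsing step at the top of the function can be checked in isolation, without any TileDB access. The sketch below mirrors that logic in a hypothetical helper (parse_allele and its dictionary keys are names chosen for illustration) and shows the region string and the shifted position that the two queries use.

```python
def parse_allele(allele):
    """Split 'chr15_28120472_A_G' (or '15-28120472-A-G') into query components."""
    contig, pos, ref, alt = allele.replace("-", "_").split("_")
    if not contig.startswith("chr"):
        contig = "chr" + contig
    pos = int(pos)
    return {
        "contig": contig,
        "ref": ref,
        "alt": alt,
        # region string passed to the TileDB-VCF read
        "region": f"{contig}:{pos}-{pos}",
        # the function subtracts 1 before slicing the variant stats array
        "stats_pos": pos - 1,
        # allele label compared against the variant stats "allele" attribute
        "stats_allele": f"{ref},{alt}",
    }


parsed = parse_allele("15-28120472-A-G")
print(parsed["region"])     # chr15:28120472-28120472
print(parsed["stats_pos"])  # 28120471
```

Because the string is split on underscores, a contig name that itself contains an underscore would not unpack into four parts and would raise a ValueError.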

Run this function locally on a single nucleotide polymorphism (SNP) known to affect eye color (chr15:28120472A>G).

Python

as_res_local_arrow = allelestats(
    allele="chr15_28120472_A_G",
    gnomad_uri=gnomad_uri,
    variant_stats_uri=variant_stats_uri,
)
as_res_local_arrow.to_pandas()

	index	contig	pos	allele	iaf	gnomad_af
0	0	chr15	28120472	A,G	0.690361	0.486506
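
The iaf column comes from calc_af, which sums the allele count (ac) per allele, sums it again per position to get a total (an), and divides the two. The sketch below reproduces that arithmetic with the standard library only; the rows are hypothetical values chosen to make the result easy to verify, not data from the arrays above.

```python
from collections import defaultdict

# Hypothetical rows of (contig, pos, allele, ac); these are not real data.
rows = [
    ("chr1", 100, "A,G", 40),
    ("chr1", 100, "A,G", 20),
    ("chr1", 100, "A,T", 10),
    ("chr1", 100, "A,C", 30),
]

# Step 1: sum ac per (contig, pos, allele), like the first groupby
ac = defaultdict(int)
for contig, pos, allele, count in rows:
    ac[(contig, pos, allele)] += count

# Step 2: sum ac per (contig, pos) to get the total (an)
an = defaultdict(int)
for (contig, pos, _allele), count in ac.items():
    an[(contig, pos)] += count

# Step 3: af = ac / an for each allele
af = {key: count / an[key[:2]] for key, count in ac.items()}

print(af[("chr1", 100, "A,G")])  # 0.6
```

The frequencies at one position always sum to 1, because each is a share of the same total.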

Run the same function with the same parameters as an ad-hoc (unregistered) UDF on TileDB Cloud.

Python

as_res_udf_arrow = tiledb.cloud.udf.exec(
    func=allelestats,
    allele="chr15_28120472_A_G",
    gnomad_uri=gnomad_uri,
    variant_stats_uri=variant_stats_uri,
    result_format="arrow",
)
as_res_udf_arrow.to_pandas()

	index	contig	pos	allele	iaf	gnomad_af
0	0	chr15	28120472	A,G	0.690361	0.486506
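
When result_format is "json", the function returns the string produced by to_json(orient="table", index=False). The pandas table orient wraps the rows in an object with a "schema" key and a "data" key. The sketch below parses a hand-written, simplified payload of that shape with the standard library; the payload is an assumption built from the values shown above (real output carries additional fields), not a captured response.

```python
import json

# Hypothetical, simplified payload in the orient="table" layout.
payload = json.dumps(
    {
        "schema": {
            "fields": [
                {"name": "contig", "type": "string"},
                {"name": "pos", "type": "integer"},
                {"name": "allele", "type": "string"},
                {"name": "iaf", "type": "number"},
                {"name": "gnomad_af", "type": "number"},
            ]
        },
        "data": [
            {
                "contig": "chr15",
                "pos": 28120472,
                "allele": "A,G",
                "iaf": 0.690361,
                "gnomad_af": 0.486506,
            }
        ],
    }
)

doc = json.loads(payload)
# column order comes from the schema; rows come from the data list
columns = [field["name"] for field in doc["schema"]["fields"]]
records = [tuple(row[name] for name in columns) for row in doc["data"]]

print(columns)
print(records)
```

Reading the column order from the schema keeps the client independent of the key order inside each row object.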

Register the function as a UDF in your TileDB Cloud namespace so that you can call it by name.

Python

# Devise a name for the UDF
user_profile = tiledb.cloud.user_profile()
username = user_profile.username
udf_name = "allelestats"
udf_full_path = f"{username}/{udf_name}"

# Register the UDF to TileDB Cloud
tiledb.cloud.udf.register_generic_udf(
    func=allelestats,
    name=udf_name,
    namespace=user_profile.username,
    include_source_lines=True,
)

You can find your registered UDF under Assets -> Code -> UDFs.

Run the registered UDF, this time on a SNP known to affect hair color (chr16:89820111C>T). Even though the UDF is written in Python, both Python and R can read the result.

Python

reg_res = tiledb.cloud.udf.exec(
    udf_full_path,
    allele="chr16_89820111_C_T",
    gnomad_uri=gnomad_uri,
    variant_stats_uri=variant_stats_uri,
    resource_class="large",
)
reg_res.to_pandas()

	index	contig	pos	allele	iaf	gnomad_af
0	0	chr16	89820111	C,T	0.555976	0.168875

R

library(tiledbcloud)

udf_name <- "allelestats"
username <- tiledbcloud::user_profile()$username
udf_full_path <- paste(username, udf_name, sep = "/")

args <- list()
args$allele <- "chr16_89820111_C_T"
args$gnomad_uri <- "tiledb://TileDB-Inc/gnomad-4_0-include-nopass"
args$variant_stats_uri <-
  "tiledb://TileDB-Inc/6e6f9723-16f4-42eb-9ead-5d2bc6fba7cb"

res <- tiledbcloud::execute_generic_udf(
  registered_udf_name = udf_full_path,
  args = args,
  namespace = username,
  resource_class = "large",
  result_format = "arrow",
  args_format = "native",
  language = "python"
)
data.frame(res)

A data.frame: 1 x 6
index	contig	pos	allele	iaf	gnomad_af
<int>	<chr>	<int>	<chr>	<dbl>	<dbl>
0	chr16	89820111	C,T	0.5559758	0.168875