Use User-Defined Functions with Array Data

User-defined functions give you the ability to run code inside the secure infrastructure of TileDB.
How to run this tutorial

You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier, so we strongly recommend signing up and running everything there; no installation or deployment is required.

This tutorial shows how to use array user-defined functions (UDFs) on TileDB Cloud. It assumes you have already completed the Catalog: UDFs section.

TileDB Cloud supports two types of array UDFs:

  1. Single-array UDFs: These are applied to a single array.
  2. Multi-array UDFs: These are applied to an arbitrary number of arrays.

TileDB also supports a third type of UDF, called a generic UDF: arbitrary code that doesn't apply to any array specifically (unless you make array calls inside it). The benefit of array UDFs over generic UDFs is that TileDB Cloud does not charge array egress for the arrays you specify as inputs to an array UDF, whereas array calls made inside a generic UDF do incur egress charges. For more information on array UDFs, visit the Key Concepts: User-Defined Functions section.
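
For comparison, the following is a minimal sketch of a generic UDF executed with tiledb.cloud.udf.exec. The function and its argument are purely illustrative, and the sketch assumes you have already logged in with tiledb.cloud.login (as shown in the code later in this tutorial).

  • Python
# A generic UDF is arbitrary code with no array inputs.
def double(x):
    return 2 * x


# Ship the function and its argument to TileDB Cloud, run it
# server-side, and return the result to the client (here, 42).
result = tiledb.cloud.udf.exec(double, 21)
print(result)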

Single-array UDFs

First, import the necessary libraries, set the array and UDF URIs for TileDB Cloud, and delete any previously created arrays and UDFs with the same name. Some things to note:

  • You need to generate a REST API token on TileDB Cloud to authenticate yourself.
  • You need to set the S3 bucket, to which you have already granted TileDB Cloud access by providing your AWS credentials.
  • TileDB Cloud stores registered UDFs physically on S3, in the bucket and path you provided in your profile settings.
  • TileDB Cloud models registered UDFs as arrays as well.
  • Python
# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.cloud

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set array URI
array_name = "single_array_udf"
array_uri = "tiledb://" + tiledb_account + "/" + array_name

# Set the UDF URI
udf_name = "median_single_array_py"
account_udf_name = tiledb_account + "/" + udf_name
udf_uri = "tiledb://" + account_udf_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the array
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)

Next, create a dense array by specifying its schema (the case of sparse arrays is similar). The only difference between TileDB Cloud and TileDB Open Source when creating and registering arrays is that the TileDB Cloud URI should be of the form tiledb://<account>/s3://<bucket>/<array_name>. TileDB Cloud understands that you are trying to create an array at the S3 URI s3://<bucket>/<array_name> and register it under <account>. After you create and register the array, you can access it as tiledb://<account>/<array_name>.

  • Python
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array.
sch = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# Create and register the array on TileDB Cloud
array_uri_reg = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name
tiledb.Array.create(array_uri_reg, sch, ctx=cloud_ctx)

Populate the array by using a 2D NumPy array. Observe that the array URI now uses the form tiledb://<account>/<array_name>.

  • Python
# Prepare some data in a NumPy array
data = np.array(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.int32
)

# Write data to the array
with tiledb.open(array_uri, "w", ctx=cloud_ctx) as A:
    A[:] = data

Create a UDF that takes as input an array slice and computes the median value on attribute a.

  • Python
# Define the UDF
# The input is the array slice results as an OrderedDict
def median(data):
    import numpy

    return numpy.median(data["a"])

Apply the UDF to the TileDB Cloud array.

  • Python
# The "apply" function takes as input the function, an array slice
# and any attribute subset, and passes to the function the result of
# that TileDB query, i.e., A.query(attrs=["a"])[1:2, 1:2]
with tiledb.open(array_uri, ctx=cloud_ctx) as A:
    results = A.apply(median, [(1, 2), (1, 2)], attrs=["a"])
    print(results)
3.5
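
As a sanity check, you can reproduce this result locally from the NumPy data you wrote above. The TileDB ranges [(1, 2), (1, 2)] are 1-based and inclusive, so they select the top-left 2×2 block of the array, which corresponds to the 0-based NumPy slice data[0:2, 0:2].

  • Python
# Compute the same median locally on the top-left 2x2 block,
# i.e., the values [[1, 2], [5, 6]]. This prints 3.5.
print(np.median(data[0:2, 0:2]))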

You can register the UDF with TileDB Cloud, which will allow you to browse it as part of your Assets in the TileDB Cloud UI, as well as call it using a TileDB Cloud name (in the form of tiledb://<account_name>/<udf_name>).

  • Python
# Register the UDF
tiledb.cloud.udf.register_single_array_udf(
    median, name=udf_name, namespace=tiledb_account
)

Once you register your UDF, you can apply it as follows.

  • Python
# Call a registered UDF
tiledb.cloud.array.apply(array_uri, account_udf_name, [(1, 2), (1, 2)], attrs=["a"])
3.5

Clean up in the end by deleting the array and UDF. Observe that the standard TileDB object management functions work directly with tiledb:// URIs (that is, TileDB Cloud arrays). Also note that TileDB models a UDF as an array and, thus, you can delete it in the same way as an array.

  • Python
# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the array
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)
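
To confirm the deletions, you can use another standard object management function with the same tiledb:// URIs. The following is a small sketch using tiledb.object_type, which returns "array", "group", or None depending on what exists at a given URI; the exact behavior for deleted TileDB Cloud URIs is an assumption here.

  • Python
# `tiledb.object_type` also accepts tiledb:// URIs when given a
# TileDB Cloud context. After the deletions above, both calls are
# expected to return None.
with tiledb.scope_ctx(cloud_ctx):
    print(tiledb.object_type(array_uri))
    print(tiledb.object_type(udf_uri))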

Multi-array UDFs

First, import the necessary libraries, set the array and UDF URIs for TileDB Cloud, and delete any previously created arrays and UDFs with the same name.

  • Python
# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.cloud

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set array URIs
array_name_1 = "multi_array_udf_py_1"
array_name_2 = "multi_array_udf_py_2"
array_uri_1 = "tiledb://" + tiledb_account + "/" + array_name_1
array_uri_2 = "tiledb://" + tiledb_account + "/" + array_name_2

# Set the UDF URI
udf_name = "addition_single_array_py"
account_udf_name = tiledb_account + "/" + udf_name
udf_uri = "tiledb://" + account_udf_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete arrays and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the arrays
    if tiledb.array_exists(array_uri_1):
        tiledb.Array.delete_array(array_uri_1)
    if tiledb.array_exists(array_uri_2):
        tiledb.Array.delete_array(array_uri_2)

This time, you need to create two arrays. The following creates two dense arrays with an identical schema (although any other arrays could be used here).

  • Python
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array.
sch = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# Create and register the arrays on TileDB Cloud
array_uri_reg_1 = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name_1
array_uri_reg_2 = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name_2
tiledb.Array.create(array_uri_reg_1, sch, ctx=cloud_ctx)
tiledb.Array.create(array_uri_reg_2, sch, ctx=cloud_ctx)

Populate the arrays.

  • Python
# Prepare some data in a NumPy array
data_1 = np.array(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.int32
)
data_2 = np.array(
    [
        [100, 200, 300, 400],
        [500, 600, 700, 800],
        [900, 1000, 1100, 1200],
        [1300, 1400, 1500, 1600],
    ],
    dtype=np.int32,
)

# Write data to the arrays
with tiledb.open(array_uri_1, "w", ctx=cloud_ctx) as A:
    A[:] = data_1
with tiledb.open(array_uri_2, "w", ctx=cloud_ctx) as A:
    A[:] = data_2

Create a UDF that takes two arrays as input and adds their values on attribute a.

  • Python
def addition(data):
    # With multiple arrays, the input is a list of OrderedDicts,
    # one per array, in the same order as the arrays you requested.
    return data[0]["a"] + data[1]["a"]

Apply the multi-array UDF to the TileDB Cloud arrays.

  • Python
# Create the list of arrays that will take part in the
# multi-array UDF. Each entry specifies the array URI,
# a multi-index for slicing, and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()
array_list.add(array_uri_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_uri_2, [(1, 4), (1, 4)], ["a"])

# This executes `addition` using as input the result of the
# slicing and attribute subselection for each of the arrays
# in `array_list`
result = tiledb.cloud.array.exec_multi_array_udf(addition, array_list)

print(result)
[[ 101  202  303  404]
 [ 505  606  707  808]
 [ 909 1010 1111 1212]
 [1313 1414 1515 1616]]

You can register the UDF with TileDB Cloud, which will allow you to browse it as part of your Assets in the TileDB Cloud UI, as well as call it using a TileDB Cloud name (in the form of tiledb://<account_name>/<udf_name>).

  • Python
# Register the UDF
tiledb.cloud.udf.register_multi_array_udf(
    addition, name=udf_name, namespace=tiledb_account
)

Once you register your UDF, you can apply it as follows.

  • Python
# Call a registered UDF
tiledb.cloud.array.exec_multi_array_udf(account_udf_name, array_list)
array([[ 101,  202,  303,  404],
       [ 505,  606,  707,  808],
       [ 909, 1010, 1111, 1212],
       [1313, 1414, 1515, 1616]], dtype=int32)

Clean up in the end by deleting the arrays and UDF.

  • Python
# Delete arrays and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the arrays
    if tiledb.array_exists(array_uri_1):
        tiledb.Array.delete_array(array_uri_1)
    if tiledb.array_exists(array_uri_2):
        tiledb.Array.delete_array(array_uri_2)