Use User-Defined Functions with Array Data

User-defined functions give you the ability to run code inside the secure infrastructure of TileDB.
How to run this tutorial

You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier, so we strongly recommend signing up and running everything there; no installation or deployment is required.

This tutorial shows how to use array user-defined functions (UDFs) on TileDB Cloud. It assumes you have already completed the Catalog: UDFs section.

TileDB Cloud supports two types of array UDFs:

  1. Single-array UDFs: These are applied to a single array.
  2. Multi-array UDFs: These are applied to an arbitrary number of arrays.

TileDB also supports a third type of UDF, called a generic UDF: arbitrary code that doesn't apply to any array specifically (unless you make array calls inside it). The benefit of array UDFs over generic UDFs is that TileDB Cloud does not charge array egress for the arrays you specify as inputs to an array UDF, whereas array calls made inside a generic UDF do incur egress charges. For more information on array UDFs, visit the Key Concepts: User-Defined Functions section.
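
For comparison, the following is a minimal sketch of a generic UDF executed with tiledb.cloud.udf.exec. The function and its argument are purely illustrative, and the sketch assumes you have already logged in with tiledb.cloud.login (as shown in the code later in this tutorial).

  • Python
# A generic UDF is arbitrary code with no array inputs.
def double(x):
    return 2 * x


# Ship the function and its argument to TileDB Cloud, run it
# server-side, and return the result to the client (here, 42).
result = tiledb.cloud.udf.exec(double, 21)
print(result)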

Single-array UDFs

First, import the necessary libraries, set the array and UDF URIs for TileDB Cloud, and delete any previously created arrays and UDFs with the same name. Some things to note:

  • You need to generate a REST API token on TileDB Cloud to authenticate yourself.
  • You need to set the S3 bucket, to which you have already granted TileDB Cloud access by providing your AWS credentials.
  • TileDB Cloud stores registered UDFs physically on S3, in the bucket and path you provided in your profile settings.
  • TileDB Cloud models registered UDFs as arrays as well.
  • Python
# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.cloud

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set array URI
array_name = "single_array_udf"
array_uri = "tiledb://" + tiledb_account + "/" + array_name

# Set the UDF URI
udf_name = "median_single_array_py"
account_udf_name = tiledb_account + "/" + udf_name
udf_uri = "tiledb://" + account_udf_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the array
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)

Next, create a dense array by specifying its schema (the case of sparse arrays is similar). The only difference between TileDB Cloud and TileDB Open Source when creating and registering arrays is that the TileDB Cloud URI should be of the form tiledb://<account>/s3://<bucket>/<array_name>. TileDB Cloud understands that you are trying to create an array at the S3 URI s3://<bucket>/<array_name> and register it under <account>. After you create and register the array, you can access it as tiledb://<account>/<array_name>.

  • Python
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array.
sch = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# Create and register the array on TileDB Cloud
array_uri_reg = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name
tiledb.Array.create(array_uri_reg, sch, ctx=cloud_ctx)

Populate the array by using a 2D NumPy array. Observe that the array URI now uses the form tiledb://<account>/<array_name>.

  • Python
# Prepare some data in a NumPy array
data = np.array(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.int32
)

# Write data to the array
with tiledb.open(array_uri, "w", ctx=cloud_ctx) as A:
    A[:] = data

Create a UDF that takes as input an array slice and computes the median value on attribute a.

  • Python
# Define the UDF
# The input is the array slice results as an OrderedDict
def median(data):
    import numpy

    return numpy.median(data["a"])

Apply the UDF to the TileDB Cloud array.

  • Python
# The "apply" function takes as input the function, an array slice
# and any attribute subset, and passes to the function the result of
# that TileDB query, i.e., A.query(attrs=["a"])[1:2, 1:2]
with tiledb.open(array_uri, ctx=cloud_ctx) as A:
    results = A.apply(median, [(1, 2), (1, 2)], attrs=["a"])
    print(results)
3.5
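
As a sanity check, you can reproduce this result locally from the NumPy data you wrote above. The TileDB ranges [(1, 2), (1, 2)] are 1-based and inclusive, so they select the top-left 2×2 block of the array, which corresponds to the 0-based NumPy slice data[0:2, 0:2].

  • Python
# Compute the same median locally on the top-left 2x2 block,
# i.e., the values [[1, 2], [5, 6]]. This prints 3.5.
print(np.median(data[0:2, 0:2]))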

You can register the UDF with TileDB Cloud, which will allow you to browse it as part of your Assets in the TileDB Cloud UI, as well as call it using a TileDB Cloud name (in the form of tiledb://<account_name>/<udf_name>).

  • Python
# Register the UDF
tiledb.cloud.udf.register_single_array_udf(
    median, name=udf_name, namespace=tiledb_account
)

Once you register your UDF, you can apply it as follows.

  • Python
# Call a registered UDF
tiledb.cloud.array.apply(array_uri, account_udf_name, [(1, 2), (1, 2)], attrs=["a"])
3.5

Clean up in the end by deleting the array and UDF. Observe that the standard TileDB object management functions work directly with tiledb:// URIs (that is, TileDB Cloud arrays). Also note that TileDB models a UDF as an array and, thus, you can delete it in the same way as an array.

  • Python
# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the array
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)
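
To confirm the deletions, you can use another standard object management function with the same tiledb:// URIs. The following is a small sketch using tiledb.object_type, which returns "array", "group", or None depending on what exists at a given URI; the exact behavior for deleted TileDB Cloud URIs is an assumption here.

  • Python
# `tiledb.object_type` also accepts tiledb:// URIs when given a
# TileDB Cloud context. After the deletions above, both calls are
# expected to return None.
with tiledb.scope_ctx(cloud_ctx):
    print(tiledb.object_type(array_uri))
    print(tiledb.object_type(udf_uri))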

Multi-array UDFs

First, import the necessary libraries, set the array and UDF URIs for TileDB Cloud, and delete any previously created arrays and UDFs with the same name.

  • Python
# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.cloud

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set array URIs
array_name_1 = "multi_array_udf_py_1"
array_name_2 = "multi_array_udf_py_2"
array_uri_1 = "tiledb://" + tiledb_account + "/" + array_name_1
array_uri_2 = "tiledb://" + tiledb_account + "/" + array_name_2

# Set the UDF URI
udf_name = "addition_single_array_py"
account_udf_name = tiledb_account + "/" + udf_name
udf_uri = "tiledb://" + account_udf_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete arrays and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the arrays
    if tiledb.array_exists(array_uri_1):
        tiledb.Array.delete_array(array_uri_1)
    if tiledb.array_exists(array_uri_2):
        tiledb.Array.delete_array(array_uri_2)

This time, you need to create two arrays. The following creates two dense arrays with an identical schema (although any other arrays could be used here).

  • Python
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array.
sch = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# Create and register the arrays on TileDB Cloud
array_uri_reg_1 = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name_1
array_uri_reg_2 = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name_2
tiledb.Array.create(array_uri_reg_1, sch, ctx=cloud_ctx)
tiledb.Array.create(array_uri_reg_2, sch, ctx=cloud_ctx)

Populate the arrays.

  • Python
# Prepare some data in a NumPy array
data_1 = np.array(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.int32
)
data_2 = np.array(
    [
        [100, 200, 300, 400],
        [500, 600, 700, 800],
        [900, 1000, 1100, 1200],
        [1300, 1400, 1500, 1600],
    ],
    dtype=np.int32,
)

# Write data to the arrays
with tiledb.open(array_uri_1, "w", ctx=cloud_ctx) as A:
    A[:] = data_1
with tiledb.open(array_uri_2, "w", ctx=cloud_ctx) as A:
    A[:] = data_2

Create a UDF that takes two arrays as input and adds their values on attribute a.

  • Python
def addition(data):
    # With multiple arrays, the input is a list of OrderedDicts,
    # one per array, in the same order as the arrays you requested.
    return data[0]["a"] + data[1]["a"]

Apply the multi-array UDF to the TileDB Cloud arrays.

  • Python
# Create the list of arrays that will take part in the
# multi-array UDF. Each entry specifies the array URI,
# a multi-index for slicing, and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()
array_list.add(array_uri_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_uri_2, [(1, 4), (1, 4)], ["a"])

# This executes `addition` using as input the result of the
# slicing and attribute subselection for each of the arrays
# in `array_list`
result = tiledb.cloud.array.exec_multi_array_udf(addition, array_list)

print(result)
[[ 101  202  303  404]
 [ 505  606  707  808]
 [ 909 1010 1111 1212]
 [1313 1414 1515 1616]]

You can register the UDF with TileDB Cloud, which will allow you to browse it as part of your Assets in the TileDB Cloud UI, as well as call it using a TileDB Cloud name (in the form of tiledb://<account_name>/<udf_name>).

  • Python
# Register the UDF
tiledb.cloud.udf.register_multi_array_udf(
    addition, name=udf_name, namespace=tiledb_account
)

Once you register your UDF, you can apply it as follows.

  • Python
# Call a registered UDF
tiledb.cloud.array.exec_multi_array_udf(account_udf_name, array_list)
array([[ 101,  202,  303,  404],
       [ 505,  606,  707,  808],
       [ 909, 1010, 1111, 1212],
       [1313, 1414, 1515, 1616]], dtype=int32)

Clean up in the end by deleting the arrays and UDF.

  • Python
# Delete arrays and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the arrays
    if tiledb.array_exists(array_uri_1):
        tiledb.Array.delete_array(array_uri_1)
    if tiledb.array_exists(array_uri_2):
        tiledb.Array.delete_array(array_uri_2)