User-defined functions give you the ability to run code inside the secure infrastructure of TileDB Cloud.
How to run this tutorial
This tutorial runs only on TileDB Cloud. TileDB Cloud has a free tier, so we strongly recommend that you sign up and run everything there; doing so requires no installation or deployment.
This tutorial shows how to use array user-defined functions (UDFs) on TileDB Cloud. It assumes you have already completed the Catalog: UDFs section.
TileDB Cloud supports two types of array UDFs:
Single-array UDFs: These are applied to a single array.
Multi-array UDFs: These are applied to an arbitrary number of arrays.
TileDB Cloud also supports a third type of UDF, the generic UDF, which runs arbitrary code that doesn't apply to any specific array (unless you make array calls inside it). The benefit of array UDFs over generic UDFs is that TileDB Cloud does not charge egress for the arrays specified as inputs to an array UDF, whereas array calls made inside a generic UDF do incur egress charges. For more information on array UDFs, visit the Key Concepts: User-Defined Functions section.
Single-array UDFs
First, import the necessary libraries, set the array and UDF URIs for TileDB Cloud, and delete any previously created arrays and UDFs with the same name. Some things to note:
You need to generate a REST API token on TileDB Cloud to authenticate yourself.
You need to set the S3 bucket to one that TileDB Cloud can already access through the AWS credentials you provided.
TileDB Cloud stores registered UDFs physically on S3, in the bucket and path you provided in your profile settings.
TileDB Cloud models registered UDFs as arrays as well.
# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.cloud

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the bucket from the environment variables
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set array URI
array_name = "single_array_udf"
array_uri = "tiledb://" + tiledb_account + "/" + array_name

# Set the UDF URI
udf_name = "median_single_array_py"
account_udf_name = tiledb_account + "/" + udf_name
udf_uri = "tiledb://" + account_udf_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the array
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)
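The setup code reads three environment variables and raises a bare KeyError if any of them is missing. A small pre-check can make that failure easier to diagnose. The sketch below is not part of the TileDB API: missing_env_vars is a hypothetical helper, and only the variable names come from this tutorial.

```python
import os

# Variable names taken from the setup code in this tutorial.
REQUIRED_VARS = ("TILEDB_REST_TOKEN", "TILEDB_ACCOUNT", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Report what still needs to be set before running the tutorial code.
missing = missing_env_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Passing a plain dictionary as environ makes the helper easy to exercise without touching the real environment.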
Next, create a dense array by specifying its schema (sparse arrays work similarly). When creating and registering arrays, the only difference between TileDB Cloud and TileDB Open-source is the URI, which on TileDB Cloud should have the form tiledb://<account>/s3://<bucket>/<array_name>. TileDB Cloud understands that you are creating an array at the S3 URI s3://<bucket>/<array_name> and registering it under <account>. After you create and register the array, you can access it as tiledb://<account>/<array_name>.
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array.
sch = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# Create and register the array on TileDB Cloud
array_uri_reg = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name
tiledb.Array.create(array_uri_reg, sch, ctx=cloud_ctx)
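The two URI forms described above can be made explicit with a pair of small helpers. This is only a sketch: the helper names are hypothetical, the account and bucket values are placeholders, and it assumes the value stored in S3_BUCKET already includes the s3:// scheme, which is what the string concatenation in the creation step implies.

```python
def registration_uri(account: str, s3_bucket: str, array_name: str) -> str:
    """URI used once, when creating and registering the array."""
    # Assumption: s3_bucket looks like "s3://my-bucket" (scheme included).
    return f"tiledb://{account}/{s3_bucket.rstrip('/')}/{array_name}"


def access_uri(account: str, array_name: str) -> str:
    """URI used for later reads, writes, and UDF calls."""
    return f"tiledb://{account}/{array_name}"


# Placeholder values for illustration only.
print(registration_uri("my_account", "s3://my-bucket", "single_array_udf"))
print(access_uri("my_account", "single_array_udf"))
```

Keeping the two forms in separate functions makes it harder to pass the registration URI to a later read or write by mistake.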
Populate the array by using a 2D NumPy array. Observe that the array URI now uses the form tiledb://<account>/<array_name>.
# Prepare some data in a NumPy array
data = np.array(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    dtype=np.int32,
)

# Write data to the array
with tiledb.open(array_uri, "w", ctx=cloud_ctx) as A:
    A[:] = data
Create a UDF that takes as input an array slice and computes the median value on attribute a.
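The definition of median does not appear at this point in the section. A minimal sketch follows, assuming the UDF receives a dictionary-like object that maps attribute names to NumPy arrays; the function body is an assumption, not necessarily the tutorial's exact code. The local check mimics the inclusive ranges (1, 2) x (1, 2) on the 4x4 data, which correspond to rows 0-1 and columns 0-1 in zero-based NumPy indexing.

```python
import numpy as np


def median(data):
    # Assumption: `data` maps attribute names to NumPy arrays,
    # e.g. {"a": array([[1, 2], [5, 6]])} for the slice used here.
    # Importing inside the function keeps the UDF self-contained.
    import numpy

    return numpy.median(data["a"])


# Local check without TileDB Cloud, using the tutorial's 4x4 values.
full = np.arange(1, 17, dtype=np.int32).reshape(4, 4)
# Inclusive ranges (1, 2) x (1, 2) on a domain that starts at 1
# select the values 1, 2, 5, 6.
local_slice = {"a": full[0:2, 0:2]}
print(median(local_slice))  # 3.5
```

The median of 1, 2, 5, 6 is 3.5, which agrees with the result reported for the cloud call.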
# The "apply" function takes as input the function, an array slice
# and any attribute subset, and passes to the function the result of
# that TileDB query, i.e., A.query(attrs=["a"]).multi_index[1:2, 1:2]
# (both ranges are inclusive)
with tiledb.open(array_uri, ctx=cloud_ctx) as A:
    results = A.apply(median, [(1, 2), (1, 2)], attrs=["a"])

print(results)
3.5
You can register the UDF with TileDB Cloud, which will allow you to browse it as part of your Assets in the TileDB Cloud UI, as well as call it using a TileDB Cloud name (in the form of tiledb://<account_name>/<udf_name>).
Clean up at the end by deleting the array and UDF. Observe that the standard TileDB object management functions work directly with tiledb:// URIs (that is, TileDB Cloud arrays). Also note that TileDB models a UDF as an array, so you can delete it the same way you delete an array.
# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the array
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)
Multi-array UDFs
First, import the necessary libraries, set the array and UDF URIs for TileDB Cloud, and delete any previously created arrays and UDFs with the same name.
# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.cloud

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the bucket from the environment variables
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set array URIs
array_name_1 = "multi_array_udf_py_1"
array_name_2 = "multi_array_udf_py_2"
array_uri_1 = "tiledb://" + tiledb_account + "/" + array_name_1
array_uri_2 = "tiledb://" + tiledb_account + "/" + array_name_2

# Set the UDF URI
udf_name = "addition_single_array_py"
account_udf_name = tiledb_account + "/" + udf_name
udf_uri = "tiledb://" + account_udf_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the arrays
    if tiledb.array_exists(array_uri_1):
        tiledb.Array.delete_array(array_uri_1)
    if tiledb.array_exists(array_uri_2):
        tiledb.Array.delete_array(array_uri_2)
You need two arrays this time: two dense arrays with an identical schema (any other arrays could be used instead). Create and populate them in the same way as in the single-array example, once for each of array_name_1 and array_name_2.
# Define the multi-array UDF, which adds attribute "a" of the two input arrays
def addition(data):
    # When you have multiple arrays, the parameter
    # we pass in is a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return data[0]["a"] + data[1]["a"]
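Because the UDF is an ordinary Python function, it can be exercised locally before it is sent to TileDB Cloud. The sketch below repeats the function so that it is self-contained and feeds it two hand-built ordered dictionaries. The input values are illustrative stand-ins for the query results, and the assumption is that each attribute arrives as a NumPy array.

```python
from collections import OrderedDict

import numpy as np


def addition(data):
    # `data` is a list with one ordered dictionary per input array,
    # in the order in which the arrays were requested.
    return data[0]["a"] + data[1]["a"]


# Hand-built stand-ins for the two query results (illustrative values).
first = OrderedDict(a=np.array([[1, 2], [3, 4]], dtype=np.int32))
second = OrderedDict(a=np.array([[10, 20], [30, 40]], dtype=np.int32))

# NumPy adds the two arrays element by element.
print(addition([first, second]))
```

A local check like this catches key or shape mistakes early, without paying for a round trip to the service.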
Apply the multi-array UDF to the two TileDB Cloud arrays.
# The following will create the list of arrays to take part
# in the multi-array UDF. Each has as input the array name,
# a multi-index for slicing and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()
array_list.add(array_uri_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_uri_2, [(1, 4), (1, 4)], ["a"])

# This will execute `addition` using as input the result of the
# slicing and attribute subselection for each of the arrays
# in `array_list`
result = tiledb.cloud.array.exec_multi_array_udf(addition, array_list)
print(result)
You can register the UDF with TileDB Cloud, which will allow you to browse it as part of your Assets in the TileDB Cloud UI, as well as call it using a TileDB Cloud name (in the form of tiledb://<account_name>/<udf_name>).
# Delete array and UDF, if they already exist
with tiledb.scope_ctx(cloud_ctx):
    # Delete the UDF, noting that TileDB Cloud stores UDFs as arrays as well
    if tiledb.array_exists(udf_uri):
        tiledb.Array.delete_array(udf_uri)

    # Delete the arrays
    if tiledb.array_exists(array_uri_1):
        tiledb.Array.delete_array(array_uri_1)
    if tiledb.array_exists(array_uri_2):
        tiledb.Array.delete_array(array_uri_2)