Learn how to use user-defined functions to perform flexible analysis on TileDB Cloud.
How to run this tutorial
You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
User-defined functions (UDFs) allow users to run custom code on TileDB Cloud. Together with task graphs, these provide a means of dispatching parallel workloads across distributed workers on TileDB Cloud.
In population genomics, TileDB Cloud UDFs often access information from TileDB-VCF datasets and associated annotation. Some important points about UDFs:
You can create task graphs using UDFs and define the dependencies among them.
UDFs can be written in Python, R, or JavaScript. However, the TileDB-VCF API is limited to Python.
UDFs usually return results in Apache Arrow, pandas, or JSON.
R code can interact with Python-based UDFs using the Arrow/Feather format.
UDFs can be ad-hoc or registered. The latter enables easy code reusability.
The code in registered UDFs is visible within the TileDB Cloud UI console.
Registered UDFs can be shared with others in the TileDB Cloud UI console or programmatically.
Import the necessary libraries, and set the URIs that will be used in this tutorial. If you are running this from a local notebook, visit the Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (this step can be omitted inside a TileDB Cloud notebook).
import osimport tiledbimport tiledb.cloudimport tiledb.cloud.vcfimport tiledbvcf# Get your credentialstiledb_token = os.environ["TILEDB_REST_TOKEN"]# or use your username and password (not recommended)# tiledb_username = os.environ["TILEDB_USERNAME"]# tiledb_password = os.environ["TILEDB_PASSWORD"]# Public URIs dataset to be used in this tutorialgnomad_uri ="tiledb://TileDB-Inc/gnomad-4_0-include-nopass"variant_stats_uri ="tiledb://TileDB-Inc/6e6f9723-16f4-42eb-9ead-5d2bc6fba7cb"# Log into TileDB Cloudtiledb.cloud.login(token=tiledb_token)# or use your username and password (not recommended)# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
The following Python method queries the gnomAD and variant_stats arrays to return global allele frequency and internal allele frequency, respectively. Note how the import statements live inside the method.
# Devise a name for the UDFuser_profile = tiledb.cloud.user_profile()username = user_profile.usernameudf_name ="allelestats"udf_full_path =f"{username}/{udf_name}"# Register the UDF to TileDB Cloudtiledb.cloud.udf.register_generic_udf( func=allelestats, name=udf_name, namespace=user_profile.username, include_source_lines=True,)
You can find your registered UDF under Assets -> Code -> UDFs.
Run the registered UDF, this time on a SNP known to affect hair color (chr16:89820111C>T). Even though the UDF is written in Python, both Python and R can read the result.