# Import necessary libraries
import tiledb
import tiledb.cloud
from tiledb.cloud.compute import Delayed, DelayedArrayUDF
import numpy as np
import os.path
# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Get the S3 bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Log in with your TileDB Cloud token
tiledb.cloud.login(token=tiledb_token)

# Set the array URI
array_name = "distributed_computing"
array_uri = "tiledb://" + tiledb_account + "/" + array_name

# The following context will carry the TileDB Cloud credentials
cloud_ctx = tiledb.cloud.Ctx()

# Delete the array, if it already exists
with tiledb.scope_ctx(cloud_ctx):
    if tiledb.array_exists(array_uri):
        tiledb.Array.delete_array(array_uri)
Distributed Compute
You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
This tutorial shows distributed compute on TileDB Cloud using task graphs. It assumes you have already completed the Get Started section. You will learn how to ingest data and perform array computations, all at scale in a distributed (i.e., parallel across multiple machines) fashion. For more information on distributed compute, visit the Key Concepts: Distributed Compute section.
Distributed array ingestion
In this section, you will learn how to write to an array in parallel with a task graph containing four write nodes, each writing to a different partition of the array.
First, import the necessary libraries, set the array URI for TileDB Cloud, and delete any previously created arrays with the same name.
Next, create a dense array by specifying its schema (the case of sparse arrays is similar). The only difference between TileDB Cloud and TileDB Open Source when creating and registering arrays is that the TileDB Cloud URI should be of the form tiledb://<account>/s3://<bucket>/<array_name>. TileDB Cloud understands that you are trying to create an array at S3 URI s3://<bucket>/<array_name> and register it under <account>. After you create and register the array, you can access it as tiledb://<account>/<array_name>.
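As a sketch of this step, the snippet below builds the two URI forms and a small dense schema. The account, bucket, and array names are hypothetical placeholders (in the tutorial they come from environment variables), and the 8x8 domain with a single int32 attribute "a" is an assumption. The creation call is guarded so it only runs when TileDB Cloud credentials are present.

```python
import os
import numpy as np

# Hypothetical placeholder values; in the tutorial these come from environment variables.
tiledb_account = "my_account"
s3_bucket = "my_bucket"
array_name = "distributed_computing"

# Create and register the array under this URI ...
creation_uri = f"tiledb://{tiledb_account}/s3://{s3_bucket}/{array_name}"
# ... then access it under this one.
array_uri = f"tiledb://{tiledb_account}/{array_name}"

# Only contact TileDB Cloud when credentials are available.
if "TILEDB_REST_TOKEN" in os.environ:
    import tiledb

    # An assumed 8x8 dense array with a single int32 attribute "a".
    dom = tiledb.Domain(
        tiledb.Dim(name="d1", domain=(0, 7), tile=2, dtype=np.int32),
        tiledb.Dim(name="d2", domain=(0, 7), tile=2, dtype=np.int32),
    )
    schema = tiledb.ArraySchema(
        domain=dom, sparse=False, attrs=[tiledb.Attr(name="a", dtype=np.int32)]
    )
    tiledb.Array.create(creation_uri, schema)
```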
Subsequently, you need a function that takes as input an array URI, a slice, and some data, and writes the data to the corresponding array slice. This will be called by each of the task graph nodes with different slice and data arguments to perform the entire array ingestion in parallel.
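A minimal sketch of such a write function is below. The name ingest_slice and the row_partitions helper (which splits the rows of an assumed array into contiguous, NumPy-style end-exclusive slice tuples) are hypothetical conveniences, not part of the TileDB API.

```python
import numpy as np

def ingest_slice(uri, dim_slices, data):
    """Write `data` into the region of the array at `uri` selected by
    `dim_slices` (a tuple of per-dimension slices)."""
    import tiledb  # imported inside the function so it also runs remotely as a UDF
    with tiledb.open(uri, mode="w") as arr:
        arr[dim_slices] = data

def row_partitions(n_rows, n_cols, n_parts):
    # Hypothetical helper: split an n_rows x n_cols array into n_parts
    # contiguous row bands, as NumPy-style (end-exclusive) slice tuples.
    step = n_rows // n_parts
    return [(slice(i * step, (i + 1) * step), slice(0, n_cols)) for i in range(n_parts)]
```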
You are now ready to create and run the task graph that will perform the parallel ingestion. There will be four nodes that perform the writes in parallel, all connected to a sink node that will run an empty function and complete the graph execution.
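The graph-building step might look like the following sketch, which partitions an assumed 8x8 array into four row bands, creates one Delayed writer node per band, and attaches all of them to an empty sink node. The URI, the band layout, and the per-node fill values are assumptions; the cloud portion is guarded so it only executes when credentials are present.

```python
import os
import numpy as np

array_uri = "tiledb://my_account/distributed_computing"  # hypothetical URI

def ingest_slice(uri, dim_slices, data):
    import tiledb  # imported inside the function so it can run remotely as a UDF
    with tiledb.open(uri, mode="w") as arr:
        arr[dim_slices] = data

# Partition an assumed 8x8 array into 4 row bands, one per writer node;
# writer i fills its band with the value i.
tasks = [
    ((slice(i * 2, (i + 1) * 2), slice(0, 8)), np.full((2, 8), i, dtype=np.int32))
    for i in range(4)
]

def finish():
    # Empty sink function: runs only after all writers complete.
    return "ingestion complete"

# Only build and run the task graph when TileDB Cloud credentials are available.
if "TILEDB_REST_TOKEN" in os.environ:
    from tiledb.cloud.compute import Delayed

    writers = [
        Delayed(ingest_slice, name=f"write_{i}")(array_uri, slc, data)
        for i, (slc, data) in enumerate(tasks)
    ]
    sink = Delayed(finish, name="sink")()
    for w in writers:
        sink.depends_on(w)
    sink.compute()  # executes the whole graph on TileDB Cloud
```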
Read the entire array to verify that the parallel ingestion was successful.
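A verification sketch, under the same assumptions as above (an 8x8 array where writer i filled row band i with the value i; hypothetical URI; the read is guarded behind a credentials check):

```python
import os
import numpy as np

array_uri = "tiledb://my_account/distributed_computing"  # hypothetical URI

# What the array should contain if writer i filled row band i with the value i.
expected = np.vstack([np.full((2, 8), i, dtype=np.int32) for i in range(4)])

if "TILEDB_REST_TOKEN" in os.environ:
    import tiledb
    import tiledb.cloud

    with tiledb.open(array_uri, ctx=tiledb.cloud.Ctx()) as arr:
        # Read the whole attribute "a" and compare against the expected contents.
        print(np.array_equal(arr[:]["a"], expected))
```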
Distributed array processing
This section continues the same code example as above to perform parallel computations over the array. Specifically, you will compute the sum over all the elements of the array you ingested. You will again use a task graph with four nodes, each reading a different array slice and computing a partial sum. These four nodes will be connected to a sink node that receives the four partial results and adds them together to produce the final sum over the entire array.
First, you need to create a function that will act as an array user-defined function (UDF); visit the Tutorials: User-Defined Functions section for dedicated examples on array UDFs.
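Such a UDF could be sketched as follows; it assumes the attribute is named "a", and relies on the fact that TileDB Cloud passes the selected slice to an array UDF as a dict-like object of NumPy arrays keyed by attribute name.

```python
import numpy as np

def partial_sum(data):
    # `data` is a dict-like of NumPy arrays, one entry per attribute.
    # Sum all elements of attribute "a" in the slice this node received.
    return int(np.sum(data["a"]))
```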
Next, create the task graph. Notice that this time the task graph consists of array UDFs, as opposed to parallel ingestion that used generic UDFs in each graph node.
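A sketch of this graph is below: four DelayedArrayUDF nodes, each receiving the inclusive per-dimension ranges of one row band of an assumed 8x8 array (domain 0..7), feeding a generic Delayed sink that adds the partial sums. The URI and band layout are assumptions, and the cloud portion is guarded behind a credentials check.

```python
import os
import numpy as np

array_uri = "tiledb://my_account/distributed_computing"  # hypothetical URI

def partial_sum(data):
    # Array UDF: sum the slice of attribute "a" this node receives.
    return int(np.sum(data["a"]))

def add(*partials):
    # Sink node: receives the four partial sums and combines them.
    return sum(partials)

# Inclusive per-dimension ranges for an assumed 8x8 array (domain 0..7),
# one row band per node.
ranges = [[(i * 2, i * 2 + 1), (0, 7)] for i in range(4)]

if "TILEDB_REST_TOKEN" in os.environ:
    from tiledb.cloud.compute import Delayed, DelayedArrayUDF

    nodes = [
        DelayedArrayUDF(array_uri, partial_sum, name=f"partial_{i}")(ranges[i])
        for i in range(4)
    ]
    total = Delayed(add, name="total")(*nodes)
    print(total.compute())  # the final sum over the entire array
```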
Clean up by removing the created array.
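The cleanup mirrors the deletion snippet from the beginning of the tutorial; here it is as a guarded sketch with a hypothetical URI:

```python
import os

array_uri = "tiledb://my_account/distributed_computing"  # hypothetical URI

if "TILEDB_REST_TOKEN" in os.environ:
    import tiledb
    import tiledb.cloud

    # Delete the array from TileDB Cloud, if it exists.
    with tiledb.scope_ctx(tiledb.cloud.Ctx()):
        if tiledb.array_exists(array_uri):
            tiledb.Array.delete_array(array_uri)
```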