A demonstration of distributed queries on TileDB Cloud.
How to run this tutorial
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. By using TileDB Cloud, you can experiment while avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This tutorial shows distributed querying on TileDB Cloud. It assumes you have already completed the Get Started section.
You can interface with TileDB Cloud using the same TileDB libraries used in the rest of the TileDB Tables tutorials. To authenticate with TileDB Cloud, you need to pass a REST API token into the configuration of a TileDB context object. While it's possible to pass your username and password to the TileDB Cloud API for authentication, we strongly recommend using a REST API token instead, as it offers finer-grained control and protects your login credentials from misuse. This tutorial assumes you have already stored your REST API token (or username and password) as environment variables.
Tip
When working inside a TileDB Cloud notebook environment, you’re already authenticated and do not need to create a REST API token.
First, import the necessary libraries.
import os
import warnings

import numpy as np
import tiledb
import tiledb.cloud.dag  # needed for the task graph used later in this tutorial
import tiledb.sql

warnings.filterwarnings("ignore")

# Set the appropriate environment variables.
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
tiledb_account = os.environ["TILEDB_ACCOUNT"]

# Set the table URI you will use in this tutorial.
table_uri = "tiledb://TileDB-Inc/nyc_tlc_yellow_trip_data_2016-2022"

# Set the context config.
cfg = tiledb.Config({"vfs.s3.region": "us-east-1", "rest.token": tiledb_token})
ctx = tiledb.Ctx(cfg)
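If you are running outside a TileDB Cloud notebook, you may also need to authenticate the tiledb.cloud client explicitly. A minimal sketch, assuming the tiledb-cloud package is installed and the same token applies:

import tiledb.cloud

# Authenticate the tiledb.cloud client with the same REST API token.
tiledb.cloud.login(token=tiledb_token)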
Now you will query at scale by using a TileDB Cloud task graph.
Define two Python functions: one to combine partial results into the final distributed average, and one to compute the partial averages themselves.
def compute_python_avg(partial_results):
    """Function to perform the average from partials."""
    sum_num = 0
    count = 0
    for result in partial_results:
        sum_num += result["sum"]
        count += result["count"]
    return {"average": sum_num / count, "sum": sum_num, "count": count}
def compute_partial_averages(data, previous_data=None):
    """Function to perform partial averages."""
    sum_num = 0
    count = len(data["fare_amount"])
    for record in data["fare_amount"]:
        sum_num += record
    if previous_data is not None:
        sum_num += previous_data["sum"]
        count += previous_data["count"]
    return {"sum": sum_num, "count": count}
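As a quick sanity check, you can combine the two helpers locally; the fare values below are made up purely for illustration:

# Hypothetical toy partitions to verify the helpers locally.
p1 = compute_partial_averages({"fare_amount": [10.0, 20.0]})
p2 = compute_partial_averages({"fare_amount": [30.0]})
print(compute_python_avg([p1, p2]))
# {'average': 20.0, 'sum': 60.0, 'count': 3}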
To partition the work, you will query by week for an entire year. Select the year 2021 and use NumPy to split it into weekly date ranges.
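A minimal sketch of this step, assuming 7-day boundaries built with np.arange starting on January 1, 2021, and stored in a dates array that the task graph below consumes (the exact boundaries used originally may differ):

# Weekly partition boundaries for 2021 (assumed; built with np.arange).
dates = np.arange(
    np.datetime64("2021-01-01"),
    np.datetime64("2022-01-01"),
    np.timedelta64(7, "D"),
)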
Now you can define the task graph by using the dag API. Using the date ranges defined earlier, loop over each weekly range, call dag.submit_array_udf for each week, and pass in the partial average function, the date range for that week, and an attribute selection that limits the query to only the fare_amount field.
nodes = []
tasks_to_run_in_parallel = len(dates)
dag = tiledb.cloud.dag.DAG(
    name="Distributed Average",
    namespace=tiledb_account,
    max_workers=tasks_to_run_in_parallel,
)

# Loop over every week to compute partitions in parallel.
for index in range(len(dates) - 1):
    # Define start and end boundaries.
    start = dates[index].astype("datetime64[ns]")
    end = dates[index] + np.timedelta64(7, "D") - np.timedelta64(1, "ns")
    # Submit an array UDF accessing only fare_amount.
    node = dag.submit_array_udf(
        table_uri,
        compute_partial_averages,
        name=f'"{start}"',
        ranges=[(start, end), []],
        attrs=["fare_amount"],
    )
    nodes.append(node)

# Accumulate and tabulate the final results.
results = dag.submit_local(compute_python_avg, nodes)
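Before running the graph, you can optionally inspect its structure; DAG objects in the tiledb.cloud client provide a visualize method (assuming a notebook environment, where it renders inline):

# Optional: render the task graph structure inline (notebook only).
dag.visualize()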
Running the graph should take only a few seconds and demonstrates how TileDB Cloud can return millions of records per second when issuing distributed queries.
# Kick off distributed computation.
import time

start_time = time.time()
dag.compute()
average = results.result()
end_time = time.time()

print(f"results: {average}")
duration = end_time - start_time
print(f"Distributed average took {duration}s yielding {average['count'] / duration} records per second")
results: {'average': 13.518565198990611, 'sum': 833644520.4798127, 'count': 61666642}
Distributed average took 5.920797109603882s yielding 10415260.117590092 records per second