Tables Quickstart

This tutorial covers the basics of working with tabular data using TileDB.
How to run this tutorial

We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. By using TileDB Cloud, you can experiment while avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.

This tutorial offers a rapid introduction to TileDB’s tabular support and its capabilities. It covers the following topics:

  • Ingest data from a CSV file into a new TileDB array.
  • Run dataframe queries with a pandas-like API, retrieving the results in a pandas dataframe.
  • Run SQL queries and return the results in a pandas dataframe.

Setup

Start by importing the libraries used in this tutorial, setting the URIs you will use throughout this tutorial, and cleaning up any older data with the same name.

import warnings

warnings.filterwarnings("ignore")

import os.path
import shutil

import pandas as pd
import tiledb
import tiledb.sql

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-SQL version: {}".format(tiledb.sql.version))

# Set CSV dataset URI
table_uri = os.path.expanduser("~/my_table")
csv_uri = (
    "s3://tiledb-inc-demo-data/examples/notebooks/nyc_yellow_tripdata/taxi_first_10.csv"
)

# Define config values and context
cfg = tiledb.Config({"vfs.s3.no_sign_request": "true", "vfs.s3.region": "us-east-1"})
ctx = tiledb.Ctx(cfg)

# Clean up past data
if os.path.exists(table_uri):
    shutil.rmtree(table_uri)

Ingestion

This process ingests a small CSV file directly from a public S3 bucket into a local TileDB array, without needing to download the source CSV beforehand. The ingestion should take a few seconds on your laptop. You will use a small subset of the New York City Taxi and Limousine Commission Trip Record Data dataset.

You can ingest with a single command, without needing to define the tabular schema beforehand:

tiledb.from_csv(
    table_uri,  # TileDB array to create
    csv_uri,  # CSV to load
    ctx=ctx,  # Context
    parse_dates=["tpep_dropoff_datetime", "tpep_pickup_datetime"],
)  # Parse these fields as datetimes

You need to pass the following cfg options into ctx to ingest directly from a public S3 bucket:

  1. "vfs.s3.no_sign_request": Set to "true" to access the public data anonymously, without needing AWS access credentials.
  2. "vfs.s3.region": Set to "us-east-1" to match the region of the public TileDB demo data. This avoids access errors when your local environment defaults to a different AWS region.

Next, open the table in read mode so you can inspect its schema and run queries.

# Open the table in read mode
table = tiledb.open(table_uri, mode="r")

Inspect the schema of the underlying array on which this particular table is based:

# Print the underlying array schema
print(table.schema)
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 9), tile=10, dtype='uint64', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='VendorID', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='tpep_pickup_datetime', dtype='datetime64[ns]', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='tpep_dropoff_datetime', dtype='datetime64[ns]', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='passenger_count', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='trip_distance', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='RatecodeID', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='store_and_fwd_flag', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='PULocationID', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='DOLocationID', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='payment_type', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='fare_amount', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='extra', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='mta_tax', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='tip_amount', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='tolls_amount', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='improvement_surcharge', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='total_amount', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='congestion_surcharge', dtype='float64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  sparse=False,
)

TileDB materializes the ingested table as a directory on your local storage, modeled by default as a dense TileDB array with an extra row-ID dimension (named __tiledb_rows in the schema above). You can run !tree {table_uri} in a notebook cell to see the file hierarchy inside the dataset directory. For more details on the meaning of those different TileDB objects and the different ways TileDB arrays can model tabular data, visit the Data Model section.
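
If you are not in a notebook (where !tree is available), a few lines of standard-library Python can list the same hierarchy. The list_tree helper below is illustrative, not part of TileDB, and is demonstrated on a throwaway directory; point it at table_uri to inspect the real dataset directory.

```python
import os
import tempfile


def list_tree(root):
    """Return the sorted relative paths of every entry under root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(paths)


# Demonstrate on a throwaway directory that mimics a dataset layout
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "__schema"))
    open(os.path.join(root, "__schema", "schema_file"), "w").close()
    print("\n".join(list_tree(root)))
```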

Read data using dataframes

You can read data into a pandas dataframe with the .df[] method, which lets you slice on the dimensions:

# Read entire dataset into a pandas dataframe
df = table.df[:]  # Equivalent to: table.df[0:9]
df
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 1 2020-01-01 00:28:15 2020-01-01 00:33:03 1 1.20 1 N 238 239 1 6.00 3.0 0.5 1.47 0 0.3 11.27 2.5
1 1 2020-01-01 00:35:39 2020-01-01 00:43:04 1 1.20 1 N 239 238 1 7.00 3.0 0.5 1.50 0 0.3 12.30 2.5
2 1 2020-01-01 00:47:41 2020-01-01 00:53:52 1 0.60 1 N 238 238 1 6.00 3.0 0.5 1.00 0 0.3 10.80 2.5
3 1 2020-01-01 00:55:23 2020-01-01 01:00:14 1 0.80 1 N 238 151 1 5.50 0.5 0.5 1.36 0 0.3 8.16 0.0
4 2 2020-01-01 00:01:58 2020-01-01 00:04:16 1 0.00 1 N 193 193 2 3.50 0.5 0.5 0.00 0 0.3 4.80 0.0
5 2 2020-01-01 00:09:44 2020-01-01 00:10:37 1 0.03 1 N 7 193 2 2.50 0.5 0.5 0.00 0 0.3 3.80 0.0
6 2 2020-01-01 00:39:25 2020-01-01 00:39:29 1 0.00 1 N 193 193 1 2.50 0.5 0.5 0.01 0 0.3 3.81 0.0
7 2 2019-12-18 15:27:49 2019-12-18 15:28:59 1 0.00 5 N 193 193 1 0.01 0.0 0.0 0.00 0 0.3 2.81 2.5
8 2 2019-12-18 15:30:35 2019-12-18 15:31:35 4 0.00 1 N 193 193 1 2.50 0.5 0.5 0.00 0 0.3 6.30 2.5
9 1 2020-01-01 00:29:01 2020-01-01 00:40:28 2 0.70 1 N 246 48 1 8.00 3.0 0.5 2.35 0 0.3 14.15 2.5

To slice a subset of rows, pass the desired range in the .df[] method:

# Pass a specific range in for the row_id to subset select
table.df[3:5]
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
3 1 2020-01-01 00:55:23 2020-01-01 01:00:14 1 0.80 1 N 238 151 1 5.5 0.5 0.5 1.36 0 0.3 8.16 0.0
4 2 2020-01-01 00:01:58 2020-01-01 00:04:16 1 0.00 1 N 193 193 2 3.5 0.5 0.5 0.00 0 0.3 4.80 0.0
5 2 2020-01-01 00:09:44 2020-01-01 00:10:37 1 0.03 1 N 7 193 2 2.5 0.5 0.5 0.00 0 0.3 3.80 0.0

Read data using SQL

Along with dataframe APIs, TileDB also supports ANSI SQL via an integration with MariaDB. TileDB-SQL connections follow the Python DB-API, so you can create a connection and pass it to any system that accepts one, such as pandas.

# Query with SQL
db = tiledb.sql.connect(init_command="SET GLOBAL time_zone='+00:00'")
pd.read_sql(sql=f"SELECT * FROM `{table_uri}`", con=db)
__tiledb_rows congestion_surcharge improvement_surcharge tolls_amount tip_amount VendorID payment_type DOLocationID mta_tax tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID total_amount fare_amount extra
0 0 2.5 0.3 0 1.47 1 1 239 0.5 2020-01-01 00:28:15 2020-01-01 00:33:03 1 1.20 1 N 238 11.27 6.00 3.0
1 1 2.5 0.3 0 1.50 1 1 238 0.5 2020-01-01 00:35:39 2020-01-01 00:43:04 1 1.20 1 N 239 12.30 7.00 3.0
2 2 2.5 0.3 0 1.00 1 1 238 0.5 2020-01-01 00:47:41 2020-01-01 00:53:52 1 0.60 1 N 238 10.80 6.00 3.0
3 3 0.0 0.3 0 1.36 1 1 151 0.5 2020-01-01 00:55:23 2020-01-01 01:00:14 1 0.80 1 N 238 8.16 5.50 0.5
4 4 0.0 0.3 0 0.00 2 2 193 0.5 2020-01-01 00:01:58 2020-01-01 00:04:16 1 0.00 1 N 193 4.80 3.50 0.5
5 5 0.0 0.3 0 0.00 2 2 193 0.5 2020-01-01 00:09:44 2020-01-01 00:10:37 1 0.03 1 N 7 3.80 2.50 0.5
6 6 0.0 0.3 0 0.01 2 1 193 0.5 2020-01-01 00:39:25 2020-01-01 00:39:29 1 0.00 1 N 193 3.81 2.50 0.5
7 7 2.5 0.3 0 0.00 2 1 193 0.0 2019-12-18 15:27:49 2019-12-18 15:28:59 1 0.00 5 N 193 2.81 0.01 0.0
8 8 2.5 0.3 0 0.00 2 1 193 0.5 2019-12-18 15:30:35 2019-12-18 15:31:35 4 0.00 1 N 193 6.30 2.50 0.5
9 9 2.5 0.3 0 2.35 1 1 48 0.5 2020-01-01 00:29:01 2020-01-01 00:40:28 2 0.70 1 N 246 14.15 8.00 3.0

TileDB offers full support for joins, GROUP BY clauses, aggregates, and more.

# A more complex SQL query
db = tiledb.sql.connect(init_command="SET GLOBAL time_zone='+00:00'")
pd.read_sql(
    sql=f"""
        SELECT
            COUNT(*) AS trip_count,
            MAX(total_amount) AS max_total_amount,
            AVG(tip_amount) AS average_tip_amount,
            PULocationID
        FROM `{table_uri}`
        GROUP BY PULocationID
    """,
    con=db,
)
trip_count max_total_amount average_tip_amount PULocationID
0 1 3.80 0.000000 7
1 4 6.30 0.002500 193
2 3 11.27 1.276667 238
3 1 12.30 1.500000 239
4 1 14.15 2.350000 246

Clean up

Finally, clean up by deleting the table:

# Clean up
if os.path.exists(table_uri):
    shutil.rmtree(table_uri)