Basic S3 Example
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. By using TileDB Cloud, you can experiment while avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This tutorial shows how to use TileDB’s tabular offering to store a table on S3 and query it efficiently without the need to download it locally. For more information on how TileDB works efficiently on object stores, visit the Array Key Concepts: Object Stores section.
Working with tables on S3 differs from working with local tables in only two ways:
- Set the appropriate AWS credentials in environment variables and load them into a configuration object in a TileDB context.
- Use an s3:// URI instead of a local path for the table location.
Other than those differences, the rest of the operations are the same as for local tables.
First, load the appropriate libraries, set the AWS credentials in a context, specify the table S3 URI, and delete any already-created table with the same URI.
# Import necessary libraries
import os
import tiledb
import tiledb.sql
import pandas as pd

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
aws_access_key_id = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_access_key = os.environ["AWS_SECRET_ACCESS_KEY"]

# Get the bucket and region from environment variables
s3_bucket = os.environ["S3_BUCKET"]
s3_region = os.environ["S3_REGION"]

# Set the AWS keys and region in the config of the default context.
# This context initialization can be performed only once.
cfg = tiledb.Config(
    {
        "vfs.s3.region": s3_region,
        "vfs.s3.aws_access_key_id": aws_access_key_id,
        "vfs.s3.aws_secret_access_key": aws_secret_access_key,
        "vfs.s3.no_sign_request": True,
    }
)
ctx = tiledb.Ctx(cfg)

# Set table URIs
table_name = "basic_s3"
table_uri = s3_bucket + "/" + table_name
csv_uri = (
    "s3://tiledb-inc-demo-data/examples/notebooks/nyc_yellow_tripdata/taxi_first_10.csv"
)

# Clean up previous data
if tiledb.array_exists(table_uri, ctx=ctx):
    tiledb.Array.delete_array(table_uri, ctx=ctx)
Ingest a CSV file to create the table on S3.
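A minimal sketch of this step, assuming tiledb.from_csv with default ingestion options and the csv_uri, table_uri, and ctx variables from the setup code:
# Ingest the demo CSV into a TileDB table stored at the S3 URI.
# (Sketch: the exact ingestion options used by the tutorial may differ.)
tiledb.from_csv(table_uri, csv_uri, ctx=ctx)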
Query the table with SQL.
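A minimal sketch of this step, assuming the tiledb.sql embedded connector imported in the setup code; the SELECT statement and row limit are illustrative:
# Connect with the embedded SQL engine and query the table by its S3 URI,
# loading the result into a pandas DataFrame.
# (Sketch: connection options may differ in your environment.)
db = tiledb.sql.connect()
df = pd.read_sql(sql=f"SELECT * FROM `{table_uri}` LIMIT 5", con=db)
print(df)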
Finally, clean up by removing the table from S3:
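This reuses the same calls shown in the setup code:
# Remove the table from S3 if it exists.
if tiledb.array_exists(table_uri, ctx=ctx):
    tiledb.Array.delete_array(table_uri, ctx=ctx)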