Tables Quickstart
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. By using TileDB Cloud, you can experiment while avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This tutorial offers a rapid introduction to TileDB’s tabular support and its capabilities. It covers the following topics:
- Ingest data from a CSV file into a new TileDB array.
- Run dataframe queries with a pandas-like API, retrieving the results in a pandas dataframe.
- Run SQL queries and return the results in a pandas dataframe.
Setup
Start by importing the libraries used in this tutorial, setting the URIs you will use throughout this tutorial, and cleaning up any older data with the same name.
```python
import warnings

warnings.filterwarnings("ignore")

import os.path
import shutil

import pandas as pd
import tiledb
import tiledb.sql

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-SQL version: {}".format(tiledb.sql.version))

# Set table and CSV dataset URIs
table_uri = os.path.expanduser("~/my_table")
csv_uri = (
    "s3://tiledb-inc-demo-data/examples/notebooks/nyc_yellow_tripdata/taxi_first_10.csv"
)

# Define config values and context
cfg = tiledb.Config({"vfs.s3.no_sign_request": "true", "vfs.s3.region": "us-east-1"})
ctx = tiledb.Ctx(cfg)

# Clean up past data
if os.path.exists(table_uri):
    shutil.rmtree(table_uri)
```
Ingestion
This process ingests some small CSV data directly from a public S3 bucket into a local TileDB array, without needing to download the source CSV files beforehand. The ingestion should take only a few seconds on a laptop. You will use a subset of the New York City Taxi and Limousine Commission Trip Record Data dataset.
You can ingest with a single command, without needing to define the tabular schema beforehand:
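A minimal sketch of that command, assuming the default tiledb.from_csv behavior with the ctx defined during setup:

```python
# Ingest the CSV into a new TileDB array, letting
# tiledb.from_csv infer the schema from the file.
tiledb.from_csv(table_uri, csv_uri, ctx=ctx)
```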
You need to pass the following cfg options into ctx to ingest directly from a public S3 bucket:
- "vfs.s3.no_sign_request": Set to "true" so that requests are unsigned, removing the need for AWS access credentials when accessing public data.
- "vfs.s3.region": Set to "us-east-1" to match the location of the public TileDB demo data. This avoids file access issues caused by a different default region in a local environment.
First, prepare the table for reading, so that you can then inspect its schema and run queries.
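A sketch of that step, opening the array backing the table in read mode:

```python
# Open the array for reading, using the same context as ingestion.
A = tiledb.open(table_uri, ctx=ctx)
```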
Inspect the schema of the underlying array on which this particular table is based:
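For example, printing the schema shows the dimensions and attributes inferred during ingestion:

```python
# Display the array schema: the row_id dimension plus
# one attribute per CSV column.
print(A.schema)
```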
TileDB materializes the created tabular dataset as a directory on your local storage, modeling the table as a dense TileDB array with an added row_id dimension (ingesting a CSV produces a dense array by default, with the dimension being the row ID). You can run !tree {table_uri} to see the file hierarchy inside the dataset directory. For more details on the meaning of those different TileDB objects and the different ways TileDB arrays can model tabular data, visit the Data Model section.
Read data using dataframes
You can read data into a pandas dataframe with the .df[] method, which allows setting the dimensions on which to slice:
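A sketch of a full read, where the empty slice selects all rows:

```python
# Read the entire table into a pandas dataframe.
A.df[:]
```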
To slice a subset of rows, pass the desired range in the .df[] method:
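For example, this sketch reads the first five rows, assuming TileDB's inclusive range semantics on the row_id dimension:

```python
# Slice rows 0 through 4 (TileDB ranges are inclusive on both ends).
A.df[0:4]
```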
Read data using SQL
Along with dataframe APIs, TileDB also supports ANSI SQL via an integration with MariaDB. TileDB-SQL follows the Python DB-API, so you can create a connection and pass it to any system that accepts one, such as pandas.
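A sketch of that pattern, assuming the tiledb.sql connection API and querying the array by its URI:

```python
# Create an embedded MariaDB connection and run a SQL query via pandas.
db = tiledb.sql.connect()
pd.read_sql(sql=f"SELECT * FROM `{table_uri}`", con=db)
```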
TileDB offers full support for joins, group bys, aggregates, and more.
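For instance, a hypothetical aggregation over the taxi data (the column names passenger_count and total_amount are assumptions about the ingested CSV):

```python
# A GROUP BY aggregate over the same connection; column names are assumed.
query = f"SELECT passenger_count, AVG(total_amount) FROM `{table_uri}` GROUP BY passenger_count"
pd.read_sql(sql=query, con=db)
```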
Clean up
Finally, clean up by deleting the table.
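A sketch of the cleanup, mirroring the setup step:

```python
# Delete the local array directory created by this tutorial.
if os.path.exists(table_uri):
    shutil.rmtree(table_uri)
```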