Basic S3 Example with Vector Search Data

Demonstration of basic usage of TileDB-Vector-Search on Amazon S3.

How to run this tutorial

We recommend running this tutorial, and the other tutorials in the Tutorials section, inside TileDB Cloud. This lets you experiment quickly without any installation, deployment, or configuration. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. To run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.

This tutorial demonstrates how to use TileDB-Vector-Search to store a vector index on S3 and query it efficiently without downloading it. For more information on how TileDB works efficiently on object stores, visit the Array Key Concepts: Object Stores section.

Setup

Working with a vector index on S3 differs from working with a local index in only two ways:

Set the appropriate AWS credentials in environment variables and load them into a configuration object in a TileDB context.
Use an s3:// URI instead of a local path for the index location.

All other operations are identical to those on local indexes.
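
As an illustration of the second point, here is a minimal sketch of building the index URI. The helper `make_s3_uri` is hypothetical (it is not part of TileDB), and it assumes the bucket name may be supplied with or without the `s3://` scheme.

```python
def make_s3_uri(bucket: str, name: str) -> str:
    """Join a bucket and an object name into an s3:// URI (hypothetical helper)."""
    # Accept the bucket with or without the scheme and with stray slashes.
    bucket = bucket.strip()
    if bucket.startswith("s3://"):
        bucket = bucket[len("s3://"):]
    bucket = bucket.strip("/")
    if not bucket:
        raise ValueError("bucket name is empty")
    return f"s3://{bucket}/{name.strip('/')}"


# Both spellings of the bucket yield the same URI.
print(make_s3_uri("my-bucket", "basic_s3"))
print(make_s3_uri("s3://my-bucket/", "basic_s3"))
```

Normalizing the bucket this way avoids accidentally producing a relative local path when the scheme is missing.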

First, load the appropriate libraries, set the AWS credentials in a context, specify the index S3 URI, and delete any previously created index with the same URI.

# Import necessary libraries
import os

import numpy as np
import tiledb
import tiledb.vector_search as vs

# You should set the appropriate environment variables with your keys.
# Get the keys from the environment variables.
aws_access_key_id = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_access_key = os.environ["AWS_SECRET_ACCESS_KEY"]

# Get the bucket and region from environment variables
s3_bucket = os.environ["S3_BUCKET"]
s3_region = os.environ["S3_REGION"]

# Put the AWS keys and region in a TileDB config and create a context from it.
# Pass this context to every TileDB call that accesses S3.
cfg = tiledb.Config(
    {
        "vfs.s3.aws_access_key_id": aws_access_key_id,
        "vfs.s3.aws_secret_access_key": aws_secret_access_key,
        "vfs.s3.region": s3_region,
    }
)
ctx = tiledb.Ctx(cfg)

# Set the index URI. S3_BUCKET must include the scheme, e.g. s3://my-bucket
index_name = "basic_s3"
index_uri = s3_bucket + "/" + index_name

# Clean up previous data
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    with tiledb.Group(index_uri, "m", ctx=ctx) as g:
        g.delete(recursive=True)
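
The setup above reads each credential with `os.environ[...]`, which raises `KeyError` at the first missing variable. The following sketch shows one way to report all missing variables at once before building the configuration. The helper `s3_config_from_env` and the mapping `ENV_TO_CONFIG` are hypothetical names introduced for illustration. The `vfs.s3.*` keys are the ones used in the setup code.

```python
import os

# Environment variable -> TileDB config parameter, as used in the setup above.
ENV_TO_CONFIG = {
    "AWS_ACCESS_KEY_ID": "vfs.s3.aws_access_key_id",
    "AWS_SECRET_ACCESS_KEY": "vfs.s3.aws_secret_access_key",
    "S3_REGION": "vfs.s3.region",
}


def s3_config_from_env(env=None):
    """Return a dict of TileDB S3 settings (hypothetical helper)."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name in ENV_TO_CONFIG if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for name, key in ENV_TO_CONFIG.items()}


# Example with a stand-in environment; real keys would come from os.environ.
example_env = {
    "AWS_ACCESS_KEY_ID": "EXAMPLE_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET",
    "S3_REGION": "us-east-1",
}
settings = s3_config_from_env(example_env)
print(sorted(settings))
```

The resulting dictionary can be passed to `tiledb.Config(settings)` in the same way as the literal dictionary in the setup code.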

Ingestion and querying

Create an empty IVF_FLAT index.

# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    ctx=ctx, uri=index_uri, dimensions=3, partitions=3, vector_type=np.dtype(np.float32)
)

Add some vectors to the index.

# Add five vectors to the index in a single batch update
update_vectors = np.empty([5], dtype=object)
update_vectors[0] = np.array([0, 0, 0], dtype=np.dtype(np.float32))
update_vectors[1] = np.array([1, 1, 1], dtype=np.dtype(np.float32))
update_vectors[2] = np.array([2, 2, 2], dtype=np.dtype(np.float32))
update_vectors[3] = np.array([3, 3, 3], dtype=np.dtype(np.float32))
update_vectors[4] = np.array([4, 4, 4], dtype=np.dtype(np.float32))
index.update_batch(vectors=update_vectors, external_ids=np.array([0, 1, 2, 3, 4]))
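
The snippet above passes `update_batch` a one-dimensional object array whose elements are float32 vectors. When the vectors already sit in a two-dimensional NumPy array, a small conversion such as the sketch below produces the same layout. The helper `to_object_array` is a hypothetical name, not part of TileDB-Vector-Search.

```python
import numpy as np


def to_object_array(matrix):
    """Convert an (n, d) array into a 1-D object array of float32 vectors."""
    matrix = np.asarray(matrix, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n, d)")
    out = np.empty(matrix.shape[0], dtype=object)
    # Assign row by row so NumPy stores each vector as one object element.
    for i, row in enumerate(matrix):
        out[i] = row.copy()
    return out


# Rebuild the five example vectors [0,0,0] ... [4,4,4] from a 5x3 matrix.
matrix = np.arange(5, dtype=np.float32).repeat(3).reshape(5, 3)
update_vectors = to_object_array(matrix)
external_ids = np.arange(len(update_vectors))
```

The resulting `update_vectors` and `external_ids` hold the same values as the ones built by hand above.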

Query the index.

# Create a query
query_vector = np.array([[2, 2, 2]], dtype=np.float32)

# Perform the query
result_d, result_i = index.query(query_vector, k=3, nprobe=3)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)

Result vector ids:

[[2 3 1]]

Result vector distances:

[[0. 3. 3.]]
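
The reported distances can be checked with a brute-force search in plain NumPy. This sketch assumes the index reports squared Euclidean distance, which is consistent with the value 3 shown for vectors that lie at Euclidean distance √3 from the query. It does not use the index at all and is meant only as a sanity check.

```python
import numpy as np

vectors = np.array(
    [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=np.float32
)
external_ids = np.array([0, 1, 2, 3, 4])
query = np.array([2, 2, 2], dtype=np.float32)

# Squared Euclidean distance from the query to every stored vector.
sq_dists = np.sum((vectors - query) ** 2, axis=1)

# Indices of the k smallest distances; a stable sort keeps ties in id order.
k = 3
order = np.argsort(sq_dists, kind="stable")[:k]
top_ids = external_ids[order]
top_dists = sq_dists[order]

print(top_ids)
print(top_dists)
```

Vectors 1 and 3 are equally far from the query, so the two can appear in either order. The brute-force check lists id 1 first, while the index output above lists id 3 first.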

Clean up

Finally, clean up by removing the index:

# Clean up
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    with tiledb.Group(index_uri, "m", ctx=ctx) as g:
        g.delete(recursive=True)