Learn how to install TileDB-Vector-Search, build an index, and perform basic searches with this guided tutorial.
How to run this tutorial
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This lets you experiment quickly without any installation, deployment, or configuration work. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. To learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This is a quickstart tutorial on the basic vector search capabilities of TileDB.
First, import the necessary libraries, set up the URIs you will use throughout the tutorial, and clean up any previously created directories and vector indexes.
# Import necessary libraries
import os
import tarfile
import shutil
import urllib.request
import numpy as np
import tiledb.vector_search as vs
from tiledb.vector_search.utils import load_fvecs, load_ivecs
import tiledb

# The URIs for the data to download and ingest
data_uri = "https://github.com/TileDB-Inc/TileDB-Vector-Search/releases/download/0.0.1/siftsmall.tgz"
data_filename = "siftsmall.tar.gz"
data_dir = os.path.expanduser("~/sift10k/")
local_data_path = os.path.join(data_dir, data_filename)
index_uri = os.path.expanduser("~/sift10k_flat")

# Clean up previous data
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)
Next, download and untar the source dataset.
# Create a directory to store the source dataset
os.makedirs(os.path.dirname(data_dir))

# Download the file that contains the vector dataset
urllib.request.urlretrieve(data_uri, local_data_path)

# Untar the file
tarfile.open(local_data_path, "r:gz").extractall(
    os.path.dirname(local_data_path), filter="fully_trusted"
)
Ingest
Once the tarball has been extracted, ingest the data into a FLAT index (the simplest index in TileDB-Vector-Search) as follows:
index = vs.ingest(
    index_type="FLAT",
    index_uri=index_uri,
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    source_type="FVEC",
)
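For readers curious about what an .fvecs file holds, the following is a minimal sketch of the layout as documented for the SIFT datasets: each vector is stored as a little-endian 32-bit integer giving its dimension, followed by that many 32-bit floats. The helpers write_fvecs and read_fvecs are hypothetical names used only for illustration. They are not the implementation of load_fvecs in tiledb.vector_search.utils, and they use only the Python standard library.

```python
import os
import struct
import tempfile


def write_fvecs(path, vectors):
    """Write vectors in .fvecs layout: int32 dimension, then float32 components."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack(f"<{len(vec)}f", *vec))


def read_fvecs(path):
    """Read every vector from an .fvecs file into a list of float lists."""
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) != 4:
                raise ValueError("truncated dimension header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector payload")
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


# Round-trip two small vectors through a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fvecs")
    write_fvecs(demo_path, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    demo_vectors = read_fvecs(demo_path)

print(demo_vectors)
```

The .ivecs files follow the same layout with 32-bit integers in place of floats, which is why the ground truth file can store neighbor IDs.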
Inspect
This ingestion creates a group with three arrays, which you can display as shown below. You can also print the schema of the array that stores the vectors, as well as the contents of the first vector.
# Show the physical group
group = tiledb.Group(index_uri, "r")
print("Index physical contents:\n")
print(group)

# Prepare the index for reading
index = vs.FlatIndex(index_uri)

# Open the vector array to inspect it
print("Vector array URI:", index.db_uri, "\n")
A = tiledb.open(index.db_uri)

# Print the schema of the vector array
print("Vector array schema:\n")
print(A.schema)

# Print the first vector
print("Contents of first vector:\n")
print(A[:, 0]["values"])
To run similarity search on the ingested vectors, load the query vectors and the ground truth from the siftsmall dataset you downloaded. Note that you can use any vector to query this dataset. load_fvecs and load_ivecs are auxiliary functions in tiledb.vector_search.utils that load the query vectors and the ground truth. Then run the query and print the IDs of the result vectors and their distances to the query as follows:
# Get query vectors with ground truth
query_vectors = load_fvecs(os.path.join(data_dir, "siftsmall_query.fvecs"))
ground_truth = load_ivecs(os.path.join(data_dir, "siftsmall_groundtruth.ivecs"))

# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with FLAT
result_d, result_i = index.query(qv, k=100)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)
You can check the result against the ground truth:
# For FLAT, the following will always be true
# (np.all replaces np.alltrue, which was removed in NumPy 2.0)
np.all(result_i == ground_truth[query_id])
True
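The result matches the ground truth because a flat index compares the query against every stored vector. As a rough illustration of that idea, here is a minimal pure-Python sketch of exhaustive nearest-neighbor search. It assumes squared Euclidean distance as the metric, and flat_search and squared_l2 are hypothetical helper names. This is not how TileDB-Vector-Search implements its FLAT index. It only mirrors the (distances, ids) return order used in this tutorial.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(base, queries, k):
    """Exhaustively scan `base` and return (distances, ids) of the k nearest per query."""
    all_distances, all_ids = [], []
    for q in queries:
        # Score every base vector; ties are broken by the smaller id
        scored = ((squared_l2(q, v), i) for i, v in enumerate(base))
        top = heapq.nsmallest(k, scored)
        all_distances.append([d for d, _ in top])
        all_ids.append([i for _, i in top])
    return all_distances, all_ids


# Toy data: four 2-D base vectors and two queries
base = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1], [2.5, 2.5]]

toy_d, toy_i = flat_search(base, queries, k=2)
print(toy_i)  # [[1, 0], [3, 2]]
```

Because every vector is scored, the result is exact, at a cost that grows linearly with the number of stored vectors.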
You can even run batches of searches, which are efficiently implemented in TileDB-Vector-Search:
# Simply provide more than one query vector
result_d, result_i = index.query(
    np.array([query_vectors[5], query_vectors[6]]), k=100
)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)
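When you run a batch, it can be handy to summarize agreement with the ground truth as a single number instead of comparing row by row. The sketch below computes recall@k, the fraction of the true k nearest neighbors that appear in the returned top k, averaged over the batch. recall_at_k is a hypothetical helper written for this illustration, not part of the TileDB-Vector-Search API, and the toy IDs are made up.

```python
def recall_at_k(result_ids, ground_truth_ids, k):
    """Fraction of true top-k neighbors found in the returned top-k, over all queries."""
    hits = 0
    total = 0
    for found, expected in zip(result_ids, ground_truth_ids):
        # Order within the top k does not matter for recall
        hits += len(set(found[:k]) & set(expected[:k]))
        total += k
    return hits / total


# Toy batch of two queries (made-up ids)
toy_results = [[1, 2, 3], [4, 5, 6]]
toy_truth = [[1, 2, 9], [6, 5, 4]]

toy_recall = recall_at_k(toy_results, toy_truth, k=3)
print(toy_recall)
```

For the batch above, a call such as recall_at_k(result_i, [ground_truth[5], ground_truth[6]], k=100) should give 1.0 with a FLAT index, since the search is exhaustive.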
Now that you have a basic understanding of how to ingest data and run similarity search queries using TileDB-Vector-Search, you can continue learning about TileDB-Vector-Search:
The foundation docs explain how TileDB has implemented vector search.
The tutorials cover the broad functionality and use cases of vector search in TileDB.
The API reference provides more information about the usage of TileDB-Vector-Search.