Learn how to ingest basic vector data with TileDB-Vector-Search and perform similarity search.
How to run this tutorial
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This allows you to experiment quickly, avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
In this tutorial, you will learn how to ingest a set of vectors into an IVF_FLAT index and perform basic similarity search.
First, import the necessary libraries, set up the URIs you will use throughout the tutorial, and clean up any previously created directories and vector indexes.
# Import necessary libraries
import os
import tarfile
import shutil
import urllib.request
import numpy as np
import tiledb.vector_search as vs
from tiledb.vector_search.utils import load_fvecs, load_ivecs
import tiledb

# The URIs for the data to download and ingest
data_uri = "https://github.com/TileDB-Inc/TileDB-Vector-Search/releases/download/0.0.1/siftsmall.tgz"
data_filename = "siftsmall.tar.gz"
data_dir = os.path.expanduser("~/sift10k/")
local_data_path = os.path.join(data_dir, data_filename)
index_uri = os.path.expanduser("~/sift10k_ivf_flat")

# Clean up previous data
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)
Next, download and untar the source dataset.
# Create a directory to store the source dataset
os.makedirs(os.path.dirname(data_dir))

# Download the file that contains the vector dataset
urllib.request.urlretrieve(data_uri, local_data_path)

# Untar the file
tarfile.open(local_data_path, "r:gz").extractall(
    os.path.dirname(local_data_path), filter="fully_trusted"
)
Ingest
Once the tarball has been extracted, ingest the data into an IVF_FLAT index as follows:
index = vs.ingest(
    index_type="IVF_FLAT",
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    index_uri=index_uri,
    source_type="FVEC",
    partitions=100,
)
This ingestion creates a group with three arrays, which you can inspect as shown below. In addition, you can view the schema of the array that stores the vectors, as well as print the contents of the first vector.
# Show the physical group
group = tiledb.Group(index_uri, "r")
print("Index physical contents:\n")
print(group)

# Prepare the index for reading
index = vs.IVFFlatIndex(index_uri)

# Open the vector array to inspect it
print("Vector array URI:", index.db_uri, "\n")
A = tiledb.open(index.db_uri)

# Print the schema of the vector array
print("Vector array schema:\n")
print(A.schema)

# Print the first vector
print("Contents of first vector:\n")
print(A[:, 0]["values"])
To run similarity search on the ingested vectors, load the query and ground truth vectors from the siftsmall dataset you downloaded (you can use any vector to query this dataset). load_fvecs and load_ivecs are auxiliary functions in tiledb.vector_search.utils that read the query vectors and ground truth vectors. Then run the query and print the result vector IDs and their corresponding distances to the query, as follows:
# Get query vectors with ground truth
query_vectors = load_fvecs(os.path.join(data_dir, "siftsmall_query.fvecs"))
ground_truth = load_ivecs(os.path.join(data_dir, "siftsmall_groundtruth.ivecs"))

# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with IVF_FLAT
result_d, result_i = index.query(qv, k=100, nprobe=10)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)
You can check the result against the ground truth:
# For FLAT, the following will always be true, but for IVF_FLAT
# it might not be (as it's an approximate algorithm)
np.all(result_i == ground_truth[query_id])
False
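The mismatch is expected because IVF_FLAT is an approximate algorithm: it searches only the nprobe partitions closest to the query. If you need exact results, you can set nprobe equal to the number of partitions (100 in this tutorial), which searches every partition. The following sketch reuses the index, qv, ground_truth, and query_id variables from above; barring ties in distance, it should now print True:

# Searching all 100 partitions makes the IVF_FLAT query exhaustive (exact)
result_d, result_i = index.query(qv, k=100, nprobe=100)

# Compare against the ground truth again
print(np.all(result_i == ground_truth[query_id]))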
You can even run batches of searches, which are efficiently implemented in TileDB-Vector-Search:
# Simply provide more than one query vector
result_d, result_i = index.query(
    np.array([query_vectors[5], query_vectors[6]]), nprobe=10, k=100
)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)
You can also create vector indexes directly on TileDB Cloud. Note the following when creating indexes and registering them on TileDB Cloud:
When creating a new index, you need to use a URI in the form tiledb://<your_username>/<S3_path>/<index_name>, where S3_path is the location on S3 where you wish to physically store the index.
When referring to the index after creating it (e.g., when submitting queries), use a URI in the form tiledb://<your_username>/<index_name> (i.e., no need to specify the S3 physical path anymore).
Set up the URIs and clean up any previously created index with the same name on TileDB Cloud.
# Import the TileDB Cloud client
import tiledb.cloud

# Get your username
username = tiledb.cloud.user_profile().username

# Get the bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Set the index URIs
index_name = "cloud_index"
index_uri = "tiledb://" + username + "/" + index_name
index_reg_uri = "tiledb://" + username + "/" + s3_bucket + "/" + index_name

# The TileDB Cloud context
ctx = tiledb.cloud.Ctx()

# Clean up the index if it already exists
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(index_uri, recursive=True)
Create the index, using the tiledb://<your_username>/<S3_path>/<index_name> URI format explained above:
# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    ctx=ctx,
    uri=index_reg_uri,
    dimensions=3,
    partitions=3,
    vector_type=np.dtype(np.float32),
)
From this point onwards, you can populate and query the index using just the tiledb://<your_username>/<index_name> URI format explained above.
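As a rough illustration, here is a minimal sketch of adding a few vectors and querying them through the short URI. It reuses the index, index_uri, and imports from above; the vectors and external IDs are made up for this example, and it assumes your notebook is already authenticated against TileDB Cloud:

# Illustrative only: add three 3-dimensional vectors with explicit external IDs
index.update_batch(
    vectors=np.array(
        [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]], dtype=np.float32
    ),
    external_ids=np.array([1, 2, 3], dtype=np.uint64),
)

# Re-open the index with the short tiledb://<your_username>/<index_name> URI
index = vs.IVFFlatIndex(index_uri)

# Query for the 2 nearest neighbors of a single query vector
result_d, result_i = index.query(
    np.array([[2.0, 2.0, 2.0]], dtype=np.float32), k=2, nprobe=3
)
print(result_i, result_d)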
The following will delete the physical files of the index and deregister it from TileDB Cloud.
# Delete the physical index files and deregister the index from TileDB Cloud
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(index_uri, recursive=True)
What’s next?
Now that you know how to perform basic ingestion and similarity search, you are ready to learn how to perform updates and deletions.