TileDB-Vector-Search allows you to consolidate vector indexes to boost performance after numerous updates and deletions.
How to run this tutorial
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This will allow you to experiment quickly, avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This tutorial shows how you can consolidate your vector indexes with TileDB-Vector-Search, which can boost query performance, especially after you have performed numerous additions, updates, and deletions on your index. We recommend reading the following sections before proceeding with this tutorial:
First, import the appropriate libraries, set the index URI, and delete any previously created data.
# Import necessary libraries
import os
import shutil

import numpy as np

import tiledb.vector_search as vs

# Set the index URI for this tutorial
index_uri = os.path.expanduser("~/consolidation")

# Clean up previous data
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)
Next, create an empty index:
# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    uri=index_uri,
    dimensions=3,
    partitions=3,
    vector_type=np.dtype(np.float32)
)
Populate index
Perform a series of updates, starting with adding some new vectors in bulk. Note that here you will provide timestamp = 1 as a parameter to the update_batch command. In other tutorials, this parameter was omitted, and the timestamp defaulted to the current time (in milliseconds since the Unix epoch).
# Prepare some vectors to add
update_vectors = np.empty([5], dtype=object)
update_vectors[0] = np.array([0, 0, 0], dtype=np.dtype(np.float32))
update_vectors[1] = np.array([1, 1, 1], dtype=np.dtype(np.float32))
update_vectors[2] = np.array([2, 2, 2], dtype=np.dtype(np.float32))
update_vectors[3] = np.array([3, 3, 3], dtype=np.dtype(np.float32))
update_vectors[4] = np.array([4, 4, 4], dtype=np.dtype(np.float32))

# Add the vectors to the index, specifying a timestamp (1)
index.update_batch(
    vectors=update_vectors,
    external_ids=np.array([0, 1, 2, 3, 4]),
    timestamp=1
)
Next, update the values of some existing vectors one by one at timestamp = 2 and timestamp = 3, respectively. Then perform a bulk update at timestamp = 4.
# Delete the vectors with external ids 1 and 2, but at a later timestamp
index.delete_batch(external_ids=[1, 2], timestamp=5)
Output the directory structure of the index and observe the contents of the updates array. Inside its __commits subfolder, you can see files that start with __1_1, __2_2, …, __5_5, which encode the timestamps at which these updates happened. The __fragments subfolder contains similarly named subfolders holding the actual update data. This is how TileDB physically stores the different versions, which facilitates time traveling, i.e., querying the index at different times in the past.
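If you do not have a `tree`-like command handy, a small helper can print the layout with Python's standard library alone. This is a minimal sketch; the demo directory structure created below is illustrative, and on the tutorial's index you would call `print_tree(index_uri)` instead.

```python
import os
import tempfile


def print_tree(root: str) -> None:
    """Print a directory tree rooted at `root`, one entry per line."""
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        print("    " * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            print("    " * (depth + 1) + name)


# Demo on a throwaway structure that imitates an updates array layout
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "updates", "__commits"))
open(os.path.join(root, "updates", "__commits", "__1_1_demo.wrt"), "w").close()
print_tree(root)
```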
Despite the flexibility to query your index in the past, numerous updates and deletions can adversely impact performance. You can mitigate this by consolidating your index.
index = index.consolidate_updates()
Inspect the index folders again, and note that the updates array now contains a single update commit and a single update fragment. In addition, TileDB-Vector-Search created a new fragment and commit in each of the following directories: