You can version vector data in TileDB-Vector-Search as you perform updates and deletes.
How to run this tutorial
We recommend running this tutorial, like the others in the Tutorials section, inside TileDB Cloud. This lets you experiment quickly while avoiding installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. To learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This tutorial shows how TileDB-Vector-Search versions an index as you perform updates and deletions, and how those versions can later be used to time travel. We recommend reading the following sections before proceeding with this tutorial:
First, import the appropriate libraries, set the index URI, and delete any previously created data.
# Import necessary libraries
import os
import shutil
import numpy as np
import tiledb.vector_search as vs

# Set the index URI for this tutorial
index_uri = os.path.expanduser("~/versioning")

# Clean up previous data
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)
Next, create an empty index:
# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    uri=index_uri,
    dimensions=3,
    partitions=3,
    vector_type=np.dtype(np.float32)
)
Populate index
Perform a series of updates, starting with adding some new vectors in bulk. Note that here you pass timestamp = 1 as a parameter to the update_batch command. In other tutorials, this parameter was omitted, so the timestamp defaulted to the current time (in milliseconds since epoch).
# Prepare some vectors to add
update_vectors = np.empty([5], dtype=object)
update_vectors[0] = np.array([0, 0, 0], dtype=np.dtype(np.float32))
update_vectors[1] = np.array([1, 1, 1], dtype=np.dtype(np.float32))
update_vectors[2] = np.array([2, 2, 2], dtype=np.dtype(np.float32))
update_vectors[3] = np.array([3, 3, 3], dtype=np.dtype(np.float32))
update_vectors[4] = np.array([4, 4, 4], dtype=np.dtype(np.float32))

# Add the vectors to the index, specifying a timestamp (1)
index.update_batch(
    vectors=update_vectors,
    external_ids=np.array([0, 1, 2, 3, 4]),
    timestamp=1
)
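As a side note on the default mentioned above, the following minimal sketch shows what "current time in milliseconds since epoch" amounts to in Python. The helper names `current_timestamp_ms` and `resolve_timestamp` are hypothetical and are not part of TileDB-Vector-Search. They only illustrate the fallback behavior described in the text.

```python
import time


def current_timestamp_ms():
    """Return the current time as integer milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


def resolve_timestamp(timestamp=None):
    """Return the explicit timestamp if one is given, else the current time in ms.

    Hypothetical helper: it mirrors the default described in this tutorial,
    not the library's internal implementation.
    """
    return current_timestamp_ms() if timestamp is None else timestamp
```

Small explicit timestamps such as 1 through 5 make the resulting versions easy to recognize on disk, which is why this tutorial sets them by hand.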
Next, update the values of some existing vectors one by one at timestamp = 2 and timestamp = 3, respectively. Then perform a bulk update at timestamp = 4. Finally, delete two vectors at timestamp = 5.
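The single-vector and bulk updates described here could look like the sketch below. It assumes the index object exposes `update(vector, external_id, timestamp)` alongside the `update_batch` call used earlier. The wrapper function `apply_versioned_updates`, the external ids, and the vector values are illustrative assumptions rather than the tutorial's original listing.

```python
import numpy as np


def apply_versioned_updates(index):
    """Apply two single-vector updates (timestamps 2 and 3) and one bulk update (timestamp 4).

    `index` is assumed to expose `update` and `update_batch`, like the index
    created earlier in this tutorial. Ids and values are illustrative only.
    """
    # Overwrite the vector with external id 1 at timestamp 2
    index.update(
        vector=np.array([10, 10, 10], dtype=np.float32),
        external_id=1,
        timestamp=2,
    )
    # Overwrite the vector with external id 2 at timestamp 3
    index.update(
        vector=np.array([20, 20, 20], dtype=np.float32),
        external_id=2,
        timestamp=3,
    )
    # Overwrite the vectors with external ids 3 and 4 in bulk at timestamp 4
    bulk_vectors = np.empty([2], dtype=object)
    bulk_vectors[0] = np.array([30, 30, 30], dtype=np.float32)
    bulk_vectors[1] = np.array([40, 40, 40], dtype=np.float32)
    index.update_batch(
        vectors=bulk_vectors,
        external_ids=np.array([3, 4]),
        timestamp=4,
    )
```

Calling `apply_versioned_updates(index)` on the index created above would record one version per timestamp.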
# Delete the vectors with external ids 1 and 2, but at a later timestamp
index.delete_batch(external_ids=[1, 2], timestamp=5)
Inspect versions
Output the directory structure of the index and observe the contents of the updates array. Inside its __commits subfolder, you can see files whose names start with __1_1, __2_2, …, __5_5, which correspond to the timestamps at which these updates happened. The subfolders inside the __fragments subfolder follow a similar naming structure and contain the actual update data. This is how TileDB physically stores the different versions, which in turn makes time traveling possible, i.e., querying the index as it was at different times in the past.
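One way to check the naming pattern described above is to read the timestamp prefixes out of the __commits folder with the standard library. This is a minimal sketch: the helper `list_version_timestamps` is hypothetical, and the location of the updates array (assumed here to be `os.path.join(index_uri, "updates")`) may differ between library versions.

```python
import os
import re

# Matches the "__<start>_<end>" prefix described above, e.g. "__2_2_<suffix>".
_TIMESTAMP_PREFIX = re.compile(r"^__(\d+)_(\d+)(?:_|$)")


def list_version_timestamps(array_dir):
    """Return sorted (start, end) timestamp pairs found in array_dir/__commits.

    Hypothetical helper that relies only on the file-name prefix and returns
    an empty list when the __commits folder does not exist.
    """
    commits_dir = os.path.join(array_dir, "__commits")
    if not os.path.isdir(commits_dir):
        return []
    pairs = set()
    for name in os.listdir(commits_dir):
        match = _TIMESTAMP_PREFIX.match(name)
        if match:
            pairs.add((int(match.group(1)), int(match.group(2))))
    return sorted(pairs)
```

For example, `list_version_timestamps(os.path.join(index_uri, "updates"))` should list one pair per update performed in this tutorial, assuming the array lives at that path.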