Version Vector Search Data

You can version vector data in TileDB-Vector-Search as you perform updates and deletes.

How to run this tutorial

We recommend running this tutorial, like the other tutorials in the Tutorials section, inside TileDB Cloud. This lets you experiment quickly without any installation, deployment, or configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. To run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.

This tutorial shows how TileDB-Vector-Search versions an index as you perform updates and deletions, which you can later use to time travel. We recommend reading the following sections before proceeding with this tutorial:

Setup

First, import the appropriate libraries, set the index URI, and delete any previously created data.

# Import necessary libraries
import os
import shutil

import numpy as np
import tiledb.vector_search as vs

# Set the index URI for this tutorial
index_uri = os.path.expanduser("~/versioning")

# Clean up previous data
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)

Next, create an empty index:

# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    uri=index_uri, dimensions=3, partitions=3, vector_type=np.dtype(np.float32)
)

Populate index

Perform a series of updates, starting with adding some new vectors in bulk. Note that here you pass timestamp = 1 as a parameter to the update_batch command. Other tutorials omit this parameter, in which case the timestamp defaults to the current time (in milliseconds since epoch).

# Prepare some vectors to add
update_vectors = np.empty([5], dtype=object)
update_vectors[0] = np.array([0, 0, 0], dtype=np.dtype(np.float32))
update_vectors[1] = np.array([1, 1, 1], dtype=np.dtype(np.float32))
update_vectors[2] = np.array([2, 2, 2], dtype=np.dtype(np.float32))
update_vectors[3] = np.array([3, 3, 3], dtype=np.dtype(np.float32))
update_vectors[4] = np.array([4, 4, 4], dtype=np.dtype(np.float32))

# Add the vectors to the index, specifying a timestamp (1)
index.update_batch(
    vectors=update_vectors, external_ids=np.array([0, 1, 2, 3, 4]), timestamp=1
)
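To make the default timestamp mentioned above concrete, here is a small standard-library sketch of what "current time in milliseconds since epoch" looks like next to the small explicit timestamps used in this tutorial. The helper names now_ms and ms_to_utc are hypothetical and are not part of TileDB-Vector-Search.

```python
import time
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def now_ms() -> int:
    """Return the current time as integer milliseconds since the Unix epoch."""
    return time.time_ns() // 1_000_000


def ms_to_utc(ts_ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ts_ms)


wall_clock_ts = now_ms()
print(wall_clock_ts, ms_to_utc(wall_clock_ts).isoformat())

# The explicit timestamps in this tutorial (1 to 5) are far smaller than any
# wall-clock timestamp, so they sort before updates written with the default.
print(all(ts < wall_clock_ts for ts in range(1, 6)))
```

Small explicit timestamps keep the example easy to follow. In practice you would probably let the timestamp default to the current time, or pass meaningful wall-clock values.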

Next, update the values of some existing vectors one by one at timestamp = 2 and timestamp = 3, respectively. Then perform a bulk update at timestamp = 4.

# Update vectors individually
index.update(
    vector=np.array([10, 10, 10], dtype=np.dtype(np.float32)),
    external_id=1,
    timestamp=2,
)
index.update(
    vector=np.array([11, 11, 11], dtype=np.dtype(np.float32)),
    external_id=2,
    timestamp=3,
)

# Update vectors in bulk
update_vectors = np.empty([2], dtype=object)
update_vectors[0] = np.array([1, 1, 1], dtype=np.dtype(np.float32))
update_vectors[1] = np.array([2, 2, 2], dtype=np.dtype(np.float32))
index.update_batch(vectors=update_vectors, external_ids=np.array([1, 2]), timestamp=4)

Now, delete some vectors at timestamp = 5:

# Delete the vectors with external IDs 1 and 2 at timestamp 5
index.delete_batch(external_ids=[1, 2], timestamp=5)
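To see what these five timestamps mean logically, the sketch below replays the same sequence of writes in plain Python and reports which vectors are visible at each timestamp. It is only a mental model of time travel under the assumption that later writes override earlier ones for the same external ID. It does not use TileDB-Vector-Search and says nothing about how the library stores or merges updates internally. The names update_log and state_at are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each entry is (timestamp, external_id, vector); a vector of None marks a deletion.
Entry = Tuple[int, int, Optional[List[float]]]

update_log: List[Entry] = [
    (1, 0, [0, 0, 0]), (1, 1, [1, 1, 1]), (1, 2, [2, 2, 2]),
    (1, 3, [3, 3, 3]), (1, 4, [4, 4, 4]),   # bulk insert
    (2, 1, [10, 10, 10]),                   # single update
    (3, 2, [11, 11, 11]),                   # single update
    (4, 1, [1, 1, 1]), (4, 2, [2, 2, 2]),   # bulk update
    (5, 1, None), (5, 2, None),             # bulk delete
]


def state_at(log: List[Entry], timestamp: int) -> Dict[int, List[float]]:
    """Replay all entries with ts <= timestamp and return the visible vectors."""
    state: Dict[int, List[float]] = {}
    for ts, external_id, vector in sorted(log, key=lambda entry: entry[0]):
        if ts > timestamp:
            break
        if vector is None:
            state.pop(external_id, None)  # deletion hides the vector
        else:
            state[external_id] = vector   # insert or overwrite
    return state


for ts in range(1, 6):
    print(ts, state_at(update_log, ts))
```

In this model, the view at timestamp 2 holds [10, 10, 10] for external ID 1, while the view at timestamp 5 contains only external IDs 0, 3, and 4.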

Inspect versions

Output the directory structure of the index and observe the contents of the updates array. Inside its __commits subfolder, you can see files whose names start with __1_1, __2_2, …, __5_5. These prefixes are the timestamps at which the updates happened. The __fragments subfolder contains similarly named subfolders, which hold the actual update data. This is how TileDB physically stores the different versions, which in turn enables time traveling, that is, querying the index as it was at different times in the past.

!tree {index_uri}

/Users/stavrospapadopoulos/versioning
├── __group
│   └── __1724608507499_1724608507499_3345227fc6da26d2e9386532f43cb54b_2
├── __meta
│   ├── __1724608507479_1724608507479_6dd3e2433d2485f8b02cdc5e28e91736
│   ├── __1724608507484_1724608507484_353b5f0e946ce752b892b8362b173b6c
│   └── __1724608560243_1724608560243_34c55b90113f77f04bebd17fcac0af55
├── __tiledb_group.tdb
├── partition_centroids
│   ├── __commits
│   ├── __fragment_meta
│   ├── __fragments
│   ├── __labels
│   ├── __meta
│   └── __schema
│       ├── __1724608507489_1724608507489_000000015447c9d37dfb10f81b6ce243
│       └── __enumerations
├── partition_indexes
│   ├── __commits
│   ├── __fragment_meta
│   ├── __fragments
│   ├── __labels
│   ├── __meta
│   └── __schema
│       ├── __1724608507496_1724608507496_34e4b3cf65a352f9d8e27086c53dbc4e
│       └── __enumerations
├── shuffled_vector_ids
│   ├── __commits
│   ├── __fragment_meta
│   ├── __fragments
│   ├── __labels
│   ├── __meta
│   └── __schema
│       ├── __1724608507497_1724608507497_620b0bf4fa18de40bad37745455b3300
│       └── __enumerations
├── shuffled_vectors
│   ├── __commits
│   ├── __fragment_meta
│   ├── __fragments
│   ├── __labels
│   ├── __meta
│   └── __schema
│       ├── __1724608507497_1724608507497_620b0bf58c04a0e996459c62ceac1ab0
│       └── __enumerations
└── updates
    ├── __commits
    │   ├── __1_1_1d886fd15f254f28a094a84c5fe65ad5_22.wrt
    │   ├── __2_2_39958cd58eb54c0e40e0e55b35cb9309_22.wrt
    │   ├── __3_3_26ab69c27ee4ce242e5ede33c134d090_22.wrt
    │   ├── __4_4_032663bf324763b5b792c4b7d9a1edf3_22.wrt
    │   └── __5_5_70fc83cbd0c7437ac2d09d6de1647a24_22.wrt
    ├── __fragment_meta
    ├── __fragments
    │   ├── __1_1_1d886fd15f254f28a094a84c5fe65ad5_22
    │   │   ├── __fragment_metadata.tdb
    │   │   ├── a0.tdb
    │   │   ├── a0_var.tdb
    │   │   └── d0.tdb
    │   ├── __2_2_39958cd58eb54c0e40e0e55b35cb9309_22
    │   │   ├── __fragment_metadata.tdb
    │   │   ├── a0.tdb
    │   │   ├── a0_var.tdb
    │   │   └── d0.tdb
    │   ├── __3_3_26ab69c27ee4ce242e5ede33c134d090_22
    │   │   ├── __fragment_metadata.tdb
    │   │   ├── a0.tdb
    │   │   ├── a0_var.tdb
    │   │   └── d0.tdb
    │   ├── __4_4_032663bf324763b5b792c4b7d9a1edf3_22
    │   │   ├── __fragment_metadata.tdb
    │   │   ├── a0.tdb
    │   │   ├── a0_var.tdb
    │   │   └── d0.tdb
    │   └── __5_5_70fc83cbd0c7437ac2d09d6de1647a24_22
    │       ├── __fragment_metadata.tdb
    │       ├── a0.tdb
    │       ├── a0_var.tdb
    │       └── d0.tdb
    ├── __labels
    ├── __meta
    └── __schema
        ├── __1724608507498_1724608507498_53b879f0f93cbd926f9f3f872d31e121
        └── __enumerations

48 directories, 35 files
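If you want to read the timestamps programmatically rather than by eye, the following sketch parses names of the form shown under updates/__commits and updates/__fragments. It assumes the layout __<start>_<end>_<uuid>_<number>, with an optional file extension, as seen in the listing above. Treating the trailing number as a format version is an assumption, and FragmentName and parse_fragment_name are hypothetical helpers, not part of any TileDB API.

```python
import re
from typing import NamedTuple


class FragmentName(NamedTuple):
    start_ts: int        # first timestamp covered by the write
    end_ts: int          # last timestamp covered by the write
    uuid: str            # unique identifier of the write
    format_version: int  # trailing number; assumed to be a format version


# Matches e.g. "__1_1_1d886fd15f254f28a094a84c5fe65ad5_22" with or without ".wrt"
_NAME_PATTERN = re.compile(r"^__(\d+)_(\d+)_([0-9a-f]+)_(\d+)(?:\.\w+)?$")


def parse_fragment_name(name: str) -> FragmentName:
    """Split a commit or fragment name into its components."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized fragment name: {name!r}")
    start, end, uuid, version = match.groups()
    return FragmentName(int(start), int(end), uuid, int(version))


# Names copied from the updates/__commits listing above
commit_files = [
    "__1_1_1d886fd15f254f28a094a84c5fe65ad5_22.wrt",
    "__2_2_39958cd58eb54c0e40e0e55b35cb9309_22.wrt",
    "__3_3_26ab69c27ee4ce242e5ede33c134d090_22.wrt",
    "__4_4_032663bf324763b5b792c4b7d9a1edf3_22.wrt",
    "__5_5_70fc83cbd0c7437ac2d09d6de1647a24_22.wrt",
]

timestamps = [parse_fragment_name(name).start_ts for name in commit_files]
print(timestamps)
```

To run this against a real index, you could replace the hard-coded list with os.listdir(os.path.join(index_uri, "updates", "__commits")).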

Clean up

Finally, clean up by removing the index:

# Clean up
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)

What’s next?

Now that you understand the basics of TileDB versioning, you should read the following tutorials: