TileDB-Vector-Search allows you to consolidate vector indexes to boost performance after numerous updates and deletions.
How to run this tutorial
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This will allow you to experiment quickly, avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
This tutorial shows how you can consolidate your vector indexes with TileDB-Vector-Search, which can boost query performance, especially after you have performed numerous additions, updates, and deletions on your index. We recommend reading the following sections before proceeding with this tutorial:
First, import the appropriate libraries, set the index URI, and delete any previously created data.
# Import necessary libraries
import os
import shutil

import numpy as np

import tiledb.vector_search as vs

# Set the index URI for this tutorial
index_uri = os.path.expanduser("~/consolidation")

# Clean up previous data
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)
Next, create an empty index:
# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    uri=index_uri,
    dimensions=3,
    partitions=3,
    vector_type=np.dtype(np.float32)
)
Populate index
Perform a series of updates, starting with adding some new vectors in bulk. Note that here you will provide timestamp = 1 as a parameter to the update_batch command. In other tutorials, this parameter was omitted, and the timestamp defaulted to the current time (in milliseconds since the Unix epoch).
# Prepare some vectors to add
update_vectors = np.empty([5], dtype=object)
update_vectors[0] = np.array([0, 0, 0], dtype=np.dtype(np.float32))
update_vectors[1] = np.array([1, 1, 1], dtype=np.dtype(np.float32))
update_vectors[2] = np.array([2, 2, 2], dtype=np.dtype(np.float32))
update_vectors[3] = np.array([3, 3, 3], dtype=np.dtype(np.float32))
update_vectors[4] = np.array([4, 4, 4], dtype=np.dtype(np.float32))

# Add the vectors to the index, specifying a timestamp (1)
index.update_batch(
    vectors=update_vectors,
    external_ids=np.array([0, 1, 2, 3, 4]),
    timestamp=1
)
Next, update the values of some existing vectors one by one at timestamp = 2 and timestamp = 3, respectively. Then perform a bulk update at timestamp = 4.
# Delete the vectors with external ids 1 and 2, but at a later timestamp
index.delete_batch(external_ids=[1, 2], timestamp=5)
Output the directory structure of the index and observe the contents of the updates array. Inside its __commits subfolder, you can see files that start with __1_1, __2_2, …, __5_5, which encode the timestamps at which these updates happened. The __fragments subfolder contains similarly named subfolders holding the actual update data. This is how TileDB physically stores the different versions, which facilitates time traveling, i.e., querying the index at different times in the past.
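If you do not have a `tree`-like command handy, a small helper can print the layout with Python's standard library alone. This is a minimal sketch; the demo directory structure created below is illustrative, and on the tutorial's index you would call `print_tree(index_uri)` instead.

```python
import os
import tempfile


def print_tree(root: str) -> None:
    """Print a directory tree rooted at `root`, one entry per line."""
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        print("    " * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            print("    " * (depth + 1) + name)


# Demo on a throwaway structure that imitates an updates array layout
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "updates", "__commits"))
open(os.path.join(root, "updates", "__commits", "__1_1_demo.wrt"), "w").close()
print_tree(root)
```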
Despite the flexibility to query your index in the past, numerous updates and deletions can adversely impact performance. You can mitigate this by consolidating your index.
index = index.consolidate_updates()
Inspect the index folders again, and note that the updates array now contains a single update commit and a single update fragment. In addition, TileDB-Vector-Search created a new fragment and commit in each of the following directories: