Vector Index Updates

ai/ml

vector search

foundation

updates

Learn about TileDB-Vector-Search support for updates.

TileDB-Vector-Search supports vector update (in the form of insertion of new vectors and modification of existing vectors) and delete operations for all indexing algorithms. In addition, it offers storage optimizations via consolidation, as well as versioning and time traveling.

We achieve this by managing updates outside of the primary indexing algorithm storage structure and periodically consolidating them with the primary index. This happens by merging the updated vectors and the primary index vectors and re-indexing, writing the new version of the primary index at a different timestamp in the underlying TileDB arrays.

This follows the log-structured system architecture that typically consists of:

Primary storage: This is a storage structure optimized for reads.
Update storage: This is a storage structure optimized for writes/updates. This is maintaining all updates not yet merged into the primary storage.
Query implementation: An integrated view of primary and update storage for read operations.
Consolidation: A process that merges updates into the primary storage.
Versioning and time traveling: This is the process of being able to query the vector index in the past (i.e., before certain updates and deletions happened).

Primary storage

In this case, this corresponds to the vector indexing algorithms implemented by TileDB-Vector-Search to provide efficient vector similarity search queries.

Update storage

We use a TileDB sparse array (updates), indexed by the vector external_ids for efficient write operations of deletes and updates. All updates and deletes are recorded in this array and periodically merged to primary storage after consolidation.

Query implementation

We execute queries by simultaneously querying the primary index and update storage and merging the results. Queries over the primary index use one of the corresponding vector search algorithms, while a FLAT query is applied to the vectors read from the update storage. The results are merged by removing deleted or updated vectors from the primary results and maintaining the top results of the union of the 2 queries.

Consolidation

We implement an asynchronous process to periodically merge the update storage data into the primary indexes.

This happens by merging the updated vectors and the primary index vectors and re-indexing, writing the new version of the primary index at a different timestamp in the underlying TileDB arrays. Note here that the previous version of the primary index is not deleted and can be later used for performing time-traveling operations.

As consolidation is a full re-indexing of the union of vectors it can perform all the regular ingestion processing of the specific algorithm. For example, consolidation for an IVF_FLAT index retrains k-means and reshuffles all the vectors based on the new centroids. We also have some optimizations to avoid index retraining in consecutive consolidation executions, however the user can pass any ingestion argument to the consolidation method to fine-tune its execution.

Consolidation can be triggered either manually or automatically, based on heuristics to avoid query degradation as more updates are accumulated.

Versioning and time traveling

TileDB-Vector-Search allows you to query the index at past timestamps. Every update operation effectively marks a new timestamped version. Every consolidation process creates a new optimized view of the index, while retaining the past versions of the index before the separate vector in updates was merged. TileDB-Vector-Search also provides you with a way to delete past versions, in order to avoid unnecessary inflation of the storage space.