Primary storage
In this case, this corresponds to the vector indexing algorithms implemented by TileDB-Vector-Search to provide efficient vector similarity search queries.
TileDB-Vector-Search supports vector update (in the form of insertion of new vectors and modification of existing vectors) and delete operations for all indexing algorithms. In addition, it offers storage optimizations via consolidation, as well as versioning and time traveling.
We achieve this by managing updates outside of the primary indexing algorithm storage structure and periodically consolidating them with the primary index. This happens by merging the updated vectors and the primary index vectors and re-indexing, writing the new version of the primary index at a different timestamp in the underlying TileDB arrays.
This follows the log-structured system architecture that typically consists of:
In this case, this corresponds to the vector indexing algorithms implemented by TileDB-Vector-Search to provide efficient vector similarity search queries.
We use a TileDB sparse array (updates
), indexed by the vector external_ids
for efficient write operations of deletes and updates. All updates and deletes are recorded in this array and periodically merged to primary storage after consolidation.
We execute queries by simultaneously querying the primary index and update storage and merging the results. Queries over the primary index use one of the corresponding vector search algorithms, while a FLAT
query is applied to the vectors read from the update storage. The results are merged by removing deleted or updated vectors from the primary results and maintaining the top results of the union of the 2 queries.
We implement an asynchronous process to periodically merge the update storage data into the primary indexes.
This happens by merging the updated vectors and the primary index vectors and re-indexing, writing the new version of the primary index at a different timestamp in the underlying TileDB arrays. Note here that the previous version of the primary index is not deleted and can be later used for performing time-traveling operations.
As consolidation is a full re-indexing of the union of vectors it can perform all the regular ingestion
processing of the specific algorithm. For example, consolidation for an IVF_FLAT
index retrains k-means and reshuffles all the vectors based on the new centroids. We also have some optimizations to avoid index retraining in consecutive consolidation executions, however the user can pass any ingestion
argument to the consolidation method to fine-tune its execution.
Consolidation can be triggered either manually or automatically, based on heuristics to avoid query degradation as more updates are accumulated.
TileDB-Vector-Search allows you to query the index at past timestamps. Every update operation effectively marks a new timestamped version. Every consolidation process creates a new optimized view of the index, while retaining the past versions of the index before the separate vector in updates
was merged. TileDB-Vector-Search also provides you with a way to delete past versions, in order to avoid unnecessary inflation of the storage space.