Introduction
Vector search is a useful feature for finding similar or relevant assets in a collection (such as PDFs, images, video, audio, and a large spectrum of domain-specific data). TileDB is architected around multi-dimensional arrays and, hence, it is ideal for offering native vector search support, since a vector is simply a 1D array.
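For instance, a single 128-dimensional embedding can be modeled as a 1D dense array with a float32 attribute. The snippet below is a minimal sketch using the TileDB-Py API; the array URI, dimension size, and attribute name are purely illustrative.

```python
import numpy as np
import tiledb

uri = "my_vector"  # illustrative array URI (could also be a cloud object store path)

# A 1D dense array with 128 cells and a single float32 attribute,
# i.e., one 128-dimensional vector.
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="i", domain=(0, 127), tile=128, dtype=np.int32)
    ),
    attrs=[tiledb.Attr(name="v", dtype=np.float32)],
    sparse=False,
)
tiledb.Array.create(uri, schema)

# Write a vector and read it back.
vector = np.random.rand(128).astype(np.float32)
with tiledb.open(uri, "w") as arr:
    arr[:] = vector
with tiledb.open(uri, "r") as arr:
    print(arr[:]["v"])
```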
Why TileDB for Vector Search
Vector search (often called “similarity search” or “nearest neighbor search”) has been around for a very long time. However, with the advent of Generative AI and the proliferation of large language models (LLMs), vector search has become extremely relevant, as it is a core building block of Generative AI infrastructure. As a result, scores of new specialized databases, called “vector databases”, were created, while most existing databases added vector search functionality.
TileDB is an array database, and its main strength is that it can morph into practically any data modality and application, delivering unprecedented performance and alleviating the data infrastructure pains in an organization. A vector is just a 1D array and, therefore, TileDB is the most natural database choice for delivering amazing vector search functionality.
We spent many years building a powerful array-based engine, which allowed us to pretty quickly enhance our database with vector search capabilities in a new open-source library called TileDB-Vector-Search. Coupled with TileDB Cloud, TileDB offers the following benefits:
- Performance: Vectors are arrays, and arrays are a native data structure in TileDB. As such, TileDB delivers spectacular performance, even under extreme scenarios.
- Serverless: All deployment modes supported by TileDB are serverless, even when you need to outgrow your local machine and scale in the cloud. This has a direct impact on the operational cost, which TileDB minimizes.
- Cloud-native: TileDB is optimized for object stores. As such, you can scale your vectors to cloud storage, while enjoying superb performance. Again, this leads to significant cost savings.
- Multiple modalities: TileDB is not just a vector database. TileDB is designed to store, manage, and analyze all your data, including the original raw data from which you generate your vectors, as well as any other data for which your organization might require a powerful database. Storing multiple data modalities in a single system has the following benefits:
- It lowers your licensing costs.
- Having one system manage all data modalities simplifies your infrastructure and reduces data engineering.
- Data isn’t siloed. Instead, TileDB takes a sane, holistic governance approach across all your data and code assets.
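To give a flavor of how this comes together, here is a minimal sketch of ingesting and querying vectors with the TileDB-Vector-Search Python package. The function and parameter names (ingest, index_type, input_vectors, query, k) follow the library’s quickstart at the time of writing; treat the exact signatures, return values, and the index URI as assumptions and consult the API Reference for your installed version.

```python
import numpy as np
import tiledb.vector_search as vs

index_uri = "/tmp/my_ivf_flat_index"  # illustrative; could also be a cloud URI

# A stand-in dataset: 10,000 random 128-dimensional float32 vectors.
vectors = np.random.rand(10_000, 128).astype(np.float32)

# Ingest the vectors and build an IVF_FLAT index (parameter names assumed
# from the library's documented ingest() helper).
index = vs.ingest(
    index_type="IVF_FLAT",
    index_uri=index_uri,
    input_vectors=vectors,
)

# Find the 10 approximate nearest neighbors of a single query vector.
# query() is assumed to return (distances, ids) for the k nearest neighbors.
queries = np.random.rand(1, 128).astype(np.float32)
distances, ids = index.query(queries, k=10)
print(ids)
```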
TileDB vs. other solutions
Many vector search solutions are available, from special-purpose databases (that is, databases built from scratch just for vector search) to libraries and extensions of existing databases. Here is a brief qualitative and quantitative comparison versus other solutions you might know:
- FAISS is an open-source library developed by Facebook AI Research that provides diverse indexing and querying algorithms for vector search. Its focus is performing efficient similarity search using main-memory indices and leveraging the processing capabilities of single-server CPU and GPU resources. Though FAISS offers excellent performance and algorithmic support, it lacks database features such as security, access control, auditing and storing raw objects and metadata. In the “single-server, main memory” setting, TileDB outperforms the respective algorithm implementations of FAISS by up to 8x for 1 billion vectors.
- Milvus is an open-source vector database developed by Zilliz. It focuses on providing scalable and efficient vector search functionality in a database setting, while also storing extra objects and metadata. With its new version, Milvus 2.0, it targets a cloud-native distributed architecture. While offering database features for vector search, Milvus targets only high-QPS, low-latency use cases using a distributed in-memory infrastructure that incurs significant maintenance effort and static infrastructure costs. Also, its support for storing extra data and raw objects is not designed to handle the high-dimensional objects often involved in vector search (images, video, audio, etc.). In the “single-server, main memory” setting, TileDB outperforms the respective algorithm implementations of Milvus by up to 3x for 1 million vectors.
- Pinecone is a cloud-based vector database service that is proprietary and optimized for real-time applications and machine learning workloads. It excels in delivering low-latency and high-throughput vector search capabilities. Similar to Milvus, it offers a distributed in-memory infrastructure hosted in the cloud, with added database features for vector search. However, Pinecone may not provide optimal performance for storing raw multi-dimensional objects, and it can incur significant static costs for infrastructure setup, even in scenarios where a high QPS setup is not necessary. Pinecone does not have an open-source offering and does not participate in the ann-benchmarks.
- Chroma is an open-source vector database built on NumPy (for the Euclidean, cosine, and inner-product distance metrics) and hnswlib, with several database backends for storage. It is focused on text search for LLM use cases and provides integrations with several Python libraries. At present, Chroma provides an in-process Python API, and the Chroma company is developing a distributed implementation. Chroma does not yet participate in ann-benchmarks, and we have not benchmarked it independently.
- Activeloop is a company that specializes in storing multi-modal data, such as audio, video, image, and point cloud data, and making it accessible to machine learning processing pipelines. It has developed a product called DeepLake, a data lake specifically designed for deep learning. Activeloop also offers a vector search product designed to work in a serverless manner over cloud store and disk data. While it embraces “tensor” formats for storing raw objects, its data format is less generic than the multi-dimensional array format of TileDB. It also embraces a serverless-compute, cloud-native storage model for vector search, but, based on our understanding, its implementation is quite primitive (without any sophisticated indexing and query algorithm implementations, and without any knobs to trade accuracy for cost and performance). In the “single-server, out-of-core” setting, TileDB outperforms the respective algorithm implementations of Activeloop by more than 10x.
- pgvector is an open-source vector search extension for PostgreSQL, supporting IVF_FLAT indexing with several distance metrics (Euclidean, cosine, inner product). All pgvector data transfer happens through SQL, which is likely to be a performance limitation in large-scale use. TileDB provides approximately 100x higher QPS at 95% recall vs. pgvector on the ann-benchmarks.
- DiskANN is an open-source software library developed by Microsoft to support search problems that are too large to fit in the memory of a single machine. It is still a single-machine system, but it stores indices and partitioned full-precision vectors on a solid-state drive (SSD), while PQ-compressed vectors are kept in memory. The primary indexing approach is the graph-based Vamana algorithm, which improves over other graph-based algorithms such as HNSW and NSG. Adaptations of the indexing and search strategies can be used with the DiskANN framework for different kinds of applications. As with other library approaches, database concerns are outside the scope of DiskANN.
- Qdrant is an open-source system implemented in Rust and backed by the startup Qdrant Solutions GmbH. It features a client-server architecture with REST and gRPC endpoints, and is available both for self-hosted usage and as a managed service. Qdrant supports only the HNSW index algorithm, but adds filtering via the capability to attach (indexed) JSON “payload” data to vectors; it does not provide support for storage or slicing of array or multi-modal data. TileDB-Vector-Search provides approximately 5x the QPS of Qdrant in batch mode at 95% recall.
- Weaviate is an open-source system implemented in Go and backed by the startup Weaviate B.V. It features a client-server architecture with REST and GraphQL endpoints, available for self-hosted usage or as a managed service. Weaviate implements indexing with HNSW (optionally with product quantization to reduce memory usage) and includes integrations to generate vectors with several popular models. Weaviate does not support storage or slicing of other data types. TileDB-Vector-Search provides approximately 10x the QPS of Weaviate in batch mode at 95% recall.
Section organization
The rest of the Vector Search section is organized as follows:
- Quickstart: The best way to get started with TileDB’s vector search functionality. You will learn how to run basic examples.
- Foundation: All the background information and internal mechanics of vector search in TileDB. Learning these will give you a deep understanding of TileDB’s vector search technology and help you maximize the value you get from TileDB.
- Tutorials: A series of tutorials covering all aspects of TileDB vector search, from basic ingestion to massively scalable computations. These tutorials require no prior knowledge of TileDB vector search and will take you from first steps to power user.
- API Reference: A list of all the TileDB-Vector-Search functionality across the various programming languages it supports, enabling fast lookups on API usage.