Vector Databases
A vector database, like any database that manages other data types, is responsible for:
- Storing the vector data in some format, on some storage medium.
- Managing updates, potentially offering versioning and time traveling.
- Enforcing authentication, access control, and logging.
- Performing efficient vector search using various algorithms, distances, and indexes.
Vector databases differ in how they implement some or all of the above four points, as well as, of course, in performance (typically measured in terms of latency and queries per second, or QPS). Here is where things start getting confusing. Today, you hear a lot about special-purpose vector databases, designed specifically to store vectors and perform vector search; this category includes Pinecone, Milvus, Weaviate, Qdrant, and Chroma. However, there are also libraries (not full-fledged databases) that offer similar search functionality, such as FAISS, spanning both open-source and closed-source options.
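To make the library route concrete, here is a minimal sketch using the open-source FAISS library on synthetic data; the dimensionality, dataset sizes, and index parameters are arbitrary choices for illustration, not recommendations.

```python
# Minimal FAISS sketch: one exact index and one approximate index over random vectors.
import numpy as np
import faiss

d = 128                                              # vector dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # "database" vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

# Exact (brute-force) search under Euclidean distance.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
dists, ids = flat.search(xq, 5)                      # 5 nearest neighbors per query

# Approximate search with an inverted-file (IVF) index: faster, slightly less accurate.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)          # 256 coarse clusters
ivf.train(xb)                                        # learn the coarse clustering
ivf.add(xb)
ivf.nprobe = 8                                       # clusters scanned per query (speed/recall knob)
dists_approx, ids_approx = ivf.search(xq, 5)
```

The flat index scans every vector and is exact; the IVF index trades a little recall for speed, which is the typical dial that all of these systems expose in one form or another.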
And here is where it gets worse: due to the recent hype around vector databases, pretty much every database system out there (be it tabular, key-value, time series, you name it) has started offering “vector search capabilities” as part of its product. This is easy for them to do: they can store vectors as blobs, build a couple of indexes on top, and add operators that perform vector search. Compared to the colossal software they have already built (developing a database is extremely difficult), adding vector search is a straightforward task, since no proprietary IP exists around the vector search algorithms.
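As a toy illustration of why this is straightforward (and emphatically not the code of any particular database), the sketch below decodes vectors stored as binary blobs and runs a brute-force nearest-neighbor scan with NumPy; the rows and the `knn` helper are made up for the example.

```python
import numpy as np

# Pretend these (id, blob) pairs came out of a BLOB column in an existing database.
rows = [
    (1, np.float32([0.1, 0.9, 0.0, 0.2]).tobytes()),
    (2, np.float32([0.8, 0.1, 0.3, 0.4]).tobytes()),
    (3, np.float32([0.2, 0.75, 0.1, 0.1]).tobytes()),
]

def knn(query, k=2):
    """Return the ids of the k rows closest to `query` by Euclidean distance."""
    ids = np.array([row_id for row_id, _ in rows])
    vecs = np.stack([np.frombuffer(blob, dtype=np.float32) for _, blob in rows])
    dists = np.linalg.norm(vecs - query, axis=1)
    return ids[np.argsort(dists)[:k]]

print(knn(np.float32([0.15, 0.8, 0.05, 0.15])))   # -> [3 1]
```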
So, how do you choose a vector database from the myriad of emerging options? For example, you can probably go a very long way just by using the open-source FAISS library, but you won’t get enterprise-grade security features. On the other hand, you can use Pinecone or Milvus, but to scale to billions of vectors you will probably need numerous machines up and running constantly, causing your operational cost to skyrocket. In other words, what you need to understand is the deployment setting behind each solution, which determines both performance and cost.
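Some back-of-the-envelope arithmetic shows why keeping everything in memory gets expensive at the billion-vector scale; the dimensionality and machine size below are assumptions for illustration, not measurements of any specific product.

```python
num_vectors = 1_000_000_000          # 1 billion vectors
dim = 768                            # e.g., a common text-embedding dimensionality
bytes_per_float = 4                  # float32

raw_bytes = num_vectors * dim * bytes_per_float
print(f"Raw vectors alone: {raw_bytes / 2**40:.1f} TiB")            # ~2.8 TiB, before any indexes

ram_per_machine_gib = 256            # hypothetical machine size
machines = raw_bytes / (ram_per_machine_gib * 2**30)
print(f"Machines needed just for the raw vectors: {machines:.0f}")  # ~11, plus index overhead
```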
Here are the main deployment settings:
- Single-server, in-memory: All the vectors and indexes reside in the RAM of a single machine, which therefore needs enough memory to hold the entire vector dataset. This typically yields the best performance, but targets relatively small datasets. An example of this is FAISS.
- Single-server, out-of-core: All vectors reside on the local storage of a single machine, whereas the indexes (or a portion of them) can reside in RAM. A query is processed by progressively bringing data from disk into RAM (a toy sketch of this pattern follows the list). This targets larger datasets, but you are still constrained by the capacity of a single machine, and performance is impacted by the disk-to-memory IO. An example of this is DiskANN.
- Multi-server, in-memory: To overcome the RAM limits of a single machine, the database in this category runs distributed over multiple machines, whose combined memory holds the entire vector dataset and indexes. Performance is quite high since all data is always processed directly in main memory. However, this can get extremely expensive for larger datasets, as you may need to scale to numerous machines. An example of this is Pinecone.
- Serverless, cloud store: All the data is stored on a cloud object store (e.g., Amazon S3). The database has infrastructure to process any query in a way that appears “serverless” to the user (i.e., the user does not need to set up or even specify any machines to run the query). Performance depends on how well the underlying storage engine is optimized for cloud object stores and on the serverless infrastructure as a whole, but latency has a higher floor because an entire cloud infrastructure now sits between the user and the data. An example of this is Activeloop.
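To illustrate the out-of-core pattern mentioned above, here is a toy sketch (this is not DiskANN’s actual algorithm, which relies on a graph index) that memory-maps a vector file and streams fixed-size chunks from disk into RAM while maintaining a running top-k; the file name and sizes are arbitrary.

```python
import numpy as np

dim, n, chunk_rows = 64, 200_000, 50_000

# One-time setup: write a synthetic dataset to disk (stands in for the database's files).
data = np.memmap("vectors.f32", dtype=np.float32, mode="w+", shape=(n, dim))
data[:] = np.random.rand(n, dim)
data.flush()

def knn_out_of_core(query, k=10):
    """Exact k-NN computed by streaming fixed-size chunks from disk into RAM."""
    vecs = np.memmap("vectors.f32", dtype=np.float32, mode="r", shape=(n, dim))
    best_ids = np.empty(0, dtype=np.int64)
    best_dists = np.empty(0, dtype=np.float32)
    for start in range(0, n, chunk_rows):
        chunk = np.asarray(vecs[start:start + chunk_rows])    # disk -> RAM copy
        dists = np.linalg.norm(chunk - query, axis=1)
        # Merge this chunk's candidates into the running top-k.
        best_dists = np.concatenate([best_dists, dists])
        best_ids = np.concatenate([best_ids, np.arange(start, start + len(chunk))])
        order = np.argsort(best_dists)[:k]
        best_ids, best_dists = best_ids[order], best_dists[order]
    return best_ids, best_dists

ids, dists = knn_out_of_core(np.random.rand(dim).astype(np.float32))
print(ids, dists)
```

Only one chunk ever needs to fit in RAM, which is the essence of the out-of-core setting; real systems add on-disk indexes so they can avoid scanning every chunk.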
TileDB natively supports vector search, removing the need for a special-purpose vector search library or database. It offers a variety of algorithms, distance metrics, and deployment methods.