Vector Search Quickstart

Learn how to install TileDB-Vector-Search, build an index, and perform basic searches with this guided tutorial.

How to run this tutorial

We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This lets you experiment quickly and avoid the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. To run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.

This is a quickstart tutorial on the basic vector search capabilities of TileDB.

Note

If running this tutorial locally, you’ll need to install TileDB-Vector-Search.

Setup

You will ingest the small (10k) SIFT dataset from the Datasets for approximate nearest neighbor search site. You will download a mirrored copy of the dataset from the TileDB-Vector-Search repo on GitHub. The original source can be found here: ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz.

First, import the necessary libraries, set up the URIs you will use throughout the tutorial, and clean up any previously created directories and vector indexes.

# Import necessary libraries
import os
import shutil
import tarfile
import urllib.request

import numpy as np
import tiledb
import tiledb.vector_search as vs
from tiledb.vector_search.utils import load_fvecs, load_ivecs

# The URIs for the data to download and ingest
data_uri = "https://github.com/TileDB-Inc/TileDB-Vector-Search/releases/download/0.0.1/siftsmall.tgz"
data_filename = "siftsmall.tar.gz"
data_dir = os.path.expanduser("~/sift10k/")
local_data_path = os.path.join(data_dir, data_filename)
index_uri = os.path.expanduser("~/sift10k_flat")

# Clean up previous data
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)

Next, download and untar the source dataset.

# Create a directory to store the source dataset
os.makedirs(data_dir, exist_ok=True)

# Download the file that contains the vector dataset
urllib.request.urlretrieve(data_uri, local_data_path)

# Untar the file into the data directory; the context manager closes the archive
with tarfile.open(local_data_path, "r:gz") as tar:
    tar.extractall(os.path.dirname(local_data_path), filter="fully_trusted")
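
The extraction step can also be tried in isolation, without network access. The following is a minimal sketch and not part of the tutorial itself: it builds a tiny throwaway tarball in a temporary directory (the file name `example.fvecs` and its contents are made up for illustration) and extracts it inside a context manager so the archive is closed afterward. It assumes the more restrictive `"data"` extraction filter is acceptable for the archive, and it falls back to the default behavior on Python versions that predate extraction filters.

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory so nothing in the home directory is touched
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "example.fvecs")  # hypothetical file name
archive_path = os.path.join(work_dir, "example.tar.gz")
extract_dir = os.path.join(work_dir, "extracted")

# Create a small placeholder file and pack it into a gzip-compressed tarball
with open(src_path, "wb") as f:
    f.write(b"placeholder bytes")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(src_path, arcname="example.fvecs")

# Extract it; the context manager closes the archive when done
os.makedirs(extract_dir, exist_ok=True)
with tarfile.open(archive_path, "r:gz") as tar:
    if hasattr(tarfile, "data_filter"):
        # Extraction filters are available: reject unsafe members
        tar.extractall(extract_dir, filter="data")
    else:
        # Older Python without extraction filters
        tar.extractall(extract_dir)

extracted_files = sorted(os.listdir(extract_dir))
print(extracted_files)
```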

Ingest

Once the tarball has been extracted, ingest the data into a FLAT index (the simplest index in TileDB-Vector-Search) as follows:

index = vs.ingest(
    index_type="FLAT",
    index_uri=index_uri,
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    source_type="FVEC",
)
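
The `FVEC` source type corresponds to the `.fvecs` layout used by the SIFT datasets: each vector is stored as a little-endian 32-bit integer holding its dimensionality, followed by that many 32-bit floats. The sketch below is illustrative only and does not show how TileDB-Vector-Search reads the file; `write_fvecs` and `read_fvecs` are hypothetical helper names, and the vectors are made-up values.

```python
import os
import struct
import tempfile


def write_fvecs(path, vectors):
    """Write vectors as records of <int32 dim><dim x float32>."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack(f"<{len(vec)}f", *vec))


def read_fvecs(path):
    """Read every vector from an .fvecs file into a list of lists."""
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


# Round-trip two made-up 4-dimensional vectors
path = os.path.join(tempfile.mkdtemp(), "toy.fvecs")
original = [[0.0, 16.0, 35.0, 5.0], [1.0, 2.0, 3.0, 4.0]]
write_fvecs(path, original)
loaded = read_fvecs(path)
print(loaded)
```

The `.ivecs` ground truth files follow the same record layout, with 32-bit integers in place of the floats.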

Inspect

This ingestion creates a group with three arrays, which you can display as shown below. You can also inspect the schema of the array that stores the vectors and print the contents of the first vector.

# Show the physical group
group = tiledb.Group(index_uri, "r")
print("Index physical contents:\n")
print(group)

# Prepare the index for reading
index = vs.FlatIndex(index_uri)

# Open the vector array to inspect it
print("Vector array URI:", index.db_uri, "\n")
A = tiledb.open(index.db_uri)

# Print the schema of the vector array
print("Vector array schema:\n")
print(A.schema)

# Print the first vector
print("Contents of first vector:\n")
print(A[:, 0]["values"])

Index physical contents:

sift10k_flat GROUP
|-- shuffled_vector_ids ARRAY
|-- shuffled_vectors ARRAY
|-- updates ARRAY

Vector array URI: file:///home/jovyan/sift10k_flat/shuffled_vectors 

Vector array schema:

ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 127), tile=128, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 2147483647), tile=250000, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

Contents of first vector:

[  0.  16.  35.   5.  32.  31.  14.  10.  11.  78.  55.  10.  45.  83.
  11.   6.  14.  57. 102.  75.  20.   8.   3.   5.  67.  17.  19.  26.
   5.   0.   1.  22.  60.  26.   7.   1.  18.  22.  84.  53.  85. 119.
 119.   4.  24.  18.   7.   7.   1.  81. 106. 102.  72.  30.   6.   0.
   9.   1.   9. 119.  72.   1.   4.  33. 119.  29.   6.   1.   0.   1.
  14.  52. 119.  30.   3.   0.   0.  55.  92. 111.   2.   5.   4.   9.
  22.  89.  96.  14.   1.   0.   1.  82.  59.  16.  20.   5.  25.  14.
  11.   4.   0.   0.   1.  26.  47.  23.   4.   0.   0.   4.  38.  83.
  30.  14.   9.   4.   9.  17.  23.  41.   0.   0.   2.   8.  19.  25.
  23.   1.]
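
The schema above has a `rows` dimension covering the 128 vector components and a `cols` dimension indexing the vectors, so each column holds one vector and `A[:, 0]` reads the first one. As a rough analogy only (an in-memory NumPy array with made-up data, not a TileDB array), the same one-vector-per-column layout can be sketched as follows:

```python
import numpy as np

dimensions = 128   # matches the 'rows' domain (0, 127) in the schema
num_vectors = 5    # made-up; the tutorial dataset has 10k vectors

# One vector per column, stored in column-major (Fortran) order
rng = np.random.default_rng(0)
matrix = np.asfortranarray(
    rng.integers(0, 120, size=(dimensions, num_vectors)).astype(np.float32)
)

# Slicing all rows of column 0 yields the first vector
first_vector = matrix[:, 0]
print(first_vector.shape)
print(matrix.flags["F_CONTIGUOUS"])
```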

Similarity search

To run similarity search on the ingested vectors, load the query vectors and the ground truth from the siftsmall dataset you downloaded. Note that you can use any vector to query this dataset. load_fvecs and load_ivecs are auxiliary functions in tiledb.vector_search.utils that fetch the query vectors and the ground truth, respectively. Then, run the query and print the IDs of the result vectors and their corresponding distances to the query as follows:

# Get query vectors with ground truth
query_vectors = load_fvecs(os.path.join(data_dir, "siftsmall_query.fvecs"))
ground_truth = load_ivecs(os.path.join(data_dir, "siftsmall_groundtruth.ivecs"))

# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with FLAT
result_d, result_i = index.query(qv, k=100)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)

Result vector ids:

[[8578 8275 5332 9153 2092 3290 2010 8004 2949 9190 6784 5756 1721 1075
  1411 6994 4120 7529 3994 6878 4844 3969 3313 7945  796 2135 2861 5957
   621  905 5014 9429 7287 4675 9172 2399 1923  902 7555 2986 1976  852
  2834 5247 4545 4231 3431 4614 8526 4045 2987 4602   23 7018 1555 4637
  2156 6058 7331 9181 1084 8985 8838   15  871 4150 9270 1061 9440 8290
   710 4615 4616 1018 7426 2728 4542 7278 6979 7731 4782 5133 6660 1499
  7569 5267  758 4867 8826 4085 2948 8382  745 8832 3499 2954 4046 3430
   650 8654]]

Result vector distances:

[[ 86269.  89329.  94088. 100973. 101066. 102972. 107926. 108293. 111591.
  112274. 112447. 113508. 113748. 114415. 114491. 114537. 115424. 115560.
  116875. 116975. 118403. 119806. 120557. 121629. 122677. 125305. 125616.
  125944. 126107. 126544. 126683. 126729. 126942. 128506. 129325. 129838.
  130042. 130087. 130679. 130800. 131003. 131052. 131130. 131456. 131467.
  131750. 132446. 132604. 132875. 133198. 133221. 133335. 134265. 134771.
  134950. 135413. 135516. 135832. 136118. 136148. 136659. 136714. 137088.
  137152. 137454. 137496. 137521. 137964. 138023. 138146. 138153. 138241.
  138431. 138725. 138745. 139058. 139211. 139432. 139443. 139488. 139812.
  139873. 139966. 140168. 140243. 140267. 140554. 140935. 141092. 141348.
  141843. 141988. 142590. 142847. 142921. 143102. 143338. 143578. 143755.
  143887.]]
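
Conceptually, an exhaustive search compares the query against every stored vector and keeps the k closest. The following NumPy sketch illustrates that idea on made-up data. It is not the TileDB-Vector-Search implementation, `brute_force_knn` is a hypothetical helper, and it assumes the distances are squared Euclidean (L2) distances, which the integer-valued results above suggest but this tutorial does not state.

```python
import numpy as np


def brute_force_knn(vectors, queries, k):
    """Return (distances, ids) of the k nearest vectors for each query.

    vectors: (n, d) array, queries: (q, d) array.
    Distances are squared Euclidean distances (an assumption of this sketch).
    """
    # Pairwise differences via broadcasting: shape (q, n, d)
    diff = queries[:, None, :] - vectors[None, :, :]
    dists = np.sum(diff * diff, axis=2)  # shape (q, n)
    # Sort each row by distance and keep the k smallest
    ids = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, ids, axis=1), ids


# Made-up data: four 2-dimensional vectors and a single query
vectors = np.array([[0, 0], [3, 4], [1, 1], [10, 10]], dtype=np.float32)
queries = np.array([[2, 1]], dtype=np.float32)

toy_d, toy_i = brute_force_knn(vectors, queries, k=3)
print(toy_i)
print(toy_d)
```

The helper returns distances first and IDs second to mirror the `result_d, result_i` order used in the query above. Materializing the full `(q, n, d)` difference array is fine for a toy example but would be wasteful for large datasets.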

You can check the result against the ground truth:

# For FLAT, the following will always be true
np.all(result_i == ground_truth[query_id])

True
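
The comment above singles out FLAT. With approximate index types, the returned IDs may differ from the ground truth, so an exact equality check is less useful. A common alternative is recall@k: the fraction of the true k nearest neighbors that appear in the returned IDs. The sketch below uses made-up ID lists, and `recall_at_k` is a hypothetical helper, not a TileDB-Vector-Search function.

```python
def recall_at_k(result_ids, true_ids, k):
    """Fraction of the true top-k IDs that appear in the returned top-k IDs."""
    returned = set(result_ids[:k])
    expected = set(true_ids[:k])
    return len(returned & expected) / k


# Made-up example: 4 of the 5 true neighbors were found
true_ids = [8578, 8275, 5332, 9153, 2092]
approx_ids = [8578, 5332, 8275, 2092, 1234]

recall = recall_at_k(approx_ids, true_ids, k=5)
print(recall)
```

Because the comparison uses sets, the order of the returned IDs does not affect the score.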

You can also run batches of searches, which TileDB-Vector-Search implements efficiently:

# Simply provide more than one query vector
result_d, result_i = index.query(np.array([query_vectors[5], query_vectors[6]]), k=100)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)

Result vector ids:

[[1097 1239 4943 3227  804 2607 4060 4443 4246 3112  535 4445 2312 2945
  1403 1707 1896 1626 7132 6767 9690 8666 3699 3460 1532 2329 8844 4370
  2115 3180  660 3652  771 1068 9608 2791 9072 4296 9048 3506  288 2753
  1445 7837 4222 4051  609 9176 2880 1205 4413 3241 3286 4044 3837 7913
  1241 3614 6039 7669 2378  169 5815 3141 1218 3365  703 1108 7322  121
  6585 9999 6358 1863 7214  693 2571 3065 4272 2129 8867  151  226 3649
   172 9083 7348  861 1254 2953 3062   96 1834 5995 7169 2616 2381  754
  9781 8289]
 [2456 3013 1682 8581 2774 3530  924 2732 9701 1916 3687 1036 4248 4094
  9885 2638 8000 1151 1174 2839 3609 2176 9651 3996 3943 8284 8642 3249
  3856 3954 8517 1468 2107 3854 1323 3623 8886 8773 3548 3543  694 2435
  3298  683 2038 1623 1038 2702 3136 3138 9629 9542  287 3713  421 1918
  2890 3248 3794  108 8848 3292 1322 9520 1150 1467 4475 7873 9523 2388
  8478 2648 1666 3645 3009 1295 2755 3877 4182 1454 9541 2626  403 1795
   641  495 6854  300 2065 1064 2357 3515 1479 1963 2695  611 7680 3615
  2741 9568]]

Result vector distances:

[[ 75708.  82658.  86330.  90315.  92913.  97025.  98149.  99507. 101123.
  101335. 102143. 102765. 105347. 105865. 105876. 106125. 107297. 108639.
  108952. 109650. 110999. 111563. 111846. 112377. 112639. 112784. 112972.
  113673. 113792. 114105. 114171. 114212. 114392. 114654. 114770. 115035.
  115367. 115861. 116037. 116315. 117064. 117306. 117612. 117996. 118100.
  118345. 118381. 118654. 118674. 118895. 119739. 119951. 120017. 120092.
  120118. 120183. 120388. 120540. 120812. 120846. 120880. 120898. 120957.
  121048. 121216. 121217. 121521. 121632. 121930. 121972. 122014. 122176.
  122184. 122205. 122456. 122608. 122687. 122739. 122853. 122857. 123119.
  123281. 123403. 123421. 123421. 123634. 123636. 123883. 124033. 124419.
  124561. 124654. 124714. 124805. 124820. 125131. 125166. 125236. 125396.
  125444.]
 [ 60816.  63916.  65436.  65590.  66738.  66853.  71142.  71437.  71957.
   72362.  74836.  75104.  76538.  76855.  77058.  77269.  77729.  78293.
   78324.  79111.  79411.  79676.  80593.  81027.  81190.  81661.  81889.
   82067.  82201.  82737.  83202.  83305.  83928.  84052.  84165.  84571.
   85152.  85184.  85243.  85304.  85402.  85424.  85721.  86005.  86059.
   87233.  87678.  87688.  87769.  88583.  88643.  88982.  89315.  89420.
   89483.  89612.  89949.  90075.  90196.  90294.  90599.  91287.  91588.
   91590.  91799.  91830.  91831.  91835.  91856.  92012.  92245.  92391.
   92411.  92515.  92834.  93214.  93303.  93314.  93340.  93407.  93445.
   93576.  94182.  94193.  94226.  94260.  94278.  94405.  94492.  94582.
   94807.  94958.  95089.  95260.  95315.  95339.  95406.  95536.  96005.
   96154.]]

Next steps

Now that you have a basic understanding of how to ingest data and run similarity search queries using TileDB-Vector-Search, you can continue learning about TileDB-Vector-Search:

The foundation docs explain how TileDB has implemented vector search.
The tutorials cover the broad functionality and use cases of vector search in TileDB.
The API reference provides more information about the usage of TileDB-Vector-Search.