Ingestion & Querying

Learn how to ingest basic vector data with TileDB-Vector-Search and perform similarity search.

How to run this tutorial

We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This lets you experiment quickly without any installation, deployment, or configuration work. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. To run tutorials locally on your machine instead, read the Tutorials: Running Locally tutorial.

In this tutorial, you will learn how to ingest a set of vectors into an IVF_FLAT index and perform basic similarity search.

Setup

You will ingest the small (10k) SIFT dataset from the Datasets for approximate nearest neighbor search site. You will download a mirrored copy of the dataset from the TileDB-Vector-Search repo on GitHub. The original source can be found here: ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz.

First, import the necessary libraries, set up the URIs you will use throughout the tutorial, and clean up any previously created directories and vector indexes.

# Import necessary libraries
import os
import shutil
import tarfile
import urllib.request

import numpy as np
import tiledb
import tiledb.vector_search as vs
from tiledb.vector_search.utils import load_fvecs, load_ivecs

# The URIs for the data to download and ingest
data_uri = "https://github.com/TileDB-Inc/TileDB-Vector-Search/releases/download/0.0.1/siftsmall.tgz"
data_filename = "siftsmall.tar.gz"
data_dir = os.path.expanduser("~/sift10k/")
local_data_path = os.path.join(data_dir, data_filename)
index_uri = os.path.expanduser("~/sift10k_ivf_flat")

# Clean up previous data
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)

Next, download and untar the source dataset.

# Create a directory to store the source dataset
os.makedirs(data_dir, exist_ok=True)

# Download the file that contains the vector dataset
urllib.request.urlretrieve(data_uri, local_data_path)

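# Untar the file, closing the archive when done. The `filter` argument
# requires Python 3.12+ (or an older release that backports extraction filters).
with tarfile.open(local_data_path, "r:gz") as tar:
    tar.extractall(os.path.dirname(local_data_path), filter="fully_trusted")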
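
The extracted files use the .fvecs and .ivecs layouts of the original corpus: each vector is stored as a 4-byte little-endian integer holding its dimensionality, followed by that many 4-byte components (float32 for .fvecs, int32 for .ivecs). This tutorial uses load_fvecs and load_ivecs from tiledb.vector_search.utils to parse them. The following is only a minimal standard-library sketch of the .fvecs layout; read_fvecs is a hypothetical helper name and not part of TileDB-Vector-Search.

```python
import struct


def read_fvecs(path, limit=None):
    """Read vectors from an .fvecs file (hypothetical helper, for illustration).

    Each record is: int32 dimension d, then d float32 values (little-endian).
    """
    vectors = []
    with open(path, "rb") as f:
        while limit is None or len(vectors) < limit:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) < 4:
                raise ValueError("truncated record header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector payload")
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors
```

For example, read_fvecs(os.path.join(data_dir, "siftsmall_base.fvecs"), limit=1) would return only the first base vector as a list of floats, which is a quick way to check the dimensionality of the data before ingesting it.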

Ingest

Once the tarball has been extracted, ingest the data into an IVF_FLAT index as follows:

index = vs.ingest(
    index_type="IVF_FLAT",
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    index_uri=index_uri,
    source_type="FVEC",
    partitions=100,
)

You can similarly create other types of indexes, such as FLAT and VAMANA. Visit the Key Concepts: Vector Indexing Algorithms, Key Concepts: Distance Metrics, and Key Concepts: Vector Search Performance sections to learn more about how to choose the right index for your use case and fine-tune it for maximum performance.
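
To build intuition for what the partitions argument controls at ingestion time, and what the nprobe query parameter used later in this tutorial controls at query time, here is a toy inverted-file sketch in plain Python. It is an illustration of the general IVF idea under simplifying assumptions (centroids are sampled from the data rather than trained, and distances are squared Euclidean); it is not the TileDB-Vector-Search implementation, and build_ivf, ivf_query, and sq_l2 are hypothetical names.

```python
import random


def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, partitions, seed=0):
    """Assign every vector to its nearest centroid (one inverted list each).

    Simplification: centroids are sampled from the data instead of being
    trained with k-means, which a real index would typically do.
    """
    rng = random.Random(seed)
    centroids = rng.sample(vectors, partitions)
    lists = [[] for _ in range(partitions)]
    for vid, vec in enumerate(vectors):
        nearest = min(range(partitions), key=lambda p: sq_l2(vec, centroids[p]))
        lists[nearest].append(vid)
    return centroids, lists


def ivf_query(vectors, centroids, lists, query, k, nprobe):
    """Search only the nprobe partitions whose centroids are closest to query."""
    order = sorted(range(len(centroids)), key=lambda p: sq_l2(query, centroids[p]))
    candidates = [vid for p in order[:nprobe] for vid in lists[p]]
    # Rank candidates by distance, breaking ties by id for determinism
    ranked = sorted(candidates, key=lambda vid: (sq_l2(query, vectors[vid]), vid))
    return ranked[:k]
```

In this sketch, setting nprobe equal to the number of partitions makes every vector a candidate, so the result matches an exhaustive search. A smaller nprobe scans fewer vectors but can miss true neighbors that fall in partitions that were not probed, which is why IVF-style results are approximate.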

Inspect

This ingestion creates a group containing several arrays (five in the output below), which you can list as shown. You can also inspect the schema of the array that stores the vectors and print the contents of the first vector.

# Show the physical group
group = tiledb.Group(index_uri, "r")
print("Index physical contents:\n")
print(group)

# Prepare the index for reading
index = vs.IVFFlatIndex(index_uri)

# Open the vector array to inspect it
print("Vector array URI:", index.db_uri, "\n")
A = tiledb.open(index.db_uri)

# Print the schema of the vector array
print("Vector array schema:\n")
print(A.schema)

# Print the first vector
print("Contents of first vector:\n")
print(A[:, 0]["values"])

Index physical contents:

sift10k_ivf_flat GROUP
|-- partition_centroids ARRAY
|-- partition_indexes ARRAY
|-- shuffled_vector_ids ARRAY
|-- shuffled_vectors ARRAY
|-- updates ARRAY

Vector array URI: file:///home/jovyan/sift10k_ivf_flat/shuffled_vectors 

Vector array schema:

ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 127), tile=128, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 2147483647), tile=125000, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

Contents of first vector:

[ 14.  14.  26.   1.   0.   0.  12.  12.   6.  21. 123.  12.   0.   0.
  10.   7.   0.   9. 112.  14.   1.   0.  12.   1.   0.  17.  96.   4.
   0.   0.   2.   0. 139.  49.   7.   0.   0.   1.  12.  24. 128.  99.
 139.  35.   0.   0.   0.   0.   0.  18. 139.  52.   1.   0.   0.   0.
   0.  54. 139.   3.   0.   0.   0.   0. 139.  52.   0.   0.   0.   0.
   1.  35. 139.  36.   8.   3.   0.   0.   0.  35.   2.   2.  28.  13.
  14.   6.   0.   3.  10.  25.  13.   2.   4.   3.   1.   1. 126.   0.
   0.   0.   0.   0.   0. 116.  89.   0.   0.   0.   0.   0.  12. 139.
   5.   1.   0.   0.   5.   7.  28.  21.   3.   2.   0.   0.   1.   3.
  12.   6.]

Similarity search

To run similarity search on the ingested vectors, load the query vectors and the ground truth from the siftsmall dataset you downloaded. Note that you can use any vector to query this dataset. load_fvecs and load_ivecs are auxiliary functions in tiledb.vector_search.utils that load the query vectors and the ground truth, respectively. Then run the query and print the IDs of the result vectors and their distances to the query as follows:

# Get query vectors with ground truth
query_vectors = load_fvecs(os.path.join(data_dir, "siftsmall_query.fvecs"))
ground_truth = load_ivecs(os.path.join(data_dir, "siftsmall_groundtruth.ivecs"))

# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with IVF_FLAT
result_d, result_i = index.query(qv, k=100, nprobe=10)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)

Result vector ids:

[[8578 8275 5332 9153 2092 3290 2010 8004 2949 9190 6784 5756 1721 1075
  1411 6994 4120 7529 3994 4844 3969 3313 7945  796 2135 2861 5957  621
   905 5014 9429 7287 4675 9172 1923  902 7555 2986 1976  852 2834 5247
  4545 4231 3431 4614 8526 4045 2987   23 7018 4637 2156 6058 7331 9181
  1084 8985 8838   15  871 4150 9270 1061 9440  710 4615 4616 1018 7426
  2728 4542 6979 7731 4782 5133 6660 1499 7569 5267  758 4867 8826 4085
  8382  745 3499 2954 4046 3430  650 8654 1407 3280 1103  867 4961 9904
   651 7146]]

Result vector distances:

[[ 86269.  89329.  94088. 100973. 101066. 102972. 107926. 108293. 111591.
  112274. 112447. 113508. 113748. 114415. 114491. 114537. 115424. 115560.
  116875. 118403. 119806. 120557. 121629. 122677. 125305. 125616. 125944.
  126107. 126544. 126683. 126729. 126942. 128506. 129325. 130042. 130087.
  130679. 130800. 131003. 131052. 131130. 131456. 131467. 131750. 132446.
  132604. 132875. 133198. 133221. 134265. 134771. 135413. 135516. 135832.
  136118. 136148. 136659. 136714. 137088. 137152. 137454. 137496. 137521.
  137964. 138023. 138153. 138241. 138431. 138725. 138745. 139058. 139211.
  139443. 139488. 139812. 139873. 139966. 140168. 140243. 140267. 140554.
  140935. 141092. 141348. 141988. 142590. 142921. 143102. 143338. 143578.
  143755. 143887. 144321. 144422. 144779. 144913. 145619. 145698. 145724.
  145986.]]

You can check the result against the ground truth:

# For FLAT, the following will always be true,
# but for IVF_FLAT it might not be (as it is an approximate algorithm).
# np.all replaces np.alltrue, which was removed in NumPy 2.0.
np.all(result_i == ground_truth[query_id])

False
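
An exact element-wise match is a strict test for an approximate index: a single missing or reordered neighbor makes it False. A more informative measure is recall, the fraction of true nearest neighbors that the query returned. The sketch below is a hypothetical helper (recall_at_k is not part of TileDB-Vector-Search) and assumes each ground truth row lists neighbor IDs in order of increasing distance and contains at least k entries.

```python
def recall_at_k(result_ids, truth_ids, k):
    """Fraction of the true top-k neighbors that appear in the returned top-k."""
    returned = {int(i) for i in list(result_ids)[:k]}
    expected = {int(i) for i in list(truth_ids)[:k]}
    if not expected:
        raise ValueError("ground truth is empty")
    return len(returned & expected) / len(expected)
```

With the variables from this tutorial, recall_at_k(result_i[0], ground_truth[query_id], k=100) would give a value between 0 and 1. Raising nprobe generally increases recall at the cost of scanning more partitions.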

You can also run a batch of queries in a single call, which TileDB-Vector-Search implements efficiently:

# Simply provide more than one query vector
result_d, result_i = index.query(
    np.array([query_vectors[5], query_vectors[6]]), nprobe=10, k=100
)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)

Result vector ids:

[[1097 1239 3227  804 2607 4443 4246 3112  535 4445 2312 2945 1403 1707
  1896 1626 7132 6767 9690 8666 3699 3460 1532 2329 8844 4370 2115 3180
  3652  771 1068 9608 2791 4296 9048 3506  288 2753 1445 7837 4222 4051
   609 9176 1205 4413 3241 3286 4044 3837 7913 3614 6039 7669 2378 5815
  3141 1218 3365  703  121 6585 6358 1863 7214  693 2571 4272 8867  151
  3649  172 9083 7348  861 1254 3062   96 1834 5995 7169 2616 2381  754
  9781 7087 1353 2706 6418  645 3061  806 1935 3391 2104  302 3682 3429
  4447 3914]
 [2456 3013 1682 8581 2774 3530  924 2732 9701 1916 3687 1036 4248 4094
  9885 2638 8000 1151 1174 2839 3609 2176 9651 3996 3943 8642 3249 3856
  3954 1468 2107 3854 1323 3623 8886 8773 3548  694 2435 3298  683 2038
  1623 1038 2702 3136 3138 9629 9542  287 3713  421 1918 3794  108 8848
  3292 1322 9520 1150 1467 4475 7873 9523 2388 8478 2648 1666 3645 3009
  1295 2755 3877 4182 1454 2626  403 1795  641 2065 1064 2357 3515 1479
  1963 7680 3615 2741 2109 2002 9212  630 9539  981 9715  411 2488  219
  3829  942]]

Result vector distances:

[[ 75708.  82658.  90315.  92913.  97025.  99507. 101123. 101335. 102143.
  102765. 105347. 105865. 105876. 106125. 107297. 108639. 108952. 109650.
  110999. 111563. 111846. 112377. 112639. 112784. 112972. 113673. 113792.
  114105. 114212. 114392. 114654. 114770. 115035. 115861. 116037. 116315.
  117064. 117306. 117612. 117996. 118100. 118345. 118381. 118654. 118895.
  119739. 119951. 120017. 120092. 120118. 120183. 120540. 120812. 120846.
  120880. 120957. 121048. 121216. 121217. 121521. 121972. 122014. 122184.
  122205. 122456. 122608. 122687. 122853. 123119. 123281. 123421. 123421.
  123634. 123636. 123883. 124033. 124561. 124654. 124714. 124805. 124820.
  125131. 125166. 125236. 125396. 125511. 126821. 126937. 126980. 127041.
  127109. 127233. 127465. 127512. 127951. 128167. 128523. 128531. 128549.
  128567.]
 [ 60816.  63916.  65436.  65590.  66738.  66853.  71142.  71437.  71957.
   72362.  74836.  75104.  76538.  76855.  77058.  77269.  77729.  78293.
   78324.  79111.  79411.  79676.  80593.  81027.  81190.  81889.  82067.
   82201.  82737.  83305.  83928.  84052.  84165.  84571.  85152.  85184.
   85243.  85402.  85424.  85721.  86005.  86059.  87233.  87678.  87688.
   87769.  88583.  88643.  88982.  89315.  89420.  89483.  89612.  90196.
   90294.  90599.  91287.  91588.  91590.  91799.  91830.  91831.  91835.
   91856.  92012.  92245.  92391.  92411.  92515.  92834.  93214.  93303.
   93314.  93340.  93407.  93576.  94182.  94193.  94226.  94492.  94582.
   94807.  94958.  95089.  95260.  95406.  95536.  96005.  96203.  96640.
   96655.  96673.  96877.  97202.  97345.  97403.  97573.  97577.  97655.
   97697.]]

Clean up

Finally, clean up by removing the downloaded data directory and the index:

# Clean up
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)

Indexes on TileDB Cloud

Note the following when creating indexes and registering them on TileDB Cloud:

When creating a new index, you need to use a URI in the form tiledb://<your_username>/<S3_path>/<index_name>, where S3_path is the location on S3 where you wish to physically store the index.
When referring to the index after creating it (e.g., when submitting queries), use a URI in the form tiledb://<your_username>/<index_name> (i.e., no need to specify the S3 physical path anymore).
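
The two URI forms can be sketched as a small string helper. cloud_index_uris is a hypothetical name used only for illustration; it does not contact TileDB Cloud, and the username, S3 path, and index name in the usage example are placeholders.

```python
def cloud_index_uris(username, s3_path, index_name):
    """Return (creation_uri, access_uri) for an index on TileDB Cloud.

    Hypothetical helper: it only concatenates strings following the two
    URI forms described above and does not contact TileDB Cloud.
    """
    creation_uri = f"tiledb://{username}/{s3_path.rstrip('/')}/{index_name}"
    access_uri = f"tiledb://{username}/{index_name}"
    return creation_uri, access_uri
```

For example, cloud_index_uris("alice", "s3://my-bucket/indexes", "cloud_index") yields "tiledb://alice/s3://my-bucket/indexes/cloud_index" for creating the index and "tiledb://alice/cloud_index" for every later reference to it.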

Set up the URIs and clean up any previously created index with the same name on TileDB Cloud.

# Import the TileDB Cloud client (it is not loaded by `import tiledb` alone)
import tiledb.cloud

# Get your username
username = tiledb.cloud.user_profile().username

# Get the bucket from an environment variable
s3_bucket = os.environ["S3_BUCKET"]

# Set index URI
index_name = "cloud_index"
index_uri = "tiledb://" + username + "/" + index_name
index_reg_uri = "tiledb://" + username + "/" + s3_bucket + "/" + index_name

# The TileDB Cloud context
ctx = tiledb.cloud.Ctx()

# Clean up index if it already exists
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(index_uri, recursive=True)

Create the index, using the tiledb://<your_username>/<S3_path>/<index_name> URI format explained above:

# Create an index, where the dimensionality of each vector is 3,
# the type of the vector values is float32, and the index will
# use 3 partitions.
index = vs.ivf_flat_index.create(
    ctx=ctx,
    uri=index_reg_uri,
    dimensions=3,
    partitions=3,
    vector_type=np.dtype(np.float32),
)

From this point onwards, you can populate and query the index using only the tiledb://<your_username>/<index_name> URI format explained above.

The following will delete the physical files of the index and deregister it from TileDB Cloud.

# Delete the index and deregister it, if it exists
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(index_uri, recursive=True)

What’s next?

Now that you know how to perform basic ingestion and similarity search, you are ready to learn how to perform updates and deletions.