Result Estimation

TileDB can estimate the size of a query result, so you can allocate buffers for a particular query correctly.

When reading from sparse arrays, or reading variable-length attributes from dense or sparse arrays, you cannot know how large the result will be without executing the query. How, then, should you size your buffers before passing them to TileDB? TileDB can return an estimated result size for any attribute without executing the query, so getting the estimate is faster than running the query itself. The trade-off is accuracy: buffers sized from the estimate may still be too small, which leaves the query incomplete. You should therefore always check the query status, even when you allocate your buffers based on the result estimate.
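To illustrate why the status check matters, here is a toy model in plain Python. It does not use the TileDB API: the function name `read_all`, the status strings, and the buffer-doubling policy are all assumptions made for this sketch. It only shows that an undersized buffer still yields the full result as long as you keep reading until the status is no longer incomplete.

```python
def read_all(cells, buffer_bytes):
    """Toy stand-in for an incomplete-query loop (NOT the TileDB API).

    Each pass copies whole cells into a buffer of at most buffer_bytes.
    The status stays "INCOMPLETE" while cells remain unread.
    """
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    results, pos, passes = [], 0, 0
    status = "INCOMPLETE"
    while status == "INCOMPLETE":
        used, start = 0, pos
        # Copy whole cells until the next one would overflow the buffer.
        while pos < len(cells) and used + len(cells[pos]) <= buffer_bytes:
            used += len(cells[pos])
            results.append(cells[pos])
            pos += 1
        if pos == start and pos < len(cells):
            # Not even one cell fits: grow the buffer before retrying
            # (doubling is an arbitrary choice for this sketch).
            buffer_bytes *= 2
        passes += 1
        status = "COMPLETE" if pos == len(cells) else "INCOMPLETE"
    return results, passes


cells = [b"aaa", b"bb", b"cccc", b"d"]  # 10 bytes in total
print(read_all(cells, 10))  # buffer fits everything: one pass
print(read_all(cells, 5))   # undersized buffer: same cells, two passes
```

In this model, a caller that stopped after the first pass with a 5-byte buffer would silently lose the last two cells.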

Warning

The number of bytes returned is an estimate and may not be divisible by the datatype size. Round the value up to a multiple of the datatype size where necessary to make sure the query works.
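The rounding described in this warning can be sketched in plain Python. The helper name `round_up_to_multiple` is hypothetical and not part of the TileDB API. It only shows how to round a byte estimate up to a whole number of elements.

```python
def round_up_to_multiple(nbytes: int, itemsize: int) -> int:
    """Round nbytes up to the nearest multiple of itemsize (hypothetical helper)."""
    if itemsize <= 0:
        raise ValueError("itemsize must be positive")
    # Integer ceiling division, then scale back to bytes.
    return -(-nbytes // itemsize) * itemsize


# An estimate of 803 bytes for an 8-byte datatype (for example, int64)
# needs 808 bytes, which holds 101 whole elements.
print(round_up_to_multiple(803, 8))  # 808
print(round_up_to_multiple(800, 8))  # 800 (already a multiple of 8)
```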

First, import the necessary libraries, set the array URI (that is, its path, which in this tutorial will be on local storage), and delete any previously created arrays with the same name.

Python

# Import necessary libraries
import os.path
import shutil

import numpy as np
import tiledb

# Set array URI
array_uri = os.path.expanduser("~/result_estimation_python")

# Delete array if it already exists
if os.path.exists(array_uri):
    shutil.rmtree(array_uri)

Next, create the array and write data to it. This example uses a sparse array, but the result estimation functionality described here applies to any array.

Python

# The array will have 100 cells along a single dimension "x".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 99), tile=100, dtype=np.int64))

# The array will be sparse with a single string-typed attribute "a"
schema = tiledb.ArraySchema(
    domain=dom, sparse=True, attrs=[tiledb.Attr(name="a", dtype=str)]
)

# Create the (empty) array on disk.
tiledb.SparseArray.create(array_uri, schema)

# Write data to the array
with tiledb.open(array_uri, mode="w") as A:
    extent = A.schema.domain.dim("x").domain
    ncells = extent[1] - extent[0] + 1

    # Data is the Latin alphabet with varying repeat lengths
    data = [chr(i % 26 + 97) * (i % 52) for i in range(ncells)]

    # Coords are the dimension range
    coords = np.arange(extent[0], extent[1] + 1)

    A[coords] = data

Calculate the result estimate.

Python

# Create query object:
with tiledb.open(array_uri) as A:
    iterable = A.query(return_incomplete=True).multi_index[:]
    # then call `estimated_result_sizes`, which will return an
    # OrderedDict of {'result name': estimate}
    print(iterable.estimated_result_sizes())

{'x': EstimatedResultSize(offsets_bytes=0, data_bytes=800), 'a': EstimatedResultSize(offsets_bytes=800, data_bytes=2454)}
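As a sanity check, the numbers in this output can be reproduced with plain arithmetic for this particular dataset. The sketch below assumes 8-byte `int64` coordinates, one 8-byte offset per variable-length cell, and one byte per character (which holds for the ASCII letters written above). In this small example the estimate coincides with the exact sizes. In general it may not.

```python
ncells = 100

# Same strings as written to attribute "a" above.
data = [chr(i % 26 + 97) * (i % 52) for i in range(ncells)]

coord_bytes = ncells * 8   # one int64 coordinate per cell (assumed 8 bytes)
offset_bytes = ncells * 8  # one offset per variable-length cell (assumed 8 bytes)
data_bytes = sum(len(s.encode("utf-8")) for s in data)

print(coord_bytes, offset_bytes, data_bytes)  # 800 800 2454
```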

Finally, clean up by deleting the array.

Python

# Delete the array
if os.path.exists(array_uri):
    shutil.rmtree(array_uri)