Glossary


A complete glossary of terms related to TileDB and all supported solutions.

A

Academy

A learning platform containing all the information (explanation of product features, tutorials, foundational concepts, and more) needed to make you a TileDB superuser and extract maximum value for you and your organization.

Account

A personal namespace created for an individual user in TileDB Cloud. Because an account is for personal use, other users cannot be given access to the account namespace. By default, each user has a personal namespace.

Aggregate

A function built into TileDB that computes a result over qualifying data in parallel inside TileDB, rather than passing the data to an external system for processing. Examples of aggregates are count, sum, min, max, null count, and mean.
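As a rough illustration of what these aggregates compute, here is a minimal sketch in plain Python. It does not use the TileDB API; the `aggregate` helper is hypothetical, and it assumes that count includes null cells while the other aggregates skip them.

```python
from statistics import fmean

def aggregate(values):
    """Compute the aggregates named above over one attribute's qualifying values."""
    # None stands in for a null value in this sketch.
    present = [v for v in values if v is not None]
    return {
        "count": len(values),                       # assumption: nulls are counted
        "null_count": len(values) - len(present),
        "sum": sum(present),
        "min": min(present),
        "max": max(present),
        "mean": fmean(present),
    }

result = aggregate([4, None, 10, 1])
```

TileDB computes such results next to the data, so only the small result, not the qualifying cells, leaves the storage engine.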

API token

Also known as a REST API token, a string of alphanumeric characters and symbols. You use REST API tokens to authenticate to TileDB Cloud as a specific user and perform certain actions, depending on the scope set when you created the token.

Array

A multi-dimensional collection of data values stored in TileDB that allows for complex data structures and efficient access.

Array domain

The hyperspace defined by the dimension domains.

Array metadata

Key-value pairs of arbitrary data users can attach to an array, where the key is a string and the value can be any type.

Array schema

An object that stores details about the array definition, such as the attributes, dimensions, tile capacity and extent, and the tile and cell order.

Asset

Any object in TileDB Cloud, including the following:

Arrays
Files
VCF datasets
SOMA experiments
Biomedical Imaging datasets
Vector Search indexes
Notebooks
Dashboards
UDFs
Task Graphs
ML Models
Groups

Attribute

The values stored within each cell, represented by a key (attribute name) and a value.

B

Binary variant call format (BCF)

Binary version of a VCF file. Note: a BCF file is not a tabix bgzipped VCF file, although it provides the necessary indices for TileDB ingestion.

C

Cell

An ordered tuple of dimension domain values (following the order in which the array dimensions were specified during array creation), called coordinates.

Cell order

The order in which cells within each space tile are written to disk. Cell orders can be row-major or column-major.

Cloud credentials

Credentials used to access resources on S3-compatible object stores. Cloud credentials are configured in your account settings.

Column-major order

An order for tiles and cells where, assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, the leftmost coordinate index “varies the fastest”.
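To make "varies the fastest" concrete, the following sketch lists the coordinates of a small 2 x 3 grid in column-major order. It is plain Python for illustration only; the `column_major` helper is hypothetical and not part of TileDB.

```python
from itertools import product

def column_major(shape):
    """List coordinates so that the leftmost index varies the fastest."""
    # product() varies its last argument fastest, so iterate over the
    # reversed shape and flip each tuple back into (dim0, dim1, ...) order.
    reversed_ranges = (range(n) for n in reversed(shape))
    return [tuple(reversed(c)) for c in product(*reversed_ranges)]

cells = column_major((2, 3))
# cells == [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```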

Commit

A file TileDB creates to signify the successful creation of a fragment. Commits are eligible for consolidation and vacuuming.

Consolidation

A process in which TileDB combines various fragments, fragment metadata, commits, array metadata, or group metadata within the array into a smaller number of objects for faster reads at the cost of less granularity for time traveling. Consolidation is usually followed by vacuuming and is process-safe (i.e., you can run consolidation while read and write operations are happening on the array without the risk of losing data).

Configuration

A set of customizable key-value pairs containing the settings TileDB uses when it opens (creates an instance of) an array. When you pass a configuration object into a context object, and then pass that context object into the method that opens (instantiates) your array, those configuration settings affect the performance and overall functionality of that array instance until you close the array.

Context

An object containing a configuration object that gives TileDB instructions on how to configure the instantiation of an array.

Coordinates

The coordinates of an array cell are an ordered tuple of dimension domain values that identifies the cell. In dense arrays, the coordinates of each cell are unique. In sparse arrays, the same coordinates may appear more than once.

Cosine distance

The distance of two vectors computed by calculating the cosine of the angle between those two vectors. In specific vector spaces, cosine similarity is a measure of vector similarity.
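A minimal sketch of the computation in plain Python follows. It assumes the common convention that cosine distance equals one minus cosine similarity; the function names are illustrative and not taken from TileDB.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    # Assumption: distance = 1 - similarity, so parallel vectors are at
    # distance 0 and orthogonal vectors are at distance 1.
    return 1.0 - cosine_similarity(a, b)

orthogonal = cosine_distance((1.0, 0.0), (0.0, 1.0))
parallel = cosine_distance((1.0, 2.0), (2.0, 4.0))
```

Because only the angle matters, scaling either vector leaves the result unchanged.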

D

Dashboard

An app built on top of a notebook allowing you to visualize your data and analyses.

Data tile

A subset of cell values on a particular attribute. Data tiles are also known as the atomic unit of I/O and compression.

Default context

The context to be used for all operations that accept a context as a parameter for the lifecycle of the program. This means that closing and reopening an array will use the same context and, consequently, the same configuration.

Deletion

In the context of sparse arrays, a deletion is a type of write operation that stores delete conditions to logically represent that data has been removed from an array, without altering any past fragments. TileDB materializes deletions (permanently deletes the data matching the delete condition) as soon as you run consolidation and vacuuming.

Dense array

A type of TileDB array where all possible data points are stored, allowing for efficient storage and retrieval.

Dimension

Dimensions are data structures that comprise the hyperspace of the array. Each dimension has a name, a data type, and a domain of values (not to be confused with the array domain).

Dimension domain

The range of possible values a dimension can be, applicable to numeric domains.

Distance function

A function that computes the distance between two vectors. Different types of distance functions exist, including the L2 distance function (Euclidean distance function), the cosine distance, and the inner product distance.

E

Embedding

A vector generated by a machine learning model given an external input object. Different machine learning models can be used to generate embeddings for different object types (e.g., image, text, audio, video). Embedding similarity (vector distance) is expected to approximate the similarity of the input objects, so vector search approximates object similarity search.

Empty cell

A cell that contains no attribute data.

Euclidean distance function

Refer to L2 distance function.

Expression quantitative trait loci (eQTL) analysis

A quantitative trait loci (QTL) analysis measures the association between variants and a phenotype. An expression quantitative trait loci (eQTL) analysis is a QTL analysis in which the phenotype is expression.

F

Fill values

Attribute values given to empty cells in a dense array.
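The idea can be sketched in plain Python as follows. The `read_dense` helper and the fill value of -1 are hypothetical choices for illustration, not TileDB defaults.

```python
def read_dense(written, shape, fill_value):
    """Return a dense 2D result where unwritten cells take fill_value."""
    rows, cols = shape
    return [
        [written.get((r, c), fill_value) for c in range(cols)]
        for r in range(rows)
    ]

written = {(0, 0): 7, (1, 2): 3}  # only two cells were ever written
grid = read_dense(written, shape=(2, 3), fill_value=-1)
# grid == [[7, -1, -1], [-1, -1, 3]]
```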

Filter

A data transformation on an attribute or dimension, such as compression or encryption, that TileDB applies to data tiles before it writes those data tiles to disk.

FLAT

A straightforward algorithm implementation for Vector Search that provides exact vector similarity search by computing the distance between the query vector and every dataset vector.
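The exhaustive approach can be sketched in a few lines of plain Python. This is an illustration of brute-force search under the assumption of L2 distance, not the TileDB Vector Search implementation; `flat_search` is a hypothetical name.

```python
import math

def flat_search(query, vectors, k):
    """Return indices of the k vectors closest to query by L2 distance."""
    # Compare the query against every dataset vector: exact but O(n) per query.
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return [i for _, i in distances[:k]]

dataset = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
nearest = flat_search((1.0, 1.0), dataset, k=2)
# nearest == [1, 3]
```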

Fragment

A timestamped portion of data within a TileDB array that represents a snapshot or subset of the array’s content at a specific time.

Fragment metadata

System-specific, non-user-editable information about a fragment, including whether the fragment is dense or sparse, the non-empty domain of the fragment, the tile offsets, and the tile sizes. For sparse fragments, the fragment metadata includes an R-tree of the fragment data.

Frontier data

Data for bleeding-edge scientific applications that will power current and future generations of scientific discovery. Today, this includes population genomics, biomedical imaging, proteomics, metabolomics, single-cell, and spatial transcriptomics, all data that wasn’t collectable even 50 years ago. In the future, it will include brand new -omics fields and new areas of discovery.

G

Genome-wide association study (GWAS)

Genome-wide association studies identify genetic variants associated with complex traits or diseases across the entire genome.

Genomic VCF (gVCF)

A genomic VCF (gVCF) file, usually limited to one sample, is a VCF file containing variant positions; reference spans, which define long genomic intervals where the sample matches the reference sequence; and no-call spans, where the sequencing depth is too low to make a confident genotype call. TileDB-VCF recognizes these gVCF reference spans during queries and export.

Global cell order

A mapping from the multi-dimensional cell space to the 1-dimensional physical storage space (i.e., it is the order in which TileDB stores cell values on disk). It comprises the tile order and cell order.
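As an illustration, the sketch below computes the global order for a 4 x 4 array with 2 x 2 space tiles, assuming a row-major tile order and a row-major cell order. The helper name and its parameters are hypothetical; this is plain Python, not TileDB code.

```python
def global_order_position(row, col, tile_rows=2, tile_cols=2, n_cols=4):
    """Map a 2D cell to its 1D position: row-major tiles, row-major cells."""
    tiles_per_row = n_cols // tile_cols
    # Tile order: which space tile the cell falls in.
    tile_index = (row // tile_rows) * tiles_per_row + (col // tile_cols)
    # Cell order: the cell's position inside that tile.
    cell_index = (row % tile_rows) * tile_cols + (col % tile_cols)
    return tile_index * (tile_rows * tile_cols) + cell_index

all_cells = [(r, c) for r in range(4) for c in range(4)]
order = sorted(all_cells, key=lambda rc: global_order_position(*rc))
# The first tile is stored in full before the second tile begins:
# order[:8] == [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```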

Group

An asset within TileDB that creates a logical, hierarchical storage of other TileDB assets, including other groups. Groups are advantageous for objects that may exist in cloud object stores, which have no concept of a directory.

I

Incomplete query

A query where the result size of a subarray is larger than the allocated buffers that will hold the result. TileDB gracefully handles this case via result estimation and subarray partitioning.

Inner product distance

The distance between two vectors computed by the inner product of two vectors. Also known as dot product, it is a measure of vector similarity in specific vector spaces.
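For reference, the inner product itself is a one-line computation. The sketch below is plain Python with an illustrative function name.

```python
def inner_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

value = inner_product((1, 2, 3), (4, 5, 6))
# value == 32
```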

Internal allele frequency (IAF)

The frequency of an allele observed in a TileDB variant store, taking into account reference calls, no-calls, and polyallelic sites.

Interoperability

The capability of different software systems and tools to work together seamlessly, often facilitated by standardized data formats and APIs.

IVF_FLAT

A Vector Search algorithm implementation based on \(k\)-means clustering, where TileDB computes \(k\) separate clusters (as well as their centroids) of the dataset vectors, shuffling them in such a way that vectors appear adjacent on storage. Answering a query involves focusing only on a small number of clusters, based on the query’s proximity to their centroids.
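The query side of this idea can be sketched in plain Python. The sketch assumes the centroids have already been produced by \(k\)-means (the training step is skipped), uses L2 distance, and introduces a hypothetical `nprobe` parameter for the number of clusters to search. It is not the TileDB implementation.

```python
import math

def build_partitions(vectors, centroids):
    """Group vector indices by their nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(v, centroids[c]))
        partitions[nearest].append(i)
    return partitions

def ivf_flat_search(query, vectors, centroids, partitions, k, nprobe):
    """Search only the nprobe partitions whose centroids are closest to query."""
    by_proximity = sorted(range(len(centroids)),
                          key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in by_proximity[:nprobe] for i in partitions[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

vectors = [(0.0, 0.0), (0.5, 0.2), (9.0, 9.0), (10.0, 10.5)]
centroids = [(0.25, 0.1), (9.5, 9.75)]  # assumed output of k-means with k = 2
partitions = build_partitions(vectors, centroids)
result = ivf_flat_search((9.2, 9.1), vectors, centroids, partitions, k=1, nprobe=1)
# result == [2]
```

Probing fewer clusters is faster but can miss true neighbors that sit in a cluster that was not probed, which is why this kind of search is approximate.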

L

L2 distance function

Also known as Euclidean distance, the length of the line segment between two vectors. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
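A minimal sketch of the calculation in plain Python (the function name is illustrative):

```python
import math

def l2_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = l2_distance((0, 0), (3, 4))
# d == 5.0, the hypotenuse of a 3-4-5 right triangle
```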

Large language model (LLM)

A Machine Learning model that takes input data, typically a request in natural language, and outputs a response in natural language.

M

Machine Learning (ML)

A subfield of artificial intelligence (AI) focused on developing algorithms that allow computers to learn and improve from experience without being explicitly programmed, imitating intelligent human behavior.

Marketplace

The Marketplace is a collection of publicly available assets uploaded to TileDB by other TileDB users. You can view high-level details about each asset without an account, but you must be logged in to TileDB in order to take actions on these assets or use them programmatically. Users who publish items on the Marketplace have the option to monetize any data and code they publish.

Member

A TileDB user that belongs to an organizational namespace.

ML models

Algorithms or mathematical representations that enable computers to learn from and make predictions or decisions based on data.

Multi-range subarray

A subarray comprised of multiple ranges (subsets represented as hyper-rectangles of the array). Multi-range subarrays are only applicable to reads.
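The sketch below illustrates one way to think about a multi-range subarray in plain Python. It assumes each dimension carries its own list of inclusive ranges and that a cell qualifies when every coordinate falls in at least one range of its dimension. The helper name is hypothetical.

```python
def in_multi_range(coord, ranges_per_dim):
    """True if each coordinate lies in at least one range of its dimension."""
    return all(
        any(lo <= x <= hi for lo, hi in ranges)
        for x, ranges in zip(coord, ranges_per_dim)
    )

# Two ranges on the first dimension, two on the second.
ranges_per_dim = [[(1, 2), (5, 6)], [(1, 1), (3, 4)]]
coords = [(1, 1), (2, 2), (5, 3), (4, 4), (6, 4)]
selected = [c for c in coords if in_multi_range(c, ranges_per_dim)]
# selected == [(1, 1), (5, 3), (6, 4)]
```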

N

N+1 problem

The N+1 problem refers to a long-standing problem in using flat tables when columns introduce new rows. In VCFs, the N+1 problem describes the problem when a new sample is added to a cohort represented by a project VCF. In this situation, any new variants introduced by the new sample must be interrogated among all the existing samples to reconstruct this pVCF.

Namespace

A mechanism in TileDB Cloud where you can control who has access to specific resources and assets and who can use compute resources. TileDB offers two types of namespaces: individual namespaces, known as accounts, and organizational namespaces.

No-call

A no-call refers to a position in a VCF, denoted by ./. in the GT field, where there is not adequate depth to make a genotype call.

Non-empty domain

The minimum bounding hyper-rectangle of an array that tightly encompasses all non-empty cells in the array.
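A minimal sketch of the computation in plain Python, given the coordinates of the non-empty cells (the helper name is illustrative):

```python
def non_empty_domain(coords):
    """Return the tightest (min, max) range per dimension over all coordinates."""
    per_dimension = zip(*coords)  # regroup the coordinates by dimension
    return [(min(values), max(values)) for values in per_dimension]

ned = non_empty_domain([(2, 7), (5, 3), (4, 9)])
# ned == [(2, 5), (3, 9)]
```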

Normalization

The process of adjusting data to remove technical variation, making different datasets comparable.

Notebook

A container object designed by Project Jupyter that stores both Markdown and code (typically Python or R, but other languages are also supported). Notebooks allow you to document, execute, and share analyses on your array data. Notebooks form the basis of dashboards.

Notebook server

A JupyterLab instance spun up by TileDB Cloud for you to run your Jupyter notebooks without having to manually set up servers and deploy JupyterLab.

Nullable attribute

An attribute that accepts a null value. This is different from an empty cell, where the attribute is not present.

O

Omics

A field of study in biology that involves the comprehensive analysis of biological molecules, including genomics, proteomics, transcriptomics, and metabolomics.

Organization

A namespace that allows a team of users to join. An individual user creates this type of namespace. An organization can contain zero or more members.

P

Phenome-wide association studies (PheWAS)

Phenome-wide association studies explore associations between genetic variants and a wide range of phenotypes, providing a comprehensive view of genetic influences.

Polygenic risk scores (PRS)

Polygenic risk scores aggregate multiple genetic variants to predict an individual’s susceptibility to certain traits or diseases, facilitating personalized risk assessment in healthcare.
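One common way to aggregate variants is a weighted sum of allele dosages, sketched below in plain Python. The weighted-sum form, the variant IDs, and the weights are all assumptions made for illustration; they are not taken from TileDB or from any real study.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies of the effect allele)."""
    return sum(weights[variant] * dosage for variant, dosage in dosages.items())

# Hypothetical variant IDs and effect sizes.
weights = {"rs1": 0.5, "rs2": 0.25, "rs3": -0.5}
dosages = {"rs1": 2, "rs2": 0, "rs3": 1}
score = polygenic_risk_score(dosages, weights)
# score == 0.5
```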

Population VCF (pVCF)

A population or project VCF (pVCF) is usually a multi-sample VCF without reference/no-call spans.

PQ (Product Quantization)

Product quantization (PQ) is a form of lossy data compression for vectors. PQ reduces the memory footprint of a vector index but also reduces the search result accuracy.

Q

Query

A request made to a TileDB array that either reads data from the array, writes new data to the array, or (for sparse arrays only) deletes data from the array.

Query condition

A logical expression on an attribute or a dimension, applied to a query to limit the data accessed by the query.

R

R-tree

A data structure used by TileDB for multi-dimensional indexing in sparse fragments. It allows for fast pruning of irrelevant data during reading.

Reinforcement learning models

ML models that learn by interacting with an environment, receiving rewards or penalties based on actions taken.

Role-Based Access Control (RBAC)

A security mechanism that restricts system access to authorized users based on their role within the organization.

Row-major order

An order for tiles and cells where, assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, the rightmost coordinate index “varies the fastest”.
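For a small 2 x 3 grid, the ordering looks like this. The sketch is plain Python for illustration; `row_major` is a hypothetical helper, not part of TileDB.

```python
from itertools import product

def row_major(shape):
    """List coordinates so that the rightmost index varies the fastest."""
    # itertools.product already varies its last argument fastest.
    return list(product(*(range(n) for n in shape)))

cells = row_major((2, 3))
# cells == [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```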

S

Scalability

The ability of software to handle increasing amounts of data or number of users without performance degradation.

Schema

A definition of the structure of an array, including its dimensions, attributes, and their data types.

Schema evolution

The act of modifying the array’s schema after the array has been created.

Semi-supervised learning models

ML models that use both labeled and unlabeled data. Typically, semi-supervised learning models use a small amount of labeled data and a large amount of unlabeled data.

Single-range subarray

A single subset of the array represented as a hyper-rectangle of the array.

Slice

A set of tuples corresponding to start and end values of each of the dimensions of an array that defines a subset of that array.
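In plain Python, a slice of a 2D array could be represented as one (start, end) tuple per dimension, as sketched below. The sketch assumes both ends are inclusive; the helper name is illustrative.

```python
def cells_in_slice(coords, slice_ranges):
    """Keep the coordinates that fall inside every (start, end) range."""
    return [
        c for c in coords
        if all(start <= x <= end for x, (start, end) in zip(c, slice_ranges))
    ]

coords = [(1, 1), (2, 4), (3, 2), (5, 5)]
# Rows 1 through 3, columns 1 through 2.
selected = cells_in_slice(coords, [(1, 3), (1, 2)])
# selected == [(1, 1), (3, 2)]
```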

SOMA

Stack Of Matrices, Annotated. This includes the main assay (X counts) matrix, as well as additional annotations on the assay matrix.

Space tile

The tile defined by the tile extents of each dimension. In dense arrays, space tiles have a one-to-one correspondence with data tiles. Space tiles in sparse arrays, however, can have empty cells. Thus, space tiles don’t have the same one-to-one correspondence to data tiles in sparse arrays.

Sparse array

A TileDB array that only stores non-empty data points, optimizing space and retrieval for data that is not uniformly populated.

Storage paths

Paths within your S3 object store where TileDB Cloud will save your assets. You have the option to set granular storage paths for the following assets:

Arrays
Notebooks
UDFs
ML models
Files
Groups
Task graphs

Subarray

A slice of an array. Subarrays can be single-range or multi-range and are applicable to both dense and sparse arrays.

Supervised learning models

ML models that train on labeled data, meaning that each training example is paired with an output label.

T

Task

An arbitrary computation on TileDB. Tasks can be a generic function, an array UDF, a serverless SQL query, or a local function.

Task graph

Also known as a pipeline, a mechanism containing a set of pre-defined, serialized or parallel, synchronous or asynchronous tasks on TileDB. A task graph contains one or more tasks.
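The dependency structure of a task graph can be sketched with the Python standard library. This example runs tasks serially on one machine and is not the TileDB Cloud task graph API; the task names and the `run_task_graph` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def run_task_graph(tasks, dependencies):
    """Run each task after its dependencies, passing their results as inputs."""
    results = {}
    # static_order() yields every node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results

tasks = {
    "load": lambda: [1, 2, 3],
    "double": lambda xs: [2 * x for x in xs],
    "total": lambda xs: sum(xs),
}
dependencies = {"double": ["load"], "total": ["double"]}  # task -> prerequisites
results = run_task_graph(tasks, dependencies)
# results["total"] == 12
```

Tasks with no dependency between them could run in parallel; the serial loop here only shows the ordering constraint.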

Task log

A log of all tasks run by any user belonging to the current namespace. The task log includes information such as the action type, the user who launched the task, and the associated task graph.

Task graph log

A log of all task graphs run by any user belonging to the current namespace. The task graph log includes the name of the task graph, the namespace from which the task graph was launched, the user who launched the task graph, the duration of the task graph since it was launched, the start time, the number of tasks, the type of task graph, and the task graph ID.

Technical performance

Evaluating software based on speed, accuracy, and robustness, including how well it handles large datasets and complex analyses.

Throughput

The amount of data a system can process in a given time period, important for high-volume multi-omics analysis.

Tile

A chunk of data within a TileDB array used to facilitate efficient read and write operations by grouping cells.

Tile extent

The number of cells along a specific dimension of an array that can be stored in a tile.

Tile order

The order in which space tiles are stored on disk. The tile order can be row-major or column-major.

TileDB Cloud

The commercial product offering, built by the TileDB team.

Time traveling

A feature of TileDB that allows users to read different facets of an array at different points in time.
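Conceptually, time traveling follows from fragments being timestamped: reading at a timestamp considers only the fragments written at or before it. The plain-Python sketch below illustrates that idea with hypothetical data structures and is not how TileDB is implemented.

```python
def read_at(fragments, timestamp):
    """Rebuild the array state from fragments written at or before timestamp."""
    state = {}
    for written_at, cells in sorted(fragments, key=lambda f: f[0]):
        if written_at <= timestamp:
            state.update(cells)  # later fragments overwrite earlier cells
    return state

fragments = [
    (1, {(0, 0): "a"}),
    (2, {(0, 0): "b", (0, 1): "c"}),
]
before = read_at(fragments, timestamp=1)
after = read_at(fragments, timestamp=2)
# before == {(0, 0): "a"}
# after == {(0, 0): "b", (0, 1): "c"}
```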

U

Unsupervised learning models

ML models that train on unlabeled data. The goal of unsupervised learning models is to find hidden patterns or intrinsic structures in the data.

URI

A Uniform Resource Identifier used to specify the location of a TileDB array in persistent storage.

User-defined function (UDF)

A packaged piece of code you wish to reuse through your notebooks or task graphs on TileDB Cloud.

V

Vacuuming

A process in TileDB, usually run after consolidation, that deletes any fragments, fragment metadata, commits, array metadata, or group metadata (depending on the vacuuming mode chosen) that TileDB consolidated, in an attempt to save space on disk. Vacuuming is not process-safe.

Vamana

The Vamana vector search algorithm is an efficient method for nearest neighbor search in high-dimensional vector spaces, which constructs a graph where each node represents a vector and edges connect to its nearest neighbors. It utilizes a greedy search strategy on this graph to quickly locate the approximate nearest neighbors of a query vector.

Variable-sized attribute

An attribute whose value in each cell can hold more than one piece of data. TileDB supports two types of variable-sized attributes: lists of objects and strings.

Vector

A point in a \(d\)-dimensional vector space. This can be represented as a one-dimensional array of length \(d\), containing values of a specific datatype.

Vector Search

Also known as similarity search or nearest neighbor search, Vector Search involves finding vectors in a dataset that are similar to a given query based on a distance function.

Vector space

A multi-dimensional space with \(d\) dimensions.
