Glossary
A
Academy
A learning platform containing all the information (explanation of product features, tutorials, foundational concepts, and more) needed to make you a TileDB superuser and extract maximum value for you and your organization.
Account
A personal namespace created for an individual user in TileDB Cloud. Because it is for personal use, an account cannot admit other users into its namespace. By default, each user has a personal namespace.
Aggregate
A function built into TileDB that computes a result over qualifying data in parallel inside TileDB, rather than passing the data to an external system for processing. Examples of aggregates are count, sum, min, max, null count, and mean.
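As a rough illustration of what the listed aggregates compute, the sketch below evaluates them over a small list in plain Python. It does not use the TileDB API. The sample values, the use of `None` for nulls, and the choice to count null cells in `count` are assumptions made for this example.

```python
# Hypothetical attribute values for the qualifying cells; None stands for a null.
values = [3.0, None, 5.0, 1.0, None, 7.0]

non_null = [v for v in values if v is not None]

aggregates = {
    "count": len(values),                       # all qualifying cells (assumed to include nulls)
    "null_count": len(values) - len(non_null),  # cells whose value is null
    "sum": sum(non_null),
    "min": min(non_null),
    "max": max(non_null),
    "mean": sum(non_null) / len(non_null),
}
print(aggregates)
```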
API token
Also known as a REST API token, a string of alphanumeric characters and symbols. You use REST API tokens to authenticate to TileDB Cloud as a specific user and perform certain actions, depending on the scope set when you created the token.
Array
A multi-dimensional collection of data values stored in TileDB that allows for complex data structures and efficient access.
Array domain
The hyperspace defined by the dimension domains.
Array metadata
Key-value pairs of arbitrary data users can attach to an array, where the key is a string and the value can be any type.
Array schema
An object that stores details about the array definition, such as the attributes, dimensions, tile capacity and extent, and the tile and cell order.
Asset
Any object in TileDB Cloud, including the following:
- Arrays
- Files
- VCF datasets
- SOMA experiments
- Biomedical Imaging datasets
- Vector Search indexes
- Notebooks
- Dashboards
- UDFs
- Task Graphs
- ML Models
- Groups
Attribute
The values stored within each cell, represented by a key (attribute name) and a value.
B
Binary variant call format (BCF)
Binary version of a VCF file. Note: a BCF file is not a tabix bgzipped VCF file, although it provides the necessary indices for TileDB ingestion.
C
Cell
An element of an array, identified by an ordered tuple of dimension domain values (following the order in which the array dimensions were specified during array creation), called coordinates.
Cell order
The order in which cells within each space tile are written to disk. Cell orders can be row-major or column-major.
Cloud credentials
Credentials used to access resources on S3-compatible object stores. Cloud credentials are configured in your account settings.
Column-major order
An order for tiles and cells where, assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, the leftmost coordinate index “varies the fastest”.
Commit
A file TileDB creates to signify the successful creation of a fragment. Commits are eligible for consolidation and vacuuming.
Consolidation
A process in which TileDB combines various fragments, fragment metadata, commits, array metadata, or group metadata within the array into a smaller number of objects for faster reads at the cost of less granularity for time traveling. Consolidation is usually followed by vacuuming and is process-safe (i.e., you can run consolidation while read and write operations are happening on the array without the risk of losing data).
Configuration
A set of customizable key-value pairs in TileDB containing various settings it uses when it opens (creates an instance of) an array. When you pass a configuration object into a context object, and then pass that context object into the method that opens (instantiates) your array, those configuration settings affect the performance and overall functionality of that array instance until you close the array.
Context
An object containing a configuration object that gives TileDB instructions on how to configure the instantiation of an array.
Coordinates
The coordinates of an array cell are an ordered tuple of dimension domain values that identifies the cell. In dense arrays, the coordinates of each cell are unique. In sparse arrays, the same coordinates may appear more than once.
Cosine distance
The distance between two vectors, computed from the cosine of the angle between them. In specific vector spaces, cosine similarity is a measure of vector similarity.
D
Dashboard
An app built on top of a notebook allowing you to visualize your data and analyses.
Data tile
A subset of cell values on a particular attribute. Data tiles are also known as the atomic unit of I/O and compression.
Default context
The context to be used for all operations that accept a context as a parameter for the lifecycle of the program. This means that closing and reopening an array will use the same context and, consequently, the same configuration.
Deletion
In the context of sparse arrays, a deletion is a type of write operation that stores the delete conditions to logically represent that data has been removed from an array, without altering any past fragments. TileDB materializes deletions (permanently deletes the data matching the delete condition) as soon as you run consolidation and vacuuming.
Dense array
A type of TileDB array where all possible data points are stored, allowing for efficient storage and retrieval.
Dimension
Dimensions are data structures that comprise the hyperspace of the array. Each dimension has a name, a data type, and a domain of values (not to be confused with the array domain).
Dimension domain
The range of possible values a dimension can take, applicable to numeric domains.
Distance function
A function that computes the distance between two vectors. Different types of distance functions exist, including the L2 distance function (Euclidean distance function), the cosine distance, and the inner product distance.
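The three distance functions named above can be sketched in plain Python as follows. This is an illustration only, not the Vector Search implementation. Defining cosine distance as one minus cosine similarity is a common convention and is an assumption here, as are the function names.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: length of the segment between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product of the two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention); inputs must be non-zero."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

u = [1.0, 0.0]
v = [0.0, 1.0]
print(l2_distance(u, v), inner_product(u, v), cosine_distance(u, v))
```

With these orthogonal unit vectors, the inner product is zero and the cosine distance is at the midpoint of its range, while the L2 distance is the diagonal of the unit square.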
E
Embedding
A vector generated by a machine learning model given an external input object. Different machine learning models can be used to generate embeddings for different object types (e.g., image, text, audio, video). Embedding similarity (vector distance) is expected to approximate the similarity of the input objects, so vector search approximates object similarity search.
Empty cell
A cell that contains no attribute data.
Euclidean distance function
Refer to L2 distance function.
Expression quantitative trait loci (eQTL) analysis
A quantitative trait loci (QTL) analysis measures the association between variants and a phenotype. An expression quantitative trait loci (eQTL) analysis is a QTL analysis in which the phenotype is expression.
F
Fill values
Attribute values given to empty cells in a dense array.
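A minimal sketch of the idea, in plain Python rather than the TileDB API: reading a dense region returns a value for every cell, and cells that were never written come back as the fill value. The fill value `-1`, the array shape, and the written cells are hypothetical.

```python
FILL_VALUE = -1  # hypothetical fill value chosen for this sketch

# Cells that were actually written to a 2x3 dense array, keyed by (row, col).
written = {(0, 0): 10, (0, 2): 30, (1, 1): 50}

def read_dense(rows, cols):
    """Return the full grid, substituting FILL_VALUE for empty cells."""
    return [[written.get((r, c), FILL_VALUE) for c in range(cols)]
            for r in range(rows)]

grid = read_dense(2, 3)
print(grid)
```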
Filter
A data transformation on an attribute or dimension, such as compression or encryption, that TileDB applies to data tiles before it writes those data tiles to disk.
FLAT
A straightforward algorithm implementation for Vector Search used to provide exact vector similarity search by computing the distance of the query vector and all the dataset vectors.
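The brute-force idea behind FLAT can be sketched in a few lines of plain Python. This is not the TileDB Vector Search API. The function name, the sample dataset, and the use of Euclidean distance are assumptions for illustration.

```python
import math

def flat_search(dataset, query, k):
    """Exact k-nearest-neighbor search: score every vector, keep the k closest."""
    scored = [(math.dist(vec, query), idx) for idx, vec in enumerate(dataset)]
    scored.sort()
    return [idx for _, idx in scored[:k]]

dataset = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbors = flat_search(dataset, [1.0, 1.0], k=2)
print(neighbors)
```

Because every dataset vector is compared with the query, the result is exact, and the cost grows linearly with the dataset size.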
Fragment
A timestamped portion of data within a TileDB array that represents a snapshot or subset of the array’s content at a specific time.
Fragment metadata
System-specific, non-user-editable information about a fragment, including whether the fragment is dense or sparse, the non-empty domain of the fragment, the tile offsets, and the tile sizes. For sparse fragments, the fragment metadata includes an R-tree of the fragment data.
Frontier data
Data for bleeding-edge scientific applications that will power current and future generations of scientific discovery. Today, this includes population genomics, biomedical imaging, proteomics, metabolomics, single-cell, and spatial transcriptomics, all data that wasn’t collectable even 50 years ago. In the future, it will include brand new -omics fields and new areas of discovery.
G
Genome-wide association study (GWAS)
Genome-wide association studies identify genetic variants associated with complex traits or diseases across the entire genome.
Genomic VCF (gVCF)
A genomic VCF (gVCF) file, usually limited to one sample, is a VCF file containing variant positions; reference spans, which define long genomic intervals where the sample matches the reference sequence; and no-call spans, where the sequencing depth is too low to make a confident genotype call. TileDB-VCF recognizes these gVCF reference spans during queries and export.
Global cell order
A mapping from the multi-dimensional cell space to the 1-dimensional physical storage space (i.e., it is the order in which TileDB stores cell values on disk). It comprises the tile order and cell order.
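To make the mapping concrete, the sketch below enumerates the cells of a small 2D dense array in global order, assuming row-major tile order and row-major cell order. The array shape and tile extents are hypothetical, and the function is an illustration rather than TileDB code.

```python
def global_order(rows, cols, tile_rows, tile_cols):
    """List cell coordinates in global order (row-major tiles, row-major cells)."""
    order = []
    for tr in range(0, rows, tile_rows):        # tile order: walk tiles row by row
        for tc in range(0, cols, tile_cols):
            # cell order: walk the cells inside the current tile row by row
            for r in range(tr, min(tr + tile_rows, rows)):
                for c in range(tc, min(tc + tile_cols, cols)):
                    order.append((r, c))
    return order

order = global_order(4, 4, 2, 2)
print(order[:8])
```

All four cells of the first tile are listed before any cell of the second tile, which is why cells that are close in the global order are not always adjacent in a plain row-major layout of the whole array.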
Group
An asset within TileDB that creates a logical, hierarchical storage of other TileDB assets, including other groups. Groups are advantageous for objects that may exist in cloud object stores, which have no concept of a directory.
I
Incomplete query
A query where the result size of a subarray is larger than the allocated buffers that will hold the result. TileDB gracefully handles this case via result estimation and subarray partitioning.
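The general pattern for consuming such a query is a loop that reads as much as the buffer holds, processes it, and resubmits until nothing remains. The sketch below simulates that loop in plain Python. It does not model TileDB's result estimation or subarray partitioning, and the function name and buffer size are hypothetical.

```python
def read_in_batches(results, buffer_size):
    """Yield results in chunks that fit the buffer until the query completes."""
    offset = 0
    while offset < len(results):
        yield results[offset:offset + buffer_size]  # one "incomplete" submission
        offset += buffer_size

batches = list(read_in_batches(list(range(7)), buffer_size=3))
print(batches)
```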
Inner product distance
The distance between two vectors computed by the inner product of two vectors. Also known as dot product, it is a measure of vector similarity in specific vector spaces.
Internal allele frequency (IAF)
The frequency of an allele observed in a TileDB variant store, taking into account reference calls, no-calls, and polyallelic sites.
Interoperability
The capability of different software systems and tools to work together seamlessly, often facilitated by standardized data formats and APIs.
IVF_FLAT
A Vector Search algorithm implementation based on \(k\)-means clustering, where TileDB computes \(k\) separate clusters (as well as their centroids) of the dataset vectors, shuffling them in such a way that vectors appear adjacent on storage. Answering a query involves focusing only on a small number of clusters, based on the query’s proximity to their centroids.
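The partition-then-probe idea can be sketched as follows. This is a simplified illustration, not the TileDB implementation: the centroids are assumed to have been computed already (the \(k\)-means step is not shown), and the function names and the `nprobe` parameter name are hypothetical.

```python
import math

def nearest_centroids(centroids, vec, n):
    """Indices of the n centroids closest to vec."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: math.dist(centroids[i], vec))
    return ranked[:n]

def build_partitions(centroids, dataset):
    """Assign every dataset vector to the partition of its closest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(dataset):
        partitions[nearest_centroids(centroids, vec, 1)[0]].append(idx)
    return partitions

def ivf_flat_search(centroids, partitions, dataset, query, k, nprobe):
    """Search only the nprobe partitions whose centroids are closest to the query."""
    candidates = []
    for p in nearest_centroids(centroids, query, nprobe):
        candidates.extend(partitions[p])
    candidates.sort(key=lambda idx: math.dist(dataset[idx], query))
    return candidates[:k]

centroids = [[0.0, 0.0], [10.0, 10.0]]  # assumed precomputed
dataset = [[0.1, 0.2], [9.8, 10.1], [0.3, -0.1], [10.5, 9.5]]
partitions = build_partitions(centroids, dataset)
result = ivf_flat_search(centroids, partitions, dataset, [10.0, 10.0], k=2, nprobe=1)
print(partitions, result)
```

Probing fewer partitions makes the search faster but can miss true neighbors that fall in partitions that were not visited, which is why the results are approximate.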
L
L2 distance function
Also known as the Euclidean distance, the length of the line segment between the points represented by two vectors. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
Large language model (LLM)
A Machine Learning model that takes input data, typically a request in natural language, and outputs a response in natural language.
M
Machine Learning (ML)
A subfield of artificial intelligence (AI) focused on developing algorithms that allow computers to learn and improve from experience without being explicitly programmed, imitating intelligent human behavior.
Marketplace
The Marketplace is a collection of publicly available assets uploaded to TileDB by other TileDB users. You can view high-level details about each asset without an account, but you must be logged in to TileDB in order to take actions on these assets or use them programmatically. Users who publish items on the Marketplace have the option to monetize any data and code they publish.
Member
A TileDB user that belongs to an organizational namespace.
ML models
Algorithms or mathematical representations that enable computers to learn from and make predictions or decisions based on data.
Multi-Modal Omics
Integration of various omics data types (genomics, proteomics, metabolomics, etc.) to provide a comprehensive understanding of biological systems.
Multi-Modal Support
The capability of TileDB to handle different data types and formats within a single database system, providing flexibility for various use cases.
Multi-range subarray
A subarray comprised of multiple ranges (subsets represented as hyper-rectangles of the array). Multi-range subarrays are only applicable to reads.
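As an illustration, the sketch below selects cells from a small 2D array using several inclusive ranges per dimension. It is plain Python, not the TileDB query API. It assumes that a cell qualifies when each of its coordinates falls inside at least one of the ranges given for that dimension. The cell data and function name are hypothetical.

```python
# Hypothetical non-empty cells of a 2D array: (row, col) -> value.
cells = {(1, 1): "a", (1, 4): "b", (3, 2): "c", (4, 4): "d"}

def read_multi_range(cells, row_ranges, col_ranges):
    """Return cells whose row and column each fall in one of the given ranges.

    Ranges are inclusive (start, end) pairs.
    """
    def in_any(value, ranges):
        return any(lo <= value <= hi for lo, hi in ranges)

    return {
        coords: val
        for coords, val in cells.items()
        if in_any(coords[0], row_ranges) and in_any(coords[1], col_ranges)
    }

result = read_multi_range(cells, row_ranges=[(1, 1), (3, 4)], col_ranges=[(1, 2)])
print(result)
```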
N
N+1 problem
The N+1 problem refers to a long-standing problem in using flat tables when columns introduce new rows. In VCFs, the N+1 problem describes the problem when a new sample is added to a cohort represented by a project VCF. In this situation, any new variants introduced by the new sample must be interrogated among all the existing samples to reconstruct this pVCF.
Namespace
A mechanism in TileDB Cloud where you can control who has access to specific resources and assets and who can use compute resources. TileDB offers two types of namespaces: individual namespaces, known as accounts, and organizational namespaces.
No-call
A no-call refers to a position in a VCF, denoted by ./. in the GT field, where there is not adequate depth to make a genotype call.
Non-empty domain
The minimum bounding hyper-rectangle of an array that tightly encompasses all non-empty cells in the array.
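A minimal sketch of the computation, assuming the coordinates of all non-empty cells are available as tuples: the non-empty domain is the per-dimension minimum and maximum. The function name and sample coordinates are hypothetical, and this is not how TileDB derives the value internally.

```python
def non_empty_domain(coords):
    """Return one inclusive (min, max) pair per dimension."""
    return [(min(axis), max(axis)) for axis in zip(*coords)]

coords = [(2, 7), (5, 3), (4, 9)]  # coordinates of the non-empty cells
ned = non_empty_domain(coords)
print(ned)
```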
Normalization
The process of adjusting data to remove technical variation, making different datasets comparable.
Notebook
A container object designed by Project Jupyter that stores both Markdown and code (typically Python or R, but other languages are also supported). Notebooks allow you to document, execute, and share analyses on your array data. Notebooks form the basis of dashboards.
Notebook server
A JupyterLab instance spun up by TileDB Cloud for you to run your Jupyter notebooks without having to manually set up servers and deploy JupyterLab.
Nullable attribute
An attribute that accepts a null value. This is different from an empty cell, where the attribute is not present.
O
Omics
A field of study in biology that involves the comprehensive analysis of biological molecules, including genomics, proteomics, transcriptomics, and metabolomics.
Organization
A namespace that allows a team of users to join. An individual user creates this type of namespace. An organization can contain zero or more members.
P
Phenome-wide association studies (PheWAS)
Phenome-wide association studies explore associations between genetic variants and a wide range of phenotypes, providing a comprehensive view of genetic influences.
Polygenic risk scores (PRS)
Polygenic risk scores aggregate multiple genetic variants to predict an individual’s susceptibility to certain traits or diseases, facilitating personalized risk assessment in healthcare.
Population VCF (pVCF)
A population or project VCF (pVCF) is usually a multi-sample VCF without reference/no-call spans.
PQ (Product Quantization)
Product quantization (PQ) is a form of lossy data compression for vectors. PQ reduces the memory footprint of a vector index but also reduces the search result accuracy.
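The sketch below shows the core encode and decode steps of product quantization in plain Python. It assumes the per-subspace codebooks already exist (training them is not shown). The codebooks, function names, and sample vector are hypothetical, and this is not the TileDB implementation.

```python
import math

# Hypothetical codebooks: one list of codewords per 2-dimensional subspace.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codewords for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # codewords for dimensions 2-3
]

def pq_encode(vec, codebooks):
    """Replace each subvector with the index of its closest codeword."""
    sub_len = len(vec) // len(codebooks)
    codes = []
    for m, book in enumerate(codebooks):
        sub = vec[m * sub_len:(m + 1) * sub_len]
        codes.append(min(range(len(book)), key=lambda i: math.dist(book[i], sub)))
    return codes

def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen codewords."""
    return [x for m, code in enumerate(codes) for x in codebooks[m][code]]

vec = [0.9, 1.1, 0.2, 0.8]
codes = pq_encode(vec, codebooks)
approx = pq_decode(codes, codebooks)
print(codes, approx)
```

The four floating-point values are stored as two small integer codes, which is where the memory saving comes from. The decoded vector differs from the original, which is where the loss of accuracy comes from.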
Q
Query
A request made to TileDB that either reads data from the array, writes new data to the array, or (for sparse arrays only) deletes data from the array.
Query condition
A logical expression on an attribute or a dimension, applied to a query to limit the data accessed by the query.
R
R-tree
A data structure used by TileDB for multi-dimensional indexing in sparse fragments. It allows for fast pruning of irrelevant data during reading.
Reinforcement learning models
ML models that learn by interacting with an environment, receiving rewards or penalties based on actions taken.
Role-Based Access Control (RBAC)
A security mechanism that restricts system access to authorized users based on their role within the organization.
Row-major order
An order for tiles and cells where, assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, the rightmost coordinate index “varies the fastest”.
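The difference between row-major and column-major order can be made concrete by computing the linear position of a cell under each. The sketch below is plain Python with hypothetical function names, and it assumes zero-based coordinates.

```python
def row_major_offset(coords, shape):
    """Linear offset where the rightmost index varies the fastest."""
    offset = 0
    for index, extent in zip(coords, shape):
        offset = offset * extent + index
    return offset

def col_major_offset(coords, shape):
    """Linear offset where the leftmost index varies the fastest."""
    offset = 0
    for index, extent in zip(reversed(coords), reversed(shape)):
        offset = offset * extent + index
    return offset

shape = (2, 3)  # 2 rows, 3 columns
all_cells = [(r, c) for r in range(2) for c in range(3)]
row_major = sorted(all_cells, key=lambda rc: row_major_offset(rc, shape))
col_major = sorted(all_cells, key=lambda rc: col_major_offset(rc, shape))
print(row_major)
print(col_major)
```

In the row-major listing the column index changes on every step, while in the column-major listing the row index does.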
S
Scalability
The ability of software to handle increasing amounts of data or number of users without performance degradation.
Schema
A definition of the structure of an array, including its dimensions, attributes, and their data types.
Schema evolution
The act of modifying the array’s schema after the array has been created.
Semi-supervised learning models
ML models that use both labeled and unlabeled data. Typically, semi-supervised learning models use a small amount of labeled data and a large amount of unlabeled data.
Single-range subarray
A single subset of the array represented as a hyper-rectangle of the array.
Slice
A set of tuples corresponding to start and end values of each of the dimensions of an array that defines a subset of that array.
SOMA
Stack Of Matrices, Annotated. This includes the main assay (X counts) matrix, as well as additional annotations on the assay matrix.
Space tile
The tile defined by the tile extents of each dimension. In dense arrays, space tiles have a one-to-one correspondence with data tiles. Space tiles in sparse arrays, however, can have empty cells. Thus, space tiles don’t have the same one-to-one correspondence to data tiles in sparse arrays.
Sparse array
A TileDB array that only stores non-empty data points, optimizing space and retrieval for data that is not uniformly populated.
Storage paths
Paths within your S3 object store where TileDB Cloud will save your assets. You have the option to set granular storage paths for the following assets:
- Arrays
- Notebooks
- UDFs
- ML models
- Files
- Groups
- Task graphs
Subarray
A slice of an array. Subarrays can be single-range or multi-range and are applicable to both dense and sparse arrays.
Supervised learning models
ML models that train on labeled data, meaning that each training example is paired with an output label.
T
Task
An arbitrary computation on TileDB. Tasks can be a generic function, an array UDF, a serverless SQL query, or a local function.
Task graph
Also known as a pipeline, a mechanism containing a set of pre-defined, serialized or parallel, synchronous or asynchronous tasks on TileDB. A task graph contains one or more tasks.
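The structure of a task graph can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). This sketch only derives an execution order from dependencies. It does not use the TileDB Cloud API or run anything, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "filter": {"ingest"},
    "summarize": {"filter"},
    "plot": {"filter"},
    "report": {"summarize", "plot"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    stages.append(ready)                # tasks in one stage could run in parallel
    sorter.done(*ready)
print(stages)
```

Tasks that land in the same stage have no dependency on each other, so they are candidates for parallel execution, while the stages themselves run one after another.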
Task log
A log of all tasks run by any user belonging to the current namespace. The task log includes information such as the action type, the user who launched the task, and the associated task graph.
Task graph log
A log of all task graphs run by any user belonging to the current namespace. The task graph log includes the name of the task graph, the namespace from where this task graph was launched, who launched the task, the duration of the task graph since it was launched, the start time, the number of tasks, the type of task graph, and the task graph ID.
Technical performance
Evaluating software based on speed, accuracy, and robustness, including how well it handles large datasets and complex analyses.
Throughput
The amount of data a system can process in a given time period, important for high-volume multi-omics analysis.
Tile
A chunk of data within a TileDB array used to facilitate efficient read and write operations by grouping cells.
Tile extent
The number of cells along a specific dimension of an array that can be stored in a tile.
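As a small worked example, the sketch below computes which space tile a coordinate falls into along one dimension, and how many tiles cover the dimension domain. It assumes an inclusive integer domain. The function names and the sample domain are hypothetical.

```python
import math

def tile_index(coord, domain_start, extent):
    """Index of the space tile (along one dimension) that contains coord."""
    return (coord - domain_start) // extent

def tile_count(domain_start, domain_end, extent):
    """Number of space tiles needed to cover an inclusive dimension domain."""
    return math.ceil((domain_end - domain_start + 1) / extent)

# Dimension domain [1, 10] with a tile extent of 4.
print([tile_index(c, 1, 4) for c in (1, 4, 5, 10)])
print(tile_count(1, 10, 4))
```

Here the last tile is only partially covered by the domain, since the domain holds 10 cells and three tiles of extent 4 span 12.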
Tile order
The order in which space tiles are stored on disk. The tile order can be row-major or column-major.
TileDB Cloud
The commercial product offering, built by the TileDB team.
Time traveling
A feature of TileDB that allows users to read different facets of an array at different points in time.
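Conceptually, reading at a point in time means considering only the fragments written at or before that time. The sketch below models this in plain Python with hypothetical fragments and integer timestamps. It is an illustration of the idea, not the TileDB API.

```python
# Hypothetical fragments: (timestamp, {coordinates: value}).
fragments = [
    (1, {(0, 0): "a", (0, 1): "b"}),
    (2, {(0, 1): "B"}),
    (3, {(1, 1): "c"}),
]

def read_at(fragments, timestamp):
    """Merge fragments written at or before timestamp; later writes win."""
    view = {}
    for ts, cells in sorted(fragments, key=lambda f: f[0]):
        if ts <= timestamp:
            view.update(cells)
    return view

print(read_at(fragments, 1))
print(read_at(fragments, 3))
```

If the three fragments were merged into one, the intermediate states at timestamps 1 and 2 could no longer be reconstructed, which is the loss of time-traveling granularity that the Consolidation entry mentions.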
U
Unsupervised learning models
ML models that train on unlabeled data. The goal of unsupervised learning models is to find hidden patterns or intrinsic structures in the data.
URI
A Uniform Resource Identifier used to specify the location of a TileDB array in persistent storage.
User-defined function (UDF)
A packaged piece of code you wish to reuse through your notebooks or task graphs on TileDB Cloud.
V
Vacuuming
A process in TileDB, usually run after consolidation, that deletes any fragments, fragment metadata, commits, array metadata, or group metadata (depending on the vacuuming mode chosen) that TileDB consolidated, in an attempt to save space on disk. Vacuuming is not process-safe.
Vamana
The Vamana vector search algorithm is an efficient method for nearest neighbor search in high-dimensional vector spaces, which constructs a graph where each node represents a vector and edges connect to its nearest neighbors. It utilizes a greedy search strategy on this graph to quickly locate the approximate nearest neighbors of a query vector.
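The greedy search step can be sketched as a walk over the neighbor graph. This is a deliberately simplified version that tracks a single current node, whereas practical implementations typically keep a list of candidates. Graph construction is not shown, and the graph, vectors, and function name are hypothetical.

```python
import math

def greedy_search(graph, vectors, start, query):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = start
    while True:
        best = min(graph[current],
                   key=lambda n: math.dist(vectors[n], query),
                   default=current)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current  # no neighbor is closer: local minimum reached
        current = best

vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # hypothetical neighbor lists
found = greedy_search(graph, vectors, start=0, query=[2.9, 0.0])
print(found)
```

The walk only compares the query against the neighbors of the nodes it visits, so it can stop at a local minimum. That is why the results are approximate rather than exact.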
Variable-sized attribute
An attribute whose value in each cell can hold more than one piece of data. TileDB supports two types of variable-sized attributes: lists of objects and strings.
Vector
A point in a \(d\)-dimensional vector space. This can be represented as a one-dimensional array of length \(d\), containing values of a specific datatype.
Vector Search
Also known as similarity search or nearest neighbor search, Vector Search involves finding vectors in a dataset that are similar to a given query based on a distance function.
Vector space
A multi-dimensional space with \(d\) dimensions.