Dimensions vs. Attributes
One of the fundamental questions when designing an array schema is, “What are my dimensions and what are my attributes?” The answer depends on whether your array is dense or sparse. A good guideline that applies to both dense and sparse arrays is the following:
If you often perform range slicing over a field or column of your dataset, you should consider making it a dimension.
This guideline follows from the biggest strengths of array data systems:
- They can control how the multi-dimensional cells are laid out in the 1-dimensional storage medium.
- They can build efficient multi-dimensional indexes on top of the cell coordinates.
- They can optimize multi-dimensional range queries (slicing) by leveraging what they know about the cell layout and indexes.
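To make the first and third points concrete, the following is a minimal sketch of how a 2D range slice maps onto 1-dimensional storage, assuming a plain row-major layout with no tiling. The function name `slice_runs` and the layout are illustrative assumptions, not TileDB's actual storage format.

```python
def slice_runs(n_cols, row_range, col_range):
    """Return (offset, length) runs in row-major 1D storage that cover a 2D slice.

    Both ranges are inclusive. This is a toy model, not TileDB's real layout.
    """
    (r0, r1), (c0, c1) = row_range, col_range
    length = c1 - c0 + 1
    runs = []
    for r in range(r0, r1 + 1):
        start = r * n_cols + c0
        if runs and runs[-1][0] + runs[-1][1] == start:
            # This row's cells directly follow the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + length)
        else:
            runs.append((start, length))
    return runs


# A 4x5 array: slicing rows 1..2 and columns 1..3 touches two separate runs.
print(slice_runs(5, (1, 2), (1, 3)))  # [(6, 3), (11, 3)]
# Slicing full rows collapses into a single contiguous read.
print(slice_runs(5, (1, 2), (0, 4)))  # [(5, 10)]
```

Because the system knows the layout, it can turn a slice into a small number of contiguous reads instead of scanning every cell. Which slices are cheap depends on how the layout is chosen.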
When to use a dimension
Use a dimension for a dataset field that receives frequent range queries. Typical dimension examples include:
- The width and height of an image.
- The width, height, and time of a video.
- Time and stock symbol in a stock market tick dataset.
- Longitude, latitude, and elevation in point clouds.
- Sample, chromosome, and position in a variant dataset.
When to use an attribute
If your workloads involve aggregate or non-range filter queries on a particular field, or if you’re unsure whether a field will be used in a query condition, make that field an attribute. TileDB also optimizes filters and aggregates on attributes, so you will still get good performance. However, keep following the guideline above: a field should be a dimension if the majority of the workloads apply a range condition on it.
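The difference between the two roles can be sketched in plain Python. This example assumes a hypothetical tick dataset reduced to a single `time` dimension, with `price` as an attribute. The range condition on the dimension is answered by an index lookup (here a binary search), and the attribute condition is then applied as a filter over the cells that lookup returns. It illustrates the idea only and does not use the TileDB API.

```python
from bisect import bisect_left, bisect_right

# Hypothetical ticks as (time, symbol, price), kept sorted by the "time" dimension.
ticks = [
    (1, "AAPL", 150.0),
    (2, "MSFT", 300.0),
    (4, "AAPL", 151.5),
    (7, "MSFT", 299.0),
    (9, "AAPL", 149.0),
]
times = [t for t, _, _ in ticks]


def query(time_range, min_price):
    """Range-slice on the dimension, then filter on the attribute."""
    # Dimension: locate the inclusive time range without scanning every cell.
    lo = bisect_left(times, time_range[0])
    hi = bisect_right(times, time_range[1])
    candidates = ticks[lo:hi]
    # Attribute: filter only the cells that survived the range slice.
    return [row for row in candidates if row[2] >= min_price]


print(query((2, 7), 299.5))  # [(2, 'MSFT', 300.0)]
```

If most queries constrained `price` by range instead, this sketch would have to scan every tick, which is the situation the dimension guideline is meant to avoid.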
How many dimensions
The number of dimensions depends directly on the workloads and, more specifically, on how many of the dataset fields receive frequent range query conditions. Adding dimensions may increase query selectivity if they all receive range conditions, because this strengthens the pruning power of the internal multi-dimensional indexes, which may lead to faster queries.
However, note that increasing the number of dimensions yields diminishing returns, as the complexity of the internal indexes increases and the spatial locality of the results degrades. A good idea (that should be empirically tested) is to make the most selective fields (that is, those with the largest pruning power) dimensions, and keep the rest as attributes that can receive extra filtering conditions, which TileDB can also process efficiently.
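The pruning effect can be illustrated with a toy model in which cells are grouped into tiles and each tile records a bounding box over its coordinates. The tile boundaries, the helper names `overlaps` and `surviving_tiles`, and the use of `None` for an unconstrained dimension are all assumptions made for this sketch. It is not a description of TileDB's internal index.

```python
def overlaps(mbr, query):
    """True if a tile's bounding box intersects the query on every constrained dimension."""
    for (lo, hi), cond in zip(mbr, query):
        if cond is None:  # no range condition on this dimension
            continue
        q_lo, q_hi = cond
        if hi < q_lo or q_hi < lo:
            return False
    return True


def surviving_tiles(mbrs, query):
    """Return the tiles that cannot be ruled out by their bounding boxes alone."""
    return [m for m in mbrs if overlaps(m, query)]


# Four hypothetical tiles covering a 20x20 domain, as ((x_lo, x_hi), (y_lo, y_hi)).
mbrs = [
    ((0, 9), (0, 9)),
    ((0, 9), (10, 19)),
    ((10, 19), (0, 9)),
    ((10, 19), (10, 19)),
]

one_dim = surviving_tiles(mbrs, ((0, 5), None))       # range condition on x only
two_dims = surviving_tiles(mbrs, ((0, 5), (12, 15)))  # range conditions on x and y
print(len(one_dim), len(two_dims))  # 2 1
```

The second dimension halves the number of tiles to read only because the query constrains it. A dimension that rarely receives a range condition adds index complexity without adding pruning, which is why the choice should be tested against real workloads.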