import tiledbsoma
uri = "tiledb://TileDB-Inc/shapes-example-processed"
exp = tiledbsoma.Experiment.open(uri)
Shapes in TileDB-SOMA
The TileDB-SOMA team is proud to support an intuitive and extensible notion of shape with the release of TileDB-SOMA 1.15.
In this notebook, you will learn how to use shapes for the dataframes and arrays within your SOMA experiments, when and how you can resize, and options for experiments created in TileDB-SOMA versions before 1.15.
The dataset used is from Peripheral Blood Mononuclear Cells (PBMC3K), which is freely available from 10X Genomics.
The shape feature
As other tutorials in this series show, the SOMA data model carries over many familiar concepts from AnnData. This includes the ability to ask component dataframes and arrays what their shapes are.
First, import the necessary libraries and open an example experiment.
This is data ingested to TileDB-SOMA from PBMC3K.
The obs dataframe has a domain, which is a soft limit on what values you may write to it. You’ll get an exception like Query: A range was set outside of the current domain if you try to read or write soma_joinid values outside this range. This is an important data-integrity reassurance.
The domain seen here matches the data populated inside it. This will usually be the case, unless you created the dataframe but haven’t written any data to it yet. In that case, it’s empty, but it still has a domain.
If you have more data (more cells) to add to the experiment later, you will be able to resize obs, up to the maxdomain, which is a hard limit.
You’ll learn more about experiment-level resizes later in this tutorial, as well as in the tutorial on TileDB-SOMA’s append mode.
The var dataframe’s domain is similar:
Likewise, the N-dimensional arrays within the experiment have their shapes as well.
An important difference: while the dataframe domain gives you the inclusive lower and upper bounds for soma_joinid writes, the shape for the N-dimensional arrays is the upper bound plus 1.
Since there are 2638 cells and 1838 genes here, X’s shape reflects that.
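The relationship between an inclusive soma_joinid domain and an array shape can be sketched in plain Python. The helper name shape_from_domains is hypothetical and not part of the TileDB-SOMA API; it assumes every lower bound is 0, as in the domains reported for this experiment.

```python
def shape_from_domains(*domains):
    """Convert inclusive (lo, hi) bounds, one pair per dimension, into a shape.

    Assumes each lower bound is 0, so the extent is simply hi + 1.
    """
    return tuple(hi + 1 for (_lo, hi) in domains)


# Domains reported for obs and var in this experiment.
obs_domain = (0, 2637)
var_domain = (0, 1837)

x_shape = shape_from_domains(obs_domain, var_domain)
print(x_shape)  # (2638, 1838)
```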
The other N-dimensional arrays are similar:
In particular, the X array in this experiment — and in most experiments — is sparse. That means the matrix doesn’t need a number in every row or cell. Still, the shape serves as a soft limit for reads and writes: you’ll get an exception trying to read or write outside of these bounds. (Specifically, the message you’ll see is Query: A range was set outside of the current domain.)
As a convenience, you can see all the experiment’s objects’ shapes at once as follows:
import tiledbsoma.io
tiledbsoma.io.show_experiment_shapes(exp.uri)
[DataFrame] obs
URI tiledb://TileDB-Inc/4e63acce-71cc-4d42-96b8-0815bf7fc497
non_empty_domain ((0, 2637),)
domain ((0, 2637),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/95998d1a-82f9-4555-adc9-dfdee2f057f0
non_empty_domain ((0, 1837),)
domain ((0, 1837),)
maxdomain ((0, 9223372036854773968),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/68acd3b3-fb31-4089-8242-f72f35288ab6
used_shape ((0, 2637), (0, 1837))
shape (2638, 1838)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
...
[SparseNDArray] ms/RNA/obsm/X_pca
URI tiledb://TileDB-Inc/e147bdff-4066-45ca-90d3-e0041ee4259b
used_shape ((0, 2637), (0, 49))
shape (2638, 50)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
...
[SparseNDArray] ms/RNA/obsp/distances
URI tiledb://TileDB-Inc/b37fb332-6e31-4a08-8138-272f196081d9
used_shape ((0, 2637), (0, 2637))
shape (2638, 2638)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
[SparseNDArray] ms/RNA/varm/PCs
URI tiledb://TileDB-Inc/7b2849bb-5804-469c-95e1-c5bf52aa6266
used_shape ((0, 1837), (0, 49))
shape (1838, 50)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
...
(Not currently implemented in R.)
As with AnnData, as a general rule you’ll see the following:
- An X array’s shape is nobs x nvar.
- An obsm array’s shape is nobs x some number, maybe 50.
- An obsp array’s shape is nobs x nobs.
- A varm array’s shape is nvar x some number, maybe 50.
- A varp array’s shape is nvar x nvar.
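As a quick illustration of these rules, the following plain-Python sketch derives the expected shapes from nobs and nvar. The function expected_shapes and its n_components parameter are hypothetical names used only for this example; 50 is the number of components seen in the output above, not a fixed rule.

```python
def expected_shapes(nobs, nvar, n_components=50):
    """Return the shape each kind of array is expected to have."""
    return {
        "X": (nobs, nvar),
        "obsm": (nobs, n_components),
        "obsp": (nobs, nobs),
        "varm": (nvar, n_components),
        "varp": (nvar, nvar),
    }


# The processed PBMC3K experiment has 2638 cells and 1838 genes.
shapes = expected_shapes(2638, 1838)
for name, shape in shapes.items():
    print(name, shape)
```

The X, obsm, obsp, and varm entries agree with the shapes that show_experiment_shapes printed for ms/RNA/X/data, obsm/X_pca, obsp/distances, and varm/PCs.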
When and how to resize at the experiment level
The primary reason you’d resize a dataframe or an array within an experiment is to append more data. For example, say you have an experiment with the results of Monday’s lab run on a sample of 100,000 cells. Then maybe on Tuesday, you’ll want to add that day’s lab run of another 70,000 cells to the same experiment, for a new total of 170,000 cells. It’s also possible that Tuesday’s data might include some infrequently expressed genes that didn’t appear in Monday’s data.
Because shapes are soft limits, and reading or writing beyond them raises an exception, you need to resize the experiment so that its dataframes and arrays can accommodate the new nobs = 170,000.
Visit the append-mode tutorial for information on how to resize experiments by using tiledbsoma.io.register_anndatas and tiledbsoma.io.resize_experiment.
While you can resize each dataframe and array in the experiment one at a time (refer to Advanced usage), the most common case is tiledbsoma.io.resize_experiment, which exists to make this quick and convenient.
resize_experiment is available only in Python, because the append-mode feature currently exists only in Python.
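The arithmetic behind the Monday/Tuesday example can be sketched as follows. The helper plan_resize is hypothetical, and the gene counts (20,000 existing genes, 150 new ones) are made-up illustration values. In practice, the registration step described in the append-mode tutorial works out the new sizes for you, because only genes not already present count as new.

```python
def plan_resize(current_nobs, added_cells, current_nvars, new_genes):
    """Compute the nobs and per-measurement nvar an experiment must grow to.

    current_nvars and new_genes map measurement names (such as "RNA") to counts.
    """
    nobs = current_nobs + added_cells
    nvars = {ms: n + new_genes.get(ms, 0) for ms, n in current_nvars.items()}
    return nobs, nvars


# Monday: 100,000 cells. Tuesday: 70,000 more cells and a few unseen genes.
nobs, nvars = plan_resize(100_000, 70_000, {"RNA": 20_000}, {"RNA": 150})
print(nobs, nvars)  # 170000 {'RNA': 20150}

# With TileDB-SOMA installed, the plan might then be applied along these lines.
# The keyword names are an assumption; check them against the append-mode tutorial.
# import tiledbsoma.io
# tiledbsoma.io.resize_experiment(uri, nobs=nobs, nvars=nvars)
```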
How to upgrade older experiments
Experiments created by TileDB-SOMA 1.15 and later will look as shown previously. The following code block shows an experiment created using TileDB-SOMA 1.14.5. This is the same PBMC3K dataset as before, except it’s the unprocessed version: this has fewer component arrays, which keeps the display here more compact.
Experiment-level upgrade is applicable only to the TileDB-SOMA Python API. This is because TileDB-SOMA experiments created in R before TileDB-SOMA 1.15 have their array shape already the same as maxshape, so they can’t be expanded further.
Compare the shapes from before TileDB-SOMA 1.15 to TileDB-SOMA 1.15:
Note that for the pre-1.15 experiment, the shape is large — like the maxshape — and tiledbsoma_has_upgraded_domain is False.
To make the old experiment look like the new experiment, call upgrade_experiment_shapes, and reopen.
For purposes of this document, we show the results of having done that.
Note that show_experiment_shapes and upgrade_experiment_shapes are currently only implemented in Python.
Before upgrading:
tiledbsoma.io.show_experiment_shapes(
"tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded"
)
[DataFrame] obs
URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
non_empty_domain ((0, 2699),)
domain ((0, 2147483646),)
maxdomain ((0, 2147483646),)
upgraded False
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
non_empty_domain ((0, 13713),)
domain ((0, 2147483646),)
maxdomain ((0, 2147483646),)
upgraded False
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
used_shape ((0, 2699), (0, 13713))
shape (2147483646, 2147483646)
maxshape (2147483646, 2147483646)
upgraded False
True
Applying the upgrade:
tiledbsoma.io.upgrade_experiment_shapes(
"tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", verbose=True
)
[DataFrame] obs
URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
Applying tiledbsoma_upgrade_soma_joinid_shape(2700)
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
Applying tiledbsoma_upgrade_soma_joinid_shape(13714)
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
Applying tiledbsoma_upgrade_shape((2700, 13714))
True
After the upgrade:
tiledbsoma.io.show_experiment_shapes("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
[DataFrame] obs
URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
non_empty_domain ((0, 2699),)
domain ((0, 2699),)
maxdomain ((0, 2147483646),)
upgraded True
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
non_empty_domain ((0, 13713),)
domain ((0, 13713),)
maxdomain ((0, 2147483646),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
used_shape ((0, 2699), (0, 13713))
shape (2700, 13714)
maxshape (2147483646, 2147483646)
upgraded True
To run a pre-check, you can do the following:
tiledbsoma.io.upgrade_experiment_shapes(the_uri, check_only=True)
(Not currently implemented in R.)
This won’t change anything. It’ll only tell you if the operation will be possible.
Advanced usage
Dataframes with non-standard index columns
In the SOMA data model, the SparseNDArray and DenseNDArray objects always have int64 dimensions named soma_dim_0, soma_dim_1, and up, and they have a numeric soma_data attribute for the contents of the array.
For dataframes, though, while there must be a soma_joinid column of type int64, you can have additional index columns, or soma_joinid may be a non-index column.
This means that in the most common case, you can think of a dataframe as having a shape just as the N-dimensional arrays do.
That being said, dataframes are capable of more than that, via the index-column names you specify at creation time.
Create some dataframes, with the same data, but different choices of index-column names.
Now inspect the domain and maxdomain for these dataframes.
Notice the soma_joinid slot of the dataframe’s domain is as requested.
Another point is that a domain cannot be specified for string-type index columns.
At creation time, you can fill the slot for such a column in one of two ways:
domain = [(0, 9), None]
# or
domain = [(0, 9), ("", "")]
The R equivalents:
domain=list(soma_joinid=c(0, 9), mystring=NULL),
# or
domain=list(soma_joinid=c(0, 9), mystring=c('', '')),
In either case, the domain slot for a string-typed index column will read back as a pair of empty strings:
Now inspect the other dataframe. Here, soma_joinid isn’t an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.
The domain reads back as written.
Use resize at the dataframe/array level with the SOMA API
Earlier in this tutorial, you learned a fast and convenient way to resize all the dataframes and arrays within an experiment.
However, should you choose to do so, you can apply resizes to one dataframe or array at a time.
For N-dimensional arrays, whether created before TileDB-SOMA 1.15 or with 1.15 and later, do the following:
- If the array’s tiledbsoma_has_upgraded_shape method reports False, invoke the tiledbsoma_upgrade_shape method.
- Otherwise, invoke the .resize method.
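The decision between the two calls can be sketched as a small helper. The function grow_array and the FakeArray stand-in are hypothetical and exist only so the example runs without TileDB-SOMA. The helper accepts tiledbsoma_has_upgraded_shape as either an attribute or a method, since that may differ between language bindings.

```python
def grow_array(arr, new_shape):
    """Upgrade the array's shape if it has none yet, otherwise resize it."""
    upgraded = arr.tiledbsoma_has_upgraded_shape
    if callable(upgraded):  # tolerate a method as well as a property
        upgraded = upgraded()
    if not upgraded:
        arr.tiledbsoma_upgrade_shape(new_shape)
        return "upgraded"
    arr.resize(new_shape)
    return "resized"


class FakeArray:
    """Stand-in for a SOMA array opened for writing (not a TileDB-SOMA class)."""

    def __init__(self, shape, upgraded):
        self.shape = shape
        self.tiledbsoma_has_upgraded_shape = upgraded

    def tiledbsoma_upgrade_shape(self, new_shape):
        self.shape = tuple(new_shape)
        self.tiledbsoma_has_upgraded_shape = True

    def resize(self, new_shape):
        self.shape = tuple(new_shape)


x = FakeArray((2147483646, 2147483646), upgraded=False)
print(grow_array(x, (2700, 13714)), x.shape)  # upgraded (2700, 13714)
print(grow_array(x, (3000, 14000)), x.shape)  # resized (3000, 14000)
```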
Note: for purposes of this document, two experiments are shown: a before and an after. For your purposes, you would use a single experiment, and operate on only that.
Unpack a pre-1.15 experiment:
Notice that the X array has not been upgraded, and that its shape reports the same as maxshape:
Now give the X array the new-style shape. First, consult its non-empty domain to get a report of what data have already been successfully written there:
Next, reopen the experiment to find out what happened:
If you want, you can resize it even more:
For dataframes, the process is similar. If you want to expand only the soft limits for soma_joinid, you can use these methods instead:
- If the dataframe’s tiledbsoma_has_upgraded_domain reports False, invoke .tiledbsoma_upgrade_domain.
- Otherwise, invoke the .change_domain method.
TileDB-SOMA shape and domain in comparison to other TileDB terminology
TileDB-SOMA uses TileDB to implement the SOMA specification. You may find terminology corresponding to both TileDB and SOMA. This document has made use of SOMA terminology only. However, if you are familiar with broader TileDB concepts, here are the mappings.
- Core domain:
  - This has always existed.
  - This is immutable: it cannot be made either larger or smaller once a dataframe or array has been created.
  - A SOMA DataFrame’s maxdomain is implemented by core domain.
  - A SOMA SparseNDArray or DenseNDArray’s maxshape is implemented by core domain.
  - It’s a runtime error to read or write data outside these boundaries.
  - This is a hard limit, in that it can’t be increased.
- Core current_domain:
  - This was introduced in 2024 as of version 2.26 of the open-source core of TileDB, and is available in TileDB-SOMA as of version 1.15.
  - This is mutable: it can’t be made smaller after dataframe or array creation, but you can make it larger, up to the core domain (SOMA maxdomain/maxshape).
  - A SOMA DataFrame’s domain is implemented by core current_domain.
  - A SOMA SparseNDArray or DenseNDArray’s shape is implemented by core current_domain.
  - TileDB-SOMA will throw a runtime error if you try to read or write data outside these boundaries: you will see the error message A range was set outside of the current domain.
  - This is a soft limit, in that it may be increased up to the hard limit.
- Dataframes/arrays created by TileDB-SOMA 1.14 or lower:
  - These will necessarily have core domain (SOMA maxdomain and maxshape, respectively).
  - These won’t have the core current_domain.
  - When you ask for a SOMA dataset’s domain or shape, you get the same value as maxdomain or maxshape.
  - Their tiledbsoma_has_upgraded_domain() and tiledbsoma_has_upgraded_shape() methods return False.
  - Using the upgrade feature mentioned previously, you can apply a core current_domain.
- Dataframes and arrays created by TileDB-SOMA 1.15 and later, or that have been upgraded:
  - These will necessarily have the core domain (SOMA maxdomain and maxshape, respectively).
  - These will also have the core current_domain (SOMA domain and shape, respectively).
  - Their tiledbsoma_has_upgraded_domain() and tiledbsoma_has_upgraded_shape() methods return True.
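The interplay between the hard limit (core domain) and the soft limit (core current_domain) can be illustrated with a toy model of a single dimension. The class SoftLimitedAxis below is purely illustrative and is not TileDB or TileDB-SOMA code. It only mirrors the rules listed above: the hard limit is fixed at creation, the soft limit can grow but never shrink, and access beyond the soft limit is an error.

```python
class SoftLimitedAxis:
    """Toy model of one dimension with a fixed hard limit and a growable soft limit."""

    def __init__(self, hard_hi, soft_hi):
        self._hard_hi = hard_hi  # plays the role of core domain (maxdomain/maxshape)
        self.soft_hi = soft_hi   # plays the role of core current_domain (domain/shape)

    @property
    def hard_hi(self):
        # Read-only: the hard limit cannot change after creation.
        return self._hard_hi

    def resize(self, new_soft_hi):
        """Grow the soft limit, never shrinking it or exceeding the hard limit."""
        if new_soft_hi < self.soft_hi:
            raise ValueError("the soft limit cannot be made smaller")
        if new_soft_hi > self._hard_hi:
            raise ValueError("the soft limit cannot exceed the hard limit")
        self.soft_hi = new_soft_hi

    def check(self, coord):
        """Raise if a read or write coordinate falls outside the soft limit."""
        if not 0 <= coord <= self.soft_hi:
            raise IndexError("A range was set outside of the current domain")


# Bounds taken from the upgraded obs dataframe shown earlier.
obs = SoftLimitedAxis(hard_hi=2147483646, soft_hi=2699)
obs.check(2699)   # inside the soft limit
obs.resize(2999)  # make room for 300 more cells
obs.check(2700)   # allowed after the resize
```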