import tiledbsoma
uri = "tiledb://TileDB-Inc/shapes-example-processed"
exp = tiledbsoma.Experiment.open(uri)
Shapes in TileDB-SOMA
The TileDB-SOMA team is proud to support an intuitive and extensible notion of shape with the release of TileDB-SOMA 1.15.
In this notebook, you will learn how to use shapes for the dataframes and arrays within your SOMA experiments, when and how you can resize, and options for experiments created in TileDB-SOMA versions before 1.15.
The dataset used is from Peripheral Blood Mononuclear Cells (PBMC3K), which is freely available from 10X Genomics.
The shape feature
As in other tutorials in this series, the SOMA data model carries over many familiar concepts from AnnData, including the ability to ask component dataframes and arrays what their shapes are.
First, import the necessary libraries and open an example experiment.
This is data ingested to TileDB-SOMA from PBMC3K.
The obs dataframe has a domain, which is a soft limit on what values you may write to it. You’ll get an exception like Query: A range was set outside of the current domain if you try to read or write soma_joinid values outside this range. This is an important data-integrity reassurance.
The domain seen here matches with the data populated inside of it. This will usually be the case, unless you created the dataframe but haven’t written any data to it yet. In that case, it’s empty, but it still has a domain.
If you have more data (more cells) to add to the experiment later, you will be able to resize the obs dataframe, up to the maxdomain, which is a hard limit.
You’ll learn more about experiment-level resizes later in this tutorial, as well as in the tutorial on TileDB-SOMA’s append mode.
The var dataframe’s domain is similar:
Likewise, the N-dimensional arrays within the experiment have their shapes as well.
An important difference: while the dataframe domain gives you the inclusive lower and upper bounds for soma_joinid writes, the shape for the N-dimensional arrays is the upper bound plus 1.
Since there are 2638 cells and 1838 genes here, X’s shape reflects that.
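The relationship between an inclusive domain and a shape can be sketched in plain Python. The helper name shape_from_domain is hypothetical and is not part of TileDB-SOMA; it only illustrates the arithmetic, using the bounds reported for this experiment.

```python
def shape_from_domain(domain):
    """Convert inclusive (lo, hi) bounds per dimension into a shape.

    Hypothetical helper for illustration only. It assumes every lower
    bound is 0, as is the case for the bounds shown in this tutorial.
    """
    shape = []
    for lo, hi in domain:
        if lo != 0:
            raise ValueError("this sketch only handles zero-based bounds")
        shape.append(hi + 1)  # inclusive upper bound plus 1
    return tuple(shape)


obs_domain = ((0, 2637),)  # inclusive soma_joinid bounds for obs
var_domain = ((0, 1837),)  # inclusive soma_joinid bounds for var

# X is nobs x nvar, so its shape combines both upper bounds plus 1.
x_shape = shape_from_domain(obs_domain + var_domain)
print(x_shape)  # (2638, 1838)
```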
The other N-dimensional arrays are similar:
In particular, the X array in this experiment (and in most experiments) is sparse. That means the matrix doesn’t need to store a value in every cell. Still, the shape serves as a soft limit for reads and writes: you’ll get an exception trying to read or write outside of these bounds. (Specifically, the message you’ll see is Query: A range was set outside of the current domain.)
As a convenience, you can see all the experiment’s objects’ shapes at once as follows:
import tiledbsoma.io
tiledbsoma.io.show_experiment_shapes(exp.uri)
[DataFrame] obs
URI tiledb://TileDB-Inc/4e63acce-71cc-4d42-96b8-0815bf7fc497
non_empty_domain ((0, 2637),)
domain ((0, 2637),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/95998d1a-82f9-4555-adc9-dfdee2f057f0
non_empty_domain ((0, 1837),)
domain ((0, 1837),)
maxdomain ((0, 9223372036854773968),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/68acd3b3-fb31-4089-8242-f72f35288ab6
used_shape ((0, 2637), (0, 1837))
shape (2638, 1838)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
...
[SparseNDArray] ms/RNA/obsm/X_pca
URI tiledb://TileDB-Inc/e147bdff-4066-45ca-90d3-e0041ee4259b
used_shape ((0, 2637), (0, 49))
shape (2638, 50)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
...
[SparseNDArray] ms/RNA/obsp/distances
URI tiledb://TileDB-Inc/b37fb332-6e31-4a08-8138-272f196081d9
used_shape ((0, 2637), (0, 2637))
shape (2638, 2638)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
[SparseNDArray] ms/RNA/varm/PCs
URI tiledb://TileDB-Inc/7b2849bb-5804-469c-95e1-c5bf52aa6266
used_shape ((0, 1837), (0, 49))
shape (1838, 50)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
...
(Not currently implemented in R.)
As with AnnData, as a general rule you’ll see the following:
- An X array’s shape is nobs x nvar.
- An obsm array’s shape is nobs x some number, maybe 50.
- An obsp array’s shape is nobs x nobs.
- A varm array’s shape is nvar x some number, maybe 50.
- A varp array’s shape is nvar x nvar.
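As a quick cross-check of these rules against the output shown earlier, the sketch below computes the expected shapes from nobs and nvar. The function expected_shapes and its parameter names are hypothetical, chosen only for this illustration.

```python
def expected_shapes(nobs, nvar, n_obsm_cols, n_varm_cols):
    """Return the shape each kind of array should have (illustrative only)."""
    return {
        "X": (nobs, nvar),
        "obsm": (nobs, n_obsm_cols),
        "obsp": (nobs, nobs),
        "varm": (nvar, n_varm_cols),
        "varp": (nvar, nvar),
    }


# Values taken from the processed PBMC3K experiment shown above.
shapes = expected_shapes(nobs=2638, nvar=1838, n_obsm_cols=50, n_varm_cols=50)
for name, shape in shapes.items():
    print(name, shape)
```

The X, obsm, obsp, and varm entries agree with the shape lines reported for X/data, obsm/X_pca, obsp/distances, and varm/PCs.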
When and how to resize at the experiment level
The primary reason you’d resize a dataframe or an array within an experiment is to append more data. For example, say you have an experiment with the results of Monday’s lab run on a sample of 100,000 cells. Then maybe on Tuesday, you’ll want to add that day’s lab run of another 70,000 cells to the same experiment, for a new total of 170,000 cells. It’s also possible that Tuesday’s data might include some infrequently expressed genes that didn’t appear in Monday’s data.
Because the shapes are soft limits, and reading or writing beyond them results in an exception, you’d need to resize the experiment so that its dataframes and arrays can accommodate the new nobs = 170,000.
Visit the append-mode tutorial for information on how to resize experiments by using tiledbsoma.io.register_anndatas and tiledbsoma.io.resize_experiment.
While you can resize each dataframe and array in the experiment one at a time (refer to Advanced usage), the most common case is tiledbsoma.io.resize_experiment, which exists to make this quick and convenient.
resize_experiment is available only in Python, because the append-mode feature currently exists only in Python.
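To make the Monday/Tuesday arithmetic concrete, here is a minimal sketch of how the new nobs and nvar could be worked out before a resize. It is plain Python, not the TileDB-SOMA API: plan_append and the gene names are made up for illustration, and it assumes that every Tuesday cell is new while genes may overlap. The append-mode tutorial covers the real workflow.

```python
def plan_append(existing_nobs, existing_var_ids, new_nobs, new_var_ids):
    """Return the (nobs, nvar) the experiment must hold after an append.

    Assumes all incoming cells are new and that genes are matched by ID.
    """
    total_nobs = existing_nobs + new_nobs
    all_var_ids = set(existing_var_ids) | set(new_var_ids)
    return total_nobs, len(all_var_ids)


monday_genes = ["GENE_A", "GENE_B", "GENE_C"]   # hypothetical gene IDs
tuesday_genes = ["GENE_B", "GENE_C", "GENE_D"]  # GENE_D first appears Tuesday

nobs, nvar = plan_append(100_000, monday_genes, 70_000, tuesday_genes)
print(nobs, nvar)  # 170000 4
```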
How to upgrade older experiments
Experiments created by TileDB-SOMA 1.15 and later will look as shown previously. The following code block shows an experiment created using TileDB-SOMA 1.14.5. This is the same PBMC3K dataset as before, except it’s the unprocessed version: this has fewer component arrays, which keeps the display here more compact.
Experiment-level upgrade is applicable only to the TileDB-SOMA Python API. This is because TileDB-SOMA experiments created in R before TileDB-SOMA 1.15 have their array shape already the same as maxshape, so these can’t be expanded more.
Compare the shapes from before TileDB-SOMA 1.15 to TileDB-SOMA 1.15:
Note that for the pre-1.15 experiment, the shape is large (like the maxshape) and tiledbsoma_has_upgraded_domain is False.
To make the old experiment look like the new experiment, call upgrade_experiment_shapes, and reopen.
For purposes of this document, we show the results of having done that.
Note that show_experiment_shapes and upgrade_experiment_shapes are currently only implemented in Python.
Before upgrading:
tiledbsoma.io.show_experiment_shapes("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
[DataFrame] obs
URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
non_empty_domain ((0, 2699),)
domain ((0, 2147483646),)
maxdomain ((0, 2147483646),)
upgraded False
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
non_empty_domain ((0, 13713),)
domain ((0, 2147483646),)
maxdomain ((0, 2147483646),)
upgraded False
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
used_shape ((0, 2699), (0, 13713))
shape (2147483646, 2147483646)
maxshape (2147483646, 2147483646)
upgraded False
True
Applying the upgrade:
tiledbsoma.io.upgrade_experiment_shapes("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", verbose=True)
[DataFrame] obs
URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
Applying tiledbsoma_upgrade_soma_joinid_shape(2700)
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
Applying tiledbsoma_upgrade_soma_joinid_shape(13714)
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
Applying tiledbsoma_upgrade_shape((2700, 13714))
True
After the upgrade:
tiledbsoma.io.show_experiment_shapes("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
[DataFrame] obs
URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
non_empty_domain ((0, 2699),)
domain ((0, 2699),)
maxdomain ((0, 2147483646),)
upgraded True
[DataFrame] ms/RNA/var
URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
non_empty_domain ((0, 13713),)
domain ((0, 13713),)
maxdomain ((0, 2147483646),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
used_shape ((0, 2699), (0, 13713))
shape (2700, 13714)
maxshape (2147483646, 2147483646)
upgraded True
To run a pre-check, you can do the following:
tiledbsoma.io.upgrade_experiment_shapes(the_uri, check_only=True)
(Not currently implemented in R.)
This won’t change anything. It’ll only tell you if the operation will be possible.
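The shapes applied in the verbose upgrade output above are consistent with "upper bound of the used range plus 1". The sketch below reproduces that arithmetic from the before-upgrade report. It is plain Python with a hypothetical plan_upgrade helper and a hand-built dictionary; it does not call TileDB-SOMA and is not a description of how the library computes the values internally.

```python
def plan_upgrade(reports):
    """Map each not-yet-upgraded object to the shape implied by its data."""
    plan = {}
    for name, info in reports.items():
        if info["upgraded"]:
            continue  # already has a new-style shape; nothing to do
        plan[name] = tuple(hi + 1 for _lo, hi in info["used"])
    return plan


# Used ranges copied from the "Before upgrading" report.
reports = {
    "obs": {"used": ((0, 2699),), "upgraded": False},
    "ms/RNA/var": {"used": ((0, 13713),), "upgraded": False},
    "ms/RNA/X/data": {"used": ((0, 2699), (0, 13713)), "upgraded": False},
}

plan = plan_upgrade(reports)
for name, shape in plan.items():
    print(name, shape)
```

The resulting values (2700, 13714, and (2700, 13714)) match the numbers in the "Applying ..." lines.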
Advanced usage
Dataframes with non-standard index columns
In the SOMA data model, the SparseNDArray and DenseNDArray objects always have int64 dimensions named soma_dim_0, soma_dim_1, and up, and they have a numeric soma_data attribute for the contents of the array.
For dataframes, though, while there must be a soma_joinid column of type int64, you can have additional index columns, or soma_joinid may be a non-index column.
In the most common case, where soma_joinid is the only index column, you can think of a dataframe as having a shape just as the N-dimensional arrays do.
That being said, dataframes are capable of more than that, via the index-column names you specify at creation time.
Create some dataframes, with the same data, but different choices of index-column names.
Now inspect the domain and maxdomain for these dataframes.
Notice the soma_joinid slot of the dataframe’s domain is as requested.
Another point is that a domain cannot be specified for string-type index columns.
At creation time, you can fill the corresponding slot in one of two ways:
domain = ([(0, 9), None],)
# or
domain = ([(0, 9), ("", "")],)
domain=list(soma_joinid=c(0, 9), mystring=NULL),
# or
domain=list(soma_joinid=c(0, 9), mystring=c('', '')),
In either case, the domain slot for a string-typed index column will read back as a pair of empty strings:
Now inspect the other dataframe. Here, soma_joinid isn’t an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.
The domain reads back as written.
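The read-back rule for string-typed index columns described above can be modeled in a few lines of plain Python. This is a toy model, not TileDB-SOMA code: read_back_domain and the type labels are hypothetical, and raising an error for a non-empty string domain is a choice made for this sketch.

```python
def read_back_domain(index_column_types, requested_domain):
    """Model what a dataframe's domain looks like when read back.

    Numeric slots come back as requested; string slots always come back
    as a pair of empty strings, whether None or ("", "") was supplied.
    """
    result = []
    for col_type, slot in zip(index_column_types, requested_domain):
        if col_type == "string":
            if slot is not None and tuple(slot) != ("", ""):
                # Sketch-only behavior: reject anything but the two
                # accepted spellings of "no domain".
                raise ValueError("string index columns cannot take a domain")
            result.append(("", ""))
        else:
            result.append(tuple(slot))
    return tuple(result)


types = ["int64", "string"]  # e.g. soma_joinid plus one string index column
print(read_back_domain(types, [(0, 9), None]))
print(read_back_domain(types, [(0, 9), ("", "")]))
```

Both calls produce the same result: the requested (0, 9) for the integer slot and a pair of empty strings for the string slot.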
Use resize at the dataframe/array level with the SOMA API
Earlier in this tutorial, you learned a fast and convenient way to resize all the dataframes and arrays within an experiment.
However, should you choose to do so, you can apply these one dataframe or array at a time.
For N-dimensional arrays that have been upgraded, or that were created using TileDB-SOMA 1.15 or later, do the following:
- If the array’s tiledbsoma_has_upgraded_shape method reports False, invoke the tiledbsoma_upgrade_shape method.
- Otherwise, invoke the .resize method.
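The decision in the two bullets above can be sketched as a small helper. The names tiledbsoma_has_upgraded_shape, tiledbsoma_upgrade_shape, and resize come from the text; grow_array and FakeArray are hypothetical and exist only so the sketch runs without a real experiment. Because the flag may be exposed as either an attribute or a method depending on the language binding, the helper accepts both.

```python
def grow_array(arr, new_shape):
    """Upgrade an old-style array, or resize one that is already upgraded."""
    flag = arr.tiledbsoma_has_upgraded_shape
    has_upgraded = flag() if callable(flag) else flag
    if has_upgraded:
        arr.resize(new_shape)
        return "resize"
    arr.tiledbsoma_upgrade_shape(new_shape)
    return "upgrade"


class FakeArray:
    """Stand-in for an array opened for writing; not a TileDB-SOMA class."""

    def __init__(self, shape, upgraded):
        self.shape = shape
        self.tiledbsoma_has_upgraded_shape = upgraded

    def tiledbsoma_upgrade_shape(self, new_shape):
        self.shape = tuple(new_shape)
        self.tiledbsoma_has_upgraded_shape = True

    def resize(self, new_shape):
        self.shape = tuple(new_shape)


old = FakeArray((2147483646, 2147483646), upgraded=False)
first = grow_array(old, (2700, 13714))   # not upgraded yet: upgrade path
second = grow_array(old, (5400, 13714))  # now upgraded: resize path
print(first, second, old.shape)
```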
Note: for purposes of this document, two experiments are shown: a before and an after. For your purposes, you would use a single experiment, and operate on only that.
Unpack a pre-1.15 experiment:
Notice that the X array has not been upgraded, and that its shape reports the same as maxshape:
Now give the X array the new-style shape. First, consult its non-empty domain to get a report of what data have already been successfully written there:
Next, reopen the experiment to find out what happened:
If you want, you can resize it even more:
For dataframes, the process is similar. If you want to expand only the soft limits for soma_joinid, you can use these methods instead:
- If the dataframe’s tiledbsoma_has_upgraded_domain reports False, invoke .tiledbsoma_upgrade_domain.
- Otherwise, invoke the .change_domain method.
TileDB-SOMA shape and domain in comparison to other TileDB terminology
TileDB-SOMA uses TileDB to implement the SOMA specification. You may find terminology corresponding to both TileDB and SOMA. This document has made use of SOMA terminology only. However, if you are familiar with broader TileDB concepts, here are the mappings.
- Core domain:
  - This has always existed.
  - This is immutable: it cannot be changed either larger or smaller once a dataframe or array has been created.
  - A SOMA DataFrame’s maxdomain is implemented by core domain.
  - A SOMA SparseNDArray or DenseNDArray’s maxshape is implemented by core domain.
  - It’s a runtime error to read or write data outside these boundaries.
  - This is a hard limit, in that it can’t be increased.
- Core current_domain:
  - This was introduced in 2024 as of version 2.26 of the open-source core of TileDB, and is available in TileDB-SOMA as of version 1.15.
  - This is mutable: it can’t be made smaller after dataframe or array creation, but you can make it larger, up to the core domain (SOMA maxdomain/maxshape).
  - A SOMA DataFrame’s domain is implemented by core current_domain.
  - A SOMA SparseNDArray or DenseNDArray’s shape is implemented by core current_domain.
  - TileDB-SOMA will throw a runtime error if you try to read or write data outside these boundaries: you will see the error message A range was set outside of the current domain.
  - This is a soft limit, in that it may be increased up to the hard limit.
- Dataframes/arrays created by TileDB-SOMA 1.14 or lower:
  - These will necessarily have core domain (SOMA maxdomain and maxshape, respectively).
  - These won’t have the core current_domain.
  - When you ask for a SOMA dataset’s domain or shape, you get the same value as maxdomain or maxshape.
  - Their tiledbsoma_has_upgraded_domain() and tiledbsoma_has_upgraded_shape() methods return False.
  - Using the upgrade feature mentioned previously, you can apply a core current_domain.
- Dataframes and arrays created by TileDB-SOMA 1.15 and later, or that have been upgraded:
  - These will necessarily have the core domain (SOMA maxdomain and maxshape, respectively).
  - These will also have the core current_domain (SOMA domain and shape, respectively).
  - Their tiledbsoma_has_upgraded_domain() and tiledbsoma_has_upgraded_shape() methods return True.
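The interplay between the immutable hard limit and the growable soft limit can be illustrated with a one-dimensional toy model. ToyArray is hypothetical and written in plain Python; it mirrors the rules listed above (no shrinking, no growth past the core domain, errors outside the current domain) and is not how TileDB or TileDB-SOMA is implemented.

```python
class ToyArray:
    """One-dimensional toy model of core domain vs. current domain."""

    def __init__(self, core_domain_hi, current_domain_hi):
        self._core_domain_hi = core_domain_hi        # hard limit, fixed at creation
        self.current_domain_hi = current_domain_hi   # soft limit, may grow

    def resize(self, new_hi):
        if new_hi < self.current_domain_hi:
            raise ValueError("current domain cannot be made smaller")
        if new_hi > self._core_domain_hi:
            raise ValueError("cannot grow past the core domain")
        self.current_domain_hi = new_hi

    def check_range(self, lo, hi):
        """Raise if an inclusive [lo, hi] range falls outside the soft limit."""
        if lo < 0 or hi > self.current_domain_hi:
            raise IndexError("A range was set outside of the current domain")


arr = ToyArray(core_domain_hi=2147483646, current_domain_hi=2699)

try:
    arr.check_range(0, 2700)  # one past the soft limit
except IndexError as exc:
    print("rejected:", exc)

arr.resize(5399)           # grow the soft limit; the hard limit is untouched
arr.check_range(0, 2700)   # now inside the current domain, so no error
print(arr.current_domain_hi)  # 5399
```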