Join IDs
As described in the Data Model section, an annotated matrix, which is the core structure SOMA encapsulates, consists of multiple array types:
- The obs dataframe with information about observations (e.g., cells).
- The var dataframe with information about variables (e.g., transcripts).
- The two-dimensional X matrix containing the actual measurements, where one dimension corresponds to observations and the other to variables (e.g., a cell-by-gene count matrix).
For single-cell data, this core structure is extended to include additional arrays for storing derived data, such as PCA coordinates, UMAP coordinates, and pairwise connectivities or distances.
The coordinates in X (and the other arrays) need to line up with the information in the obs and var dataframes to maintain the integrity of the assay data and its annotations.
AnnData conventions
This section examines the conventions used in the AnnData Python package, as an example of how other software tracks relationships between the components of an annotated matrix.
In the AnnData world, the following indexing conventions apply:
- obs has an index column, nominally a string column, often containing cell barcodes.
- var has an index column, nominally a string column, generally containing Ensembl or NCBI identifiers.
- X is integer-indexed. The obs and var dataframes are row and column annotations for the indices of the X matrix.
- Similarly:
  - obs positions annotate the row indices of matrices in the obsm collection.
  - var positions annotate the row indices of matrices in the varm collection.
  - obs positions annotate the row and column indices of matrices in the obsp collection.
  - var positions annotate the row and column indices of matrices in the varp collection.
For example, consider the following obs dataframe from an AnnData object. Values in the obs_id column are used to index the rows of the X matrix.

obs_id (index) | n_genes | percent_mito | Note |
---|---|---|---|
AAACATTGAGCTAC | 135 | 0.034 | This is implicitly row 0 |
GATTTAGATTCGTT | 24 | 0.022 | This is implicitly row 1 |
TTTCGAACTCTCAT | 589 | 0.017 | This is implicitly row 2 |
Similarly, the var dataframe uses values in the var_id column to index the columns of the X matrix.

var_id (index) | n_cells | Note |
---|---|---|
APOE | 137 | This is implicitly row 0 of var (column 0 of X) |
ESR1 | 248 | This is implicitly row 1 of var (column 1 of X) |
Finally, the X matrix itself might look like this:

 | Column 0 | Column 1 |
---|---|---|
Row 0 | 17 | 34 |
Row 1 | 29 | 22 |
Row 2 | 5 | 28 |
In this example, the value in row 0, column 0 of the X matrix corresponds to the expression level of gene APOE in cell AAACATTGAGCTAC.
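The positional convention above can be sketched with plain Python containers. This is only an illustration of the idea, not the anndata API: the lists stand in for the obs index, the var index, and X from the example tables, and lookup is a hypothetical helper.

```python
# Toy stand-ins for the example tables above (plain lists, not AnnData objects).
obs_index = ["AAACATTGAGCTAC", "GATTTAGATTCGTT", "TTTCGAACTCTCAT"]  # row labels
var_index = ["APOE", "ESR1"]  # column labels
X = [
    [17, 34],
    [29, 22],
    [5, 28],
]


def lookup(cell: str, gene: str) -> int:
    """Return the X value for a (cell, gene) pair using positional alignment."""
    row = obs_index.index(cell)  # position of the cell in obs is the row of X
    col = var_index.index(gene)  # position of the gene in var is the column of X
    return X[row][col]


apoe_in_first_cell = lookup("AAACATTGAGCTAC", "APOE")  # row 0, column 0
```

Because the relationship is purely positional, reordering obs_index without reordering the rows of X in the same way would silently mislabel the measurements.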
SOMA conventions
TileDB-SOMA uses an approach that is conceptually similar to that of AnnData (and other annotated matrix software), except that integer IDs are always used to track relationships between different components of a SOMA experiment. These join IDs are always int64 values in the range [0, 2^63-1] and are conventionally, but not necessarily, contiguous starting from 0.
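As a small illustration of these constraints, the following sketch checks whether a value falls in the allowed range and whether a set of IDs happens to follow the contiguous-from-zero convention. Both function names are hypothetical helpers written for this example; they are not part of the TileDB-SOMA API.

```python
INT64_MAX = 2**63 - 1  # largest value an int64 join ID can take


def in_join_id_range(value: int) -> bool:
    """Return True if value lies in the allowed join ID range [0, 2^63-1]."""
    return 0 <= value <= INT64_MAX


def is_contiguous_from_zero(join_ids: list[int]) -> bool:
    """Return True if the IDs are exactly 0, 1, ..., n-1 in some order."""
    return sorted(join_ids) == list(range(len(join_ids)))


conventional = [0, 1, 2]  # follows the usual convention
sparse_ids = [0, 5, 9]    # still valid join IDs, just not contiguous
```

Here sparse_ids passes the range check for every element even though is_contiguous_from_zero(sparse_ids) is False, which mirrors the "conventionally, but not necessarily" wording above.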
Users will most often encounter join IDs in:
- SOMADataFrames, which always include a column called soma_joinid that contains the join IDs for each row.
- SOMASparseNDArrays, which contain one or more dimensions, each with a name like soma_dim_N, where N is the dimension number and the values are the join IDs.
In the context of a SOMA experiment:
- obs is a SOMADataFrame in which the soma_joinid column contains a unique value for each observation:

  soma_joinid | obs_id | n_genes | percent_mito |
  ---|---|---|---|
  0 | AAACATTGAGCTAC | 135 | 0.034 |
  1 | GATTTAGATTCGTT | 24 | 0.022 |
  2 | TTTCGAACTCTCAT | 589 | 0.017 |

- Within each SOMAMeasurement:
  - var is a SOMADataFrame in which the soma_joinid column contains a unique value for each variable:

    soma_joinid | var_id | n_cells |
    ---|---|---|
    0 | APOE | 137 |
    1 | ESR1 | 248 |

  - Each X layer is a SOMASparseNDArray where values in soma_dim_0 map to obs’s soma_joinid column and values in soma_dim_1 map to var’s soma_joinid column:

    X | soma_dim_1=0 | soma_dim_1=1 |
    ---|---|---|
    soma_dim_0=0 | 17 | 34 |
    soma_dim_0=1 | 29 | 22 |
    soma_dim_0=2 | 5 | 28 |

Furthermore:
- obs’s soma_joinid values annotate the row indices of matrices in the obsm collection.
- var’s soma_joinid values annotate the row indices of matrices in the varm collection.
- obs’s soma_joinid values annotate the row and column indices of matrices in the obsp collection.
- var’s soma_joinid values annotate the row and column indices of matrices in the varp collection.
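The mapping described above amounts to a join on soma_joinid. The sketch below reproduces the example data with plain Python structures rather than the TileDB-SOMA API: the X layer is written as sparse (soma_dim_0, soma_dim_1, value) triples, and label_x is a hypothetical helper that resolves each coordinate to its obs_id and var_id.

```python
# obs and var as lists of records keyed by soma_joinid (toy data from the tables above).
obs = [
    {"soma_joinid": 0, "obs_id": "AAACATTGAGCTAC"},
    {"soma_joinid": 1, "obs_id": "GATTTAGATTCGTT"},
    {"soma_joinid": 2, "obs_id": "TTTCGAACTCTCAT"},
]
var = [
    {"soma_joinid": 0, "var_id": "APOE"},
    {"soma_joinid": 1, "var_id": "ESR1"},
]

# One X layer as sparse triples: (soma_dim_0, soma_dim_1, value).
x_triples = [
    (0, 0, 17), (0, 1, 34),
    (1, 0, 29), (1, 1, 22),
    (2, 0, 5), (2, 1, 28),
]

# Index both dataframes by join ID so each coordinate can be resolved directly.
obs_by_joinid = {row["soma_joinid"]: row for row in obs}
var_by_joinid = {row["soma_joinid"]: row for row in var}


def label_x(triples):
    """Replace the join IDs in each triple with the matching obs_id and var_id."""
    return [
        (obs_by_joinid[d0]["obs_id"], var_by_joinid[d1]["var_id"], value)
        for d0, d1, value in triples
    ]


labeled = label_x(x_triples)
```

Unlike the positional AnnData-style lookup, this join does not depend on the order in which the obs or var rows are stored, only on the soma_joinid values themselves.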
Conclusion
From a user’s perspective, the join IDs are mostly abstracted away by the TileDB-SOMA API. For example, when using the provided ingestors (e.g., tiledbsoma.io.from_anndata() in Python and tiledbsoma::write_soma.Seurat() in R), the join IDs are automatically generated. However, understanding the concept of join IDs is useful for working with the data programmatically or when extending the SOMA data model. You can learn more about the use of join IDs in the SOMA API specification.
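To make the idea of generated join IDs concrete, here is a minimal sketch that assigns 0-based integer IDs to string labels by position. It is an assumption made for illustration and does not describe the internals of the TileDB-SOMA ingestors; assign_join_ids is a hypothetical helper.

```python
def assign_join_ids(labels: list[str]) -> dict[str, int]:
    """Map each unique string label to an integer ID equal to its position."""
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique to receive distinct join IDs")
    return {label: position for position, label in enumerate(labels)}


# String-indexed observations, as in the AnnData example above.
barcodes = ["AAACATTGAGCTAC", "GATTTAGATTCGTT", "TTTCGAACTCTCAT"]
join_ids = assign_join_ids(barcodes)
```

With this scheme the resulting IDs are contiguous from 0, matching the convention described in the SOMA conventions section.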