Join IDs
As described in the Data Model section, an annotated matrix, which is the core structure SOMA encapsulates, consists of multiple array types:
- The obs dataframe with information about observations (e.g., cells).
- The var dataframe with information about variables (e.g., transcripts).
- The two-dimensional X matrix containing the actual measurements, where one dimension corresponds to observations and the other to variables (e.g., a cell-by-gene count matrix).
For single-cell data, this core structure is extended to include additional arrays for storing derived data, such as PCA coordinates, UMAP coordinates, and pairwise connectivities or distances.
The coordinates in X (and the other arrays) need to line up with the information in the obs and var dataframes to maintain the integrity of the assay data and its annotations.
AnnData conventions
This section examines the conventions used in the AnnData Python package as an example of how other software tracks relationships between the components of an annotated matrix.
In the AnnData world, the following indexing conventions apply:
- obs has an index column, nominally a string column, often containing cell barcodes.
- var has an index column, nominally a string column, generally containing Ensembl or NCBI identifiers.
- X is integer-indexed. The obs and var dataframes are row and column annotations for the indices of the X matrix.
- Similarly:
  - obs positions annotate the row indices of matrices in the obsm collection.
  - var positions annotate the row indices of matrices in the varm collection.
  - obs positions annotate the row and column indices of matrices in the obsp collection.
  - var positions annotate the row and column indices of matrices in the varp collection.
For example, consider the following obs dataframe from an AnnData object. Values in the obs_id column are used to index the rows of the X matrix.
| obs_id (index) | n_genes | percent_mito | Note |
|---|---|---|---|
| AAACATTGAGCTAC | 135 | 0.034 | This is implicitly row 0 |
| GATTTAGATTCGTT | 24 | 0.022 | This is implicitly row 1 |
| TTTCGAACTCTCAT | 589 | 0.017 | This is implicitly row 2 |
Similarly, the var dataframe uses values in the var_id column to index the columns of the X matrix.
| var_id (index) | n_cells | Note |
|---|---|---|
| APOE | 137 | This is implicitly row 0 of var (column 0 of X) |
| ESR1 | 248 | This is implicitly row 1 of var (column 1 of X) |
Finally, the X matrix itself might look like this:
| | Column 0 | Column 1 |
|---|---|---|
| Row 0 | 17 | 34 |
| Row 1 | 29 | 22 |
| Row 2 | 5 | 28 |
In this example, the value in row 0, column 0 of the X matrix corresponds to the expression level of gene APOE in cell AAACATTGAGCTAC.
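The positional convention described above can be sketched in plain Python. This is only an illustration built on lists, not the AnnData API; the names obs_index, var_index, and expression are assumptions made for this example.

```python
# Index labels in dataframe order; a label's position is its implicit row number.
obs_index = ["AAACATTGAGCTAC", "GATTTAGATTCGTT", "TTTCGAACTCTCAT"]
var_index = ["APOE", "ESR1"]

# X as a dense list of rows: rows follow obs order, columns follow var order.
X = [
    [17, 34],
    [29, 22],
    [5, 28],
]


def expression(cell: str, gene: str) -> int:
    """Look up a value in X by translating index labels to positions."""
    row = obs_index.index(cell)
    col = var_index.index(gene)
    return X[row][col]


print(expression("AAACATTGAGCTAC", "APOE"))  # 17
print(expression("TTTCGAACTCTCAT", "ESR1"))  # 28
```

Because the link is purely positional in this sketch, reordering obs_index without reordering the rows of X would silently attach values to the wrong cells.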
SOMA conventions
TileDB-SOMA uses an approach that is conceptually similar to AnnData’s (and that of other annotated-matrix software), except that integer IDs are always used to track relationships between the components of a SOMA experiment. These join IDs are always int64 values in the range [0, 2^63-1] and are conventionally, but not necessarily, contiguous starting from 0.
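As a rough illustration of these constraints, the following sketch checks a list of candidate join IDs for a dataframe's soma_joinid column. The helper check_join_ids is hypothetical and is not part of the TileDB-SOMA API; the uniqueness check reflects the dataframe case, where each row carries its own join ID.

```python
INT64_MAX = 2**63 - 1


def check_join_ids(join_ids):
    """Raise ValueError unless every join ID is a unique integer in [0, 2^63 - 1]."""
    seen = set()
    for jid in join_ids:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(jid, int) or isinstance(jid, bool):
            raise ValueError(f"join ID {jid!r} is not an integer")
        if not 0 <= jid <= INT64_MAX:
            raise ValueError(f"join ID {jid} is outside [0, 2^63 - 1]")
        if jid in seen:
            raise ValueError(f"join ID {jid} is duplicated")
        seen.add(jid)


check_join_ids([0, 1, 2])    # contiguous from 0: the usual convention
check_join_ids([10, 42, 7])  # gaps and arbitrary order also pass
```

Note that the second call succeeds: contiguity is a convention, not a requirement.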
Users will most often encounter join IDs in:
- SOMADataFrames, which always include a column called soma_joinid that contains the join IDs for each row.
- SOMASparseNDArrays, which contain one or more dimensions, each with a name like soma_dim_N, where N is the dimension number and the values are the join IDs.
In the context of a SOMA experiment:
- obs is a SOMADataFrame in which the soma_joinid column contains a unique value for each observation:

  | soma_joinid | obs_id | n_genes | percent_mito |
  |---|---|---|---|
  | 0 | AAACATTGAGCTAC | 135 | 0.034 |
  | 1 | GATTTAGATTCGTT | 24 | 0.022 |
  | 2 | TTTCGAACTCTCAT | 589 | 0.017 |

- Within each SOMAMeasurement:
  - var is a SOMADataFrame in which the soma_joinid column contains a unique value for each variable:

    | soma_joinid | var_id | n_cells |
    |---|---|---|
    | 0 | APOE | 137 |
    | 1 | ESR1 | 248 |

  - Each X layer is a SOMASparseNDArray where values in soma_dim_0 map to obs’s soma_joinid column and values in soma_dim_1 map to var’s soma_joinid column:

    | X | soma_dim_1=0 | soma_dim_1=1 |
    |---|---|---|
    | soma_dim_0=0 | 17 | 34 |
    | soma_dim_0=1 | 29 | 22 |
    | soma_dim_0=2 | 5 | 28 |

  - Furthermore:
    - obs’s soma_joinid values annotate the row indices of matrices in the obsm collection
    - var’s soma_joinid values annotate the row indices of matrices in the varm collection
    - obs’s soma_joinid values annotate the row and column indices of matrices in the obsp collection
    - var’s soma_joinid values annotate the row and column indices of matrices in the varp collection
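The mapping between X coordinates and the obs and var dataframes amounts to a join on join IDs. A minimal sketch in plain Python, using the example values from this section, is shown here. The dictionaries, the coordinate-list layout, and the label_entries helper are assumptions made for illustration and do not use the TileDB-SOMA API.

```python
# obs and var keyed by soma_joinid (values taken from the example tables).
obs = {0: "AAACATTGAGCTAC", 1: "GATTTAGATTCGTT", 2: "TTTCGAACTCTCAT"}
var = {0: "APOE", 1: "ESR1"}

# X in coordinate form: (soma_dim_0, soma_dim_1, value) for each stored entry.
x_coo = [
    (0, 0, 17), (0, 1, 34),
    (1, 0, 29), (1, 1, 22),
    (2, 0, 5), (2, 1, 28),
]


def label_entries(coo, obs_by_joinid, var_by_joinid):
    """Attach obs_id and var_id labels to each entry by joining on join IDs."""
    return [
        (obs_by_joinid[d0], var_by_joinid[d1], value)
        for d0, d1, value in coo
    ]


labeled = label_entries(x_coo, obs, var)
print(labeled[0])  # ('AAACATTGAGCTAC', 'APOE', 17)
```

Since each lookup goes through the join ID rather than a row position, the order in which obs or var rows are stored does not affect the result in this sketch.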
Conclusion
From a user’s perspective, the join IDs are mostly abstracted away by the TileDB-SOMA API. For example, when using the provided ingestors (e.g., tiledbsoma.io.from_anndata() in Python and tiledbsoma::write_soma.Seurat() in R), the join IDs are automatically generated. However, understanding the concept of join IDs is useful for working with the data programmatically or when extending the SOMA data model. You can learn more about the use of join IDs in the SOMA API specification.
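Conceptually, generating join IDs automatically can be as simple as numbering rows in order. The sketch below assumes the conventional contiguous numbering from 0; it is not the implementation used by tiledbsoma.io.from_anndata(), and the helper name assign_join_ids is hypothetical.

```python
def assign_join_ids(index_labels):
    """Pair each label with a contiguous join ID starting from 0 (hypothetical helper)."""
    return [
        {"soma_joinid": joinid, "obs_id": label}
        for joinid, label in enumerate(index_labels)
    ]


rows = assign_join_ids(["AAACATTGAGCTAC", "GATTTAGATTCGTT", "TTTCGAACTCTCAT"])
for row in rows:
    print(row["soma_joinid"], row["obs_id"])
```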