Join IDs

Learn about the use of join IDs in TileDB-SOMA.

As described in the Data Model section, an annotated matrix, which is the core structure SOMA encapsulates, consists of multiple array types:

The obs dataframe with information about observations (e.g., cells).
The var dataframe with information about variables (e.g., transcripts).
The two-dimensional X matrix containing the actual measurements, where one dimension corresponds to observations and the other to variables (e.g., a cell-by-gene count matrix).

For single-cell data, this core structure is extended to include additional arrays for storing derived data, such as PCA coordinates, UMAP coordinates, and pairwise connectivities or distances.

The coordinates in X (and the other arrays) need to line up with the information in the obs and var dataframes to maintain the integrity of the assay data and its annotations.

AnnData conventions

This section examines the conventions used in the AnnData Python package as an example of how other software tracks relationships between the components of an annotated matrix.

In the AnnData world, the following indexing conventions apply:

obs has an index column, nominally a string column, often containing cell barcodes.
var has an index column, nominally a string column, generally containing Ensembl or NCBI identifiers.
X is integer-indexed. The obs and var dataframes are row and column annotations for the indices of the X matrix.
Similarly:
- obs positions annotate the row indices of matrices in the obsm collection.
- var positions annotate the row indices of matrices in the varm collection.
- obs positions annotate the row and column indices of matrices in the obsp collection.
- var positions annotate the row and column indices of matrices in the varp collection.

For example, consider the following obs dataframe from an AnnData object. Values in the obs_id column are used to index the rows of the X matrix.

`obs_id` (index)	`n_genes`	`percent_mito`	Note
AAACATTGAGCTAC	135	0.034	This is implicitly row 0
GATTTAGATTCGTT	24	0.022	This is implicitly row 1
TTTCGAACTCTCAT	589	0.017	This is implicitly row 2

Similarly, the var dataframe uses values in the var_id column to index the columns of the X matrix.

`var_id` (index)	`n_cells`	Note
`APOE`	137	This is implicitly row 0 of var (column 0 of X)
`ESR1`	248	This is implicitly row 1 of var (column 1 of X)

Finally, the X matrix itself might look like this:

	Column 0	Column 1
Row 0	17	34
Row 1	29	22
Row 2	5	28

In this example, the value in row 0, column 0 of the X matrix corresponds to the expression level of gene APOE in cell AAACATTGAGCTAC.
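The positional lookup described above can be sketched in plain Python. This is a dependency-free illustration, not the AnnData API: the lists `obs_index` and `var_index` and the helper `lookup` are hypothetical names that stand in for the obs index, the var index, and label-based indexing.

```python
# Plain-Python stand-in for the example tables above (not the AnnData API).
obs_index = ["AAACATTGAGCTAC", "GATTTAGATTCGTT", "TTTCGAACTCTCAT"]  # implicit rows 0..2
var_index = ["APOE", "ESR1"]                                        # implicit columns 0..1
X = [
    [17, 34],
    [29, 22],
    [5, 28],
]


def lookup(obs_id, var_id):
    """Resolve string labels to integer positions, then index X by position."""
    row = obs_index.index(obs_id)  # position of the cell in obs
    col = var_index.index(var_id)  # position of the gene in var
    return X[row][col]


print(lookup("AAACATTGAGCTAC", "APOE"))  # 17
```

Because the link between a label and X is purely positional here, reordering `obs_index` without reordering the rows of `X` would silently break the alignment.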

SOMA conventions

TileDB-SOMA uses an approach that is conceptually similar to that of AnnData (and other annotated-matrix software), except that integer IDs are always used to track relationships between the components of a SOMA experiment. These join IDs are always int64 values in the range [0, 2^63-1] and are conventionally, but not necessarily, contiguous starting from 0.

Users will most often encounter join IDs in:

SOMADataFrames, which always include a column called soma_joinid that contains the join IDs for each row.
SOMASparseNDArrays, which contain one or more dimensions, each with a name like soma_dim_N, where N is the dimension number and the values are the join IDs.
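As a minimal sketch of the constraints stated above, the following plain-Python check accepts any set of integers in [0, 2^63-1], contiguous or not. The function name `validate_joinids` is hypothetical and is not part of the TileDB-SOMA API; the uniqueness check reflects how a dataframe's soma_joinid column identifies rows.

```python
INT64_MAX = 2**63 - 1  # upper bound of the join ID range


def validate_joinids(joinids):
    """Raise ValueError unless every ID is a unique integer in [0, 2^63-1]."""
    seen = set()
    for jid in joinids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(jid, bool) or not isinstance(jid, int):
            raise ValueError(f"join ID {jid!r} is not an integer")
        if not 0 <= jid <= INT64_MAX:
            raise ValueError(f"join ID {jid} is outside [0, 2^63-1]")
        if jid in seen:
            raise ValueError(f"join ID {jid} is duplicated")
        seen.add(jid)
    return True


validate_joinids([0, 1, 2])    # conventional: contiguous from 0
validate_joinids([7, 42, 10])  # also acceptable: non-contiguous, unordered
```

Negative values, values above 2^63-1, and repeated values are rejected with a `ValueError`.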

In the context of a SOMA experiment:

obs is a SOMADataFrame in which the soma_joinid column contains a unique value for each observation:

`soma_joinid`	`obs_id`	`n_genes`	`percent_mito`
0	`AAACATTGAGCTAC`	135	0.034
1	`GATTTAGATTCGTT`	24	0.022
2	`TTTCGAACTCTCAT`	589	0.017

Within each SOMAMeasurement:
- var is a SOMADataFrame in which the soma_joinid column contains a unique value for each variable:
  
  `soma_joinid`	`var_id`	`n_cells`
  0	`APOE`	137
  1	`ESR1`	248
- Each X layer is a SOMASparseNDArray where values in soma_dim_0 map to obs’s soma_joinid column and values in soma_dim_1 map to var’s soma_joinid column:
  
  `X`	`soma_dim_1=0`	`soma_dim_1=1`
  `soma_dim_0=0`	17	34
  `soma_dim_0=1`	29	22
  `soma_dim_0=2`	5	28
- Furthermore:
  - Values in obs’s soma_joinid column annotate the row indices of matrices in the obsm collection
  - Values in var’s soma_joinid column annotate the row indices of matrices in the varm collection
  - Values in obs’s soma_joinid column annotate the row and column indices of matrices in the obsp collection
  - Values in var’s soma_joinid column annotate the row and column indices of matrices in the varp collection
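To illustrate how these join IDs tie the pieces together, here is a plain-Python sketch that joins the sparse X entries from the example with obs and var by soma_joinid. It does not use the TileDB-SOMA API: the dictionaries, the coordinate list `x_coo`, and the helper `label_x` are hypothetical stand-ins for the example data.

```python
# obs and var rows keyed by their soma_joinid (values from the example tables).
obs = {
    0: {"obs_id": "AAACATTGAGCTAC", "n_genes": 135, "percent_mito": 0.034},
    1: {"obs_id": "GATTTAGATTCGTT", "n_genes": 24, "percent_mito": 0.022},
    2: {"obs_id": "TTTCGAACTCTCAT", "n_genes": 589, "percent_mito": 0.017},
}
var = {
    0: {"var_id": "APOE", "n_cells": 137},
    1: {"var_id": "ESR1", "n_cells": 248},
}

# X stored as sparse coordinates: (soma_dim_0, soma_dim_1, value).
x_coo = [
    (0, 0, 17), (0, 1, 34),
    (1, 0, 29), (1, 1, 22),
    (2, 0, 5), (2, 1, 28),
]


def label_x(x_coo, obs, var):
    """Join each X entry to its obs and var rows via the join IDs."""
    return [
        (obs[dim0]["obs_id"], var[dim1]["var_id"], value)
        for dim0, dim1, value in x_coo
    ]


labeled = label_x(x_coo, obs, var)
print(labeled[0])  # ('AAACATTGAGCTAC', 'APOE', 17)
```

Because every lookup goes through the join ID rather than through row position, the result does not depend on the order in which the obs or var rows are stored.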

Conclusion

From a user’s perspective, the join IDs are mostly abstracted away by the TileDB-SOMA API. For example, when using the provided ingestors (e.g., tiledbsoma.io.from_anndata() in Python and tiledbsoma::write_soma.Seurat() in R), the join IDs are automatically generated. However, understanding the concept of join IDs is useful for working with the data programmatically or when extending the SOMA data model. You can learn more about the use of join IDs in the SOMA API specification.