Join IDs

Learn about the use of join IDs in TileDB-SOMA.

As described in the Data Model section, an annotated matrix, which is the core structure SOMA encapsulates, consists of multiple array types:

  • The obs dataframe with information about observations (e.g., cells).
  • The var dataframe with information about variables (e.g., transcripts).
  • The two-dimensional X matrix containing the actual measurements, where one dimension corresponds to observations and the other to variables (e.g., a cell-by-gene count matrix).

For single-cell data, this core structure is extended to include additional arrays for storing derived data, such as PCA coordinates, UMAP coordinates, and pairwise connectivities or distances.

The coordinates in X (and the other arrays) need to line up with the information in the obs and var dataframes to maintain the integrity of the assay data and its annotations.

AnnData conventions

This section examines the conventions used in the AnnData Python package as an example of how other software tracks relationships between the components of an annotated matrix.

In the AnnData world, the following indexing conventions apply:

  • obs has an index column, nominally a string column, often containing cell barcodes.
  • var has an index column, nominally a string column, generally containing Ensembl or NCBI identifiers.
  • X is integer-indexed. The obs and var dataframes are row and column annotations for the indices of the X matrix.
  • Similarly:
    • obs positions annotate the row indices of matrices in the obsm collection.
    • var positions annotate the row indices of matrices in the varm collection.
    • obs positions annotate the row and column indices of matrices in the obsp collection.
    • var positions annotate the row and column indices of matrices in the varp collection.

For example, consider the following obs dataframe from an AnnData object. Values in the obs_id column are used to index the rows of the X matrix.

obs_id (index)    n_genes   percent_mito   Note
AAACATTGAGCTAC    135       0.034          This is implicitly row 0
GATTTAGATTCGTT    24        0.022          This is implicitly row 1
TTTCGAACTCTCAT    589       0.017          This is implicitly row 2

Similarly, the var dataframe uses values in the var_id column to index the columns of the X matrix.

var_id (index)    n_cells   Note
APOE              137       This is implicitly row 0 of var (column 0 of X)
ESR1              248       This is implicitly row 1 of var (column 1 of X)

Finally, the X matrix itself might look like this:

         Column 0   Column 1
Row 0    17         34
Row 1    29         22
Row 2    5          28

In this example, the value in row 0, column 0 of the X matrix corresponds to the expression level of gene APOE in cell AAACATTGAGCTAC.
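
The positional convention above can be sketched in plain Python. This is an illustrative sketch only: lists stand in for the AnnData obs index, var index, and X matrix, and the helper name expression is hypothetical, not part of the AnnData API.

```python
# Plain-Python stand-ins for the example tables above (not the AnnData API).
obs_index = ["AAACATTGAGCTAC", "GATTTAGATTCGTT", "TTTCGAACTCTCAT"]
var_index = ["APOE", "ESR1"]

# X is addressed purely by position: X[i][j] is row i (obs), column j (var).
X = [
    [17, 34],
    [29, 22],
    [5, 28],
]


def expression(obs_id: str, var_id: str) -> int:
    """Hypothetical helper: translate labels to positions, then index X."""
    row = obs_index.index(obs_id)  # position of the cell in obs
    col = var_index.index(var_id)  # position of the gene in var
    return X[row][col]


print(expression("AAACATTGAGCTAC", "APOE"))  # prints 17
```

Because the link between obs, var, and X is purely positional in this sketch, reordering obs_index without reordering the rows of X would silently change which cell each row describes.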

SOMA conventions

TileDB-SOMA uses an approach that is conceptually similar to that of AnnData (and other annotated-matrix software), except that integer IDs are always used to track relationships between the components of a SOMA experiment. These join IDs are always int64 values in the range [0, 2^63-1] and are conventionally, but not necessarily, contiguous starting from 0.
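
As a minimal sketch of these constraints, the following check accepts any set of integers in the non-negative int64 range, whether or not they are contiguous. It also assumes that IDs do not repeat within one dataframe. The function name check_joinids is hypothetical and not part of TileDB-SOMA.

```python
MAX_JOINID = 2**63 - 1  # largest value representable as a signed 64-bit integer


def check_joinids(joinids: list[int]) -> bool:
    """Hypothetical check: every ID is in [0, 2^63-1] and no ID repeats."""
    in_range = all(isinstance(j, int) and 0 <= j <= MAX_JOINID for j in joinids)
    unique = len(set(joinids)) == len(joinids)  # assumption: one ID per row
    return in_range and unique


print(check_joinids([0, 1, 2]))    # contiguous from 0: the usual convention
print(check_joinids([10, 42, 7]))  # non-contiguous IDs are still valid
print(check_joinids([-1, 0]))      # negative values are out of range
print(check_joinids([0, 2**63]))   # exceeds the int64 maximum
```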

Users will most often encounter join IDs in:

  • SOMADataFrames, which always include a column called soma_joinid that contains the join IDs for each row.
  • SOMASparseNDArrays, which contain one or more dimensions, each with a name like soma_dim_N, where N is the dimension number and the values are the join IDs.

In the context of a SOMA experiment:

  • obs is a SOMADataFrame in which the soma_joinid column contains a unique value for each observation:

    soma_joinid   obs_id           n_genes   percent_mito
    0             AAACATTGAGCTAC   135       0.034
    1             GATTTAGATTCGTT   24        0.022
    2             TTTCGAACTCTCAT   589       0.017
  • Within each SOMAMeasurement:

    • var is a SOMADataFrame in which the soma_joinid column contains a unique value for each variable:

      soma_joinid   var_id   n_cells
      0             APOE     137
      1             ESR1     248
    • Each X layer is a SOMASparseNDArray where values in soma_dim_0 map to obs’s soma_joinid column and values in soma_dim_1 map to var’s soma_joinid column:

      X              soma_dim_1=0   soma_dim_1=1
      soma_dim_0=0   17             34
      soma_dim_0=1   29             22
      soma_dim_0=2   5              28
    • Furthermore:

      • obs’s soma_joinid values annotate the row indices of matrices in the obsm collection.
      • var’s soma_joinid values annotate the row indices of matrices in the varm collection.
      • obs’s soma_joinid values annotate the row and column indices of matrices in the obsp collection.
      • var’s soma_joinid values annotate the row and column indices of matrices in the varp collection.
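
The relationships above amount to a join on soma_joinid. The sketch below illustrates this in plain Python, with dictionaries and coordinate triples standing in for the SOMADataFrame and SOMASparseNDArray objects. It does not use the TileDB-SOMA API, and the variable names are illustrative.

```python
# obs and var keyed by soma_joinid (values from the example tables above).
obs = {0: "AAACATTGAGCTAC", 1: "GATTTAGATTCGTT", 2: "TTTCGAACTCTCAT"}
var = {0: "APOE", 1: "ESR1"}

# A sparse X layer as (soma_dim_0, soma_dim_1, value) triples.
x_layer = [
    (0, 0, 17), (0, 1, 34),
    (1, 0, 29), (1, 1, 22),
    (2, 0, 5), (2, 1, 28),
]

# Join each stored cell to its annotations through the join IDs.
joined = [(obs[d0], var[d1], value) for d0, d1, value in x_layer]

for obs_id, var_id, value in joined:
    print(obs_id, var_id, value)
```

Because each lookup goes through soma_joinid rather than through position, the result in this sketch does not depend on the order in which the triples or the dataframe rows are stored.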

Conclusion

From a user’s perspective, the join IDs are mostly abstracted away by the TileDB-SOMA API. For example, when using the provided ingestors (e.g., tiledbsoma.io.from_anndata() in Python and tiledbsoma::write_soma.Seurat() in R), the join IDs are automatically generated. However, understanding the concept of join IDs is useful for working with the data programmatically or when extending the SOMA data model. You can learn more about the use of join IDs in the SOMA API specification.
