Appending Data to a SOMA Experiment

Extend an existing SOMA experiment with new data.
Warning

This feature is currently limited to Python.

Overview

The ability to continuously update datasets with new information is crucial in single-cell research. Whether you’re part of a lab that regularly sequences new samples or building an atlas from independent studies, efficiently appending new data is key. In this tutorial, you will go through the process of adding new cells to an existing SOMA experiment.

Details

TileDB-SOMA supports extending an existing SOMA experiment with new observations, variables, or both, from an in-memory AnnData object or an on-disk H5AD file. The ingestor assumes the datasets have been standardized and follow the same schema as the original experiment. Specifically:

  • obs and var must contain the same set of columns as the original experiment, with identical data types.
  • X, obsm, and varm arrays must use the same data types as the original experiment.
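
Before registering a new dataset, it can help to verify these requirements up front. The following is a minimal sketch, not part of the TileDB-SOMA API; check_append_schema, ad_orig, and ad_new are illustrative names.

def check_append_schema(ad_orig, ad_new):
    # Same set of obs/var columns with identical dtypes.
    assert set(ad_new.obs.columns) == set(ad_orig.obs.columns)
    assert all(ad_new.obs.dtypes[c] == ad_orig.obs.dtypes[c] for c in ad_orig.obs.columns)
    assert set(ad_new.var.columns) == set(ad_orig.var.columns)
    assert all(ad_new.var.dtypes[c] == ad_orig.var.dtypes[c] for c in ad_orig.var.columns)
    # Same X value dtype.
    assert ad_new.X.dtype == ad_orig.X.dtype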

Prerequisites

{{< include /_includes/environment-variables.qmd >}}

Additionally, the following environment variables must be defined with your own values before running the examples below.

  • TILEDB_ACCOUNT with your TileDB account name.
  • S3_BUCKET with the name of the destination S3 bucket.
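
For example, you could set them at the top of the notebook before running the rest of the code. The values below are placeholders, not real account or bucket names.

import os

# Placeholder values; replace with your own TileDB account and S3 bucket.
os.environ.setdefault("TILEDB_ACCOUNT", "my-tiledb-account")
os.environ.setdefault("S3_BUCKET", "my-s3-bucket")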

Setup

Import tiledbsoma and a few other packages necessary for this tutorial.

import os

import scanpy as sc
import tiledb.cloud
import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging

tiledbsoma.show_package_versions()
tiledbsoma.__version__              1.15.7
TileDB core version (libtiledbsoma) 2.27.0
python version                      3.9.20.final.0
OS version                          Linux 5.10.230-223.885.amzn2.x86_64

Next, define where the SOMA experiment will be stored. For this tutorial, the experiment is registered under your TileDB account and stored in your S3 bucket; any experiment left over from a previous run is deleted first.

# Set the TileDB REST API server address to which you'll connect
tiledb_server_uri = os.environ["TILEDB_REST_API_SERVER_ADDRESS"]

tiledb.default_ctx(tiledb.Config({"rest.server_address": tiledb_server_uri}))

TILEDB_ACCOUNT = os.environ.get("TILEDB_ACCOUNT")
S3_BUCKET = os.environ.get("S3_BUCKET")
EXPERIMENT_NAME = "soma-exp-pbmc3k-append-data"
MEASUREMENT_NAME = "RNA"

EXPERIMENT_URI = f"tiledb://{TILEDB_ACCOUNT}/{S3_BUCKET}/{EXPERIMENT_NAME}"

# Remove any experiment left over from a previous run so the tutorial
# starts from a clean slate.
try:
    tiledb.cloud.asset.info(uri=EXPERIMENT_URI)
except Exception:
    print("Experiment doesn't exist. Continuing...")
else:
    tiledb.cloud.asset.delete(uri=EXPERIMENT_URI, recursive=True)
EXPERIMENT_URI
Experiment doesn't exist. Continuing...
'tiledb://tiledb-academy-ci/s3://tiledb-academy-ci/soma-exp-pbmc3k-append-data'

Create the initial SOMA experiment

To make things convenient for this self-contained demo, you will use Scanpy’s pbmc3k dataset, a small dataset of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor. The data is then processed with sc.pp.calculate_qc_metrics() to populate the obs and var dataframes with quality control metrics.

ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)
ad1
AnnData object with n_obs × n_vars = 2700 × 32738
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

To better differentiate between the initial and appended data, add a new obs column containing the day of the week the data was collected.

ad1.obs["day"] = ["Monday"] * ad1.n_obs

Use tiledbsoma’s AnnData ingestor to create the new SOMA experiment from the pbmc3k dataset.

# Enable INFO-level logging to print progress during ingestion.
tiledbsoma.logging.info()

tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI,
    measurement_name=MEASUREMENT_NAME,
    anndata=ad1,
)
'tiledb://tiledb-academy-ci/s3://tiledb-academy-ci/soma-exp-pbmc3k-append-data'

Inspect the initial SOMA experiment

Now read back the data to inspect obs, var, and X.

obs

Read the relevant attributes from the obs array within the SOMA experiment:

  • soma_joinid contains the unique identifier for each cell that indexes the rows of each X layer.
  • obs_id contains the cell barcodes, which all end with -1 in this initial dataset.
  • day is the new column added to the obs dataframe; all cells have the value Monday.

There are 2,700 total cells in the initial dataset.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(
        exp.obs.read(column_names=["soma_joinid", "obs_id", "day"])
        .concat()
        .to_pandas(),
    )
      soma_joinid            obs_id     day
0               0  AAACATACAACCAC-1  Monday
1               1  AAACATTGAGCTAC-1  Monday
2               2  AAACATTGATCAGC-1  Monday
3               3  AAACCGTGCTTCCG-1  Monday
4               4  AAACCGTGTATGCG-1  Monday
...           ...               ...     ...
2695         2695  TTTCGAACTCTCAT-1  Monday
2696         2696  TTTCTACTGAGGCA-1  Monday
2697         2697  TTTCTACTTCCTCG-1  Monday
2698         2698  TTTGCATGAGAGGC-1  Monday
2699         2699  TTTGCATGCCTCAC-1  Monday

[2700 rows x 3 columns]

var

Now examine relevant attributes from the var array:

  • soma_joinid contains the unique identifier for each feature that indexes the columns of each X layer.
  • var_id contains gene symbols.

There are 32,738 genes in the initial dataset.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(
        exp.ms[MEASUREMENT_NAME]
        .var.read(column_names=["soma_joinid", "var_id"])
        .concat()
        .to_pandas(),
    )
       soma_joinid        var_id
0                0    MIR1302-10
1                1       FAM138A
2                2         OR4F5
3                3  RP11-34P13.7
4                4  RP11-34P13.8
...            ...           ...
32733        32733    AC145205.1
32734        32734         BAGE5
32735        32735    CU459201.1
32736        32736    AC002321.2
32737        32737    AC002321.1

[32738 rows x 2 columns]

X layer

Lastly, examine the expression matrix in COO format. The rows (soma_dim_0) and columns (soma_dim_1) are indexed by the soma_joinid of the obs and var arrays, respectively.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(exp.ms["RNA"].X["data"].read().tables().concat().to_pandas())
         soma_dim_0  soma_dim_1  soma_data
0                 0          70        1.0
1                 0         166        1.0
2                 0         178        2.0
3                 0         326        1.0
4                 0         363        1.0
...             ...         ...        ...
2286879        2699       32697        1.0
2286880        2699       32698        7.0
2286881        2699       32702        1.0
2286882        2699       32705        1.0
2286883        2699       32708        3.0

[2286884 rows x 3 columns]

Create a new dataset to append

Now, to simulate a dataset from a second sequencing run, create a new AnnData object with the same schema as the original experiment by copying and modifying the original dataset.

First, increment the barcode suffix from -1 to -2 in the obs dataframe.

ad2 = ad1.copy()
ad2.obs.index = ad2.obs.index.str.replace("-1", "-2")

Update values in the day column from Monday to Tuesday.

ad2.obs["day"] = ["Tuesday"] * ad2.n_obs

Multiply values in X by 10.

ad2.X *= 10

The new dataset will have the same number of genes but a different set of cells.
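
As a quick sanity check (a sketch for illustration; the ingestor does not require it), you can confirm the cell barcodes are disjoint while the gene IDs are identical:

# New cells, same genes: barcodes must not overlap, gene IDs must match.
assert set(ad1.obs.index).isdisjoint(ad2.obs.index)
assert list(ad1.var.index) == list(ad2.var.index)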

Ingest the new dataset

Before the new dataset can be ingested into the existing SOMA experiment, a registration step is required to detect which, if any, cell and gene IDs are new.

Tip

You can also use tiledbsoma.io.register_h5ads() to register a new dataset stored in an H5AD file.
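
For example, a hypothetical on-disk file could be registered like this (new_run.h5ad is a placeholder path):

# Hypothetical example: register an H5AD file instead of an in-memory
# AnnData object ("new_run.h5ad" is a placeholder path).
rd_h5ad = tiledbsoma.io.register_h5ads(
    experiment_uri=EXPERIMENT_URI,
    h5ad_file_names=["new_run.h5ad"],
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)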

rd = tiledbsoma.io.register_anndatas(
    experiment_uri=EXPERIMENT_URI,
    adatas=[ad2],
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
# Use one of the following, depending on your TileDB-SOMA version.
# TileDB-SOMA 1.16.2 and above:
rd.prepare_experiment(EXPERIMENT_URI)
# TileDB-SOMA 1.16.1 and below:
tiledbsoma.io.resize_experiment(
    EXPERIMENT_URI,
    nobs=rd.get_obs_shape(),
    nvars=rd.get_var_shapes(),
)
True
tiledbsoma.io.show_experiment_shapes(EXPERIMENT_URI)

[DataFrame] obs 
  URI tiledb://tiledb-academy-ci/aee2cd83-579d-42e0-92be-d0d89e4d2a46
  non_empty_domain     ((0, 2699),)
  domain               ((0, 5399),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[DataFrame] ms/RNA/var 
  URI tiledb://tiledb-academy-ci/24f0684d-a693-4c17-bbdf-a860adaafad7
  non_empty_domain     ((0, 32737),)
  domain               ((0, 32737),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data 
  URI tiledb://tiledb-academy-ci/91287c64-91d6-461d-93a3-08693b8d2213
  used_shape           ((0, 2699), (0, 32732))
  shape                (5400, 32738)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True
True

The experiment shapes above show that appending the new dataset to the existing SOMA experiment will result in a total of 5,400 cells, while the number of genes will remain at 32,738.

With the registration complete, the new dataset can be ingested into the existing SOMA experiment using the same function that created the initial experiment. The only difference is that the ExperimentAmbientLabelMapping object returned by the registration step is passed to the registration_mapping argument. As of TileDB-SOMA 1.15, with the new shape feature, the experiment must first be resized (or prepared), as done above.

tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI,
    anndata=ad2,
    measurement_name=MEASUREMENT_NAME,
    registration_mapping=rd,
)
'tiledb://tiledb-academy-ci/s3://tiledb-academy-ci/soma-exp-pbmc3k-append-data'

Since the new dataset contained new cells but the same set of genes, the var array was unchanged, while the obs and X arrays grew downward with new rows.

If the dataset had contained new genes, the var array would also grow downward with new rows and the X layer would grow right with new columns.
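
To confirm the append, you can read obs back and count the cells per day, mirroring the earlier read (a sketch; at this point you would expect 2,700 cells each for Monday and Tuesday):

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    obs = exp.obs.read(column_names=["day"]).concat().to_pandas()

# Expect 2,700 cells each for Monday and Tuesday (5,400 total).
print(obs["day"].value_counts())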

Append multiple datasets to a SOMA Experiment

It’s also possible to append multiple datasets to a SOMA experiment. The process is very similar to the single-dataset case:

  1. One call to register_anndatas() (or register_h5ads()), passing all input AnnData objects (or H5AD files).
  2. One call to from_anndata() (or from_h5ad()) for each input AnnData object.

Use the make_adata() helper function to simulate multiple sequencing runs. As before, where the pbmc3k dataset was used to simulate Monday and Tuesday data, this time the helper function will simulate Wednesday, Thursday, and Friday data. It’s been a busy week!

def make_adata(day, scale, obs_id_suffix):
    ad = ad1.copy()
    ad.obs.index = ad.obs.index.str.replace("-1", obs_id_suffix)
    ad.obs["day"] = [day] * ad.n_obs
    ad.X *= scale
    return ad


ads = [
    make_adata(day, scale, f"-{idx + 3}")
    for idx, (day, scale) in enumerate(
        {
            "Wednesday": 20,
            "Thursday": 30,
            "Friday": 40,
        }.items(),
    )
]

ads
[AnnData object with n_obs × n_vars = 2700 × 32738
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'day'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
 AnnData object with n_obs × n_vars = 2700 × 32738
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'day'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
 AnnData object with n_obs × n_vars = 2700 × 32738
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'day'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts']

Register all of the new AnnData objects at once.

rd2 = tiledbsoma.io.register_anndatas(
    experiment_uri=EXPERIMENT_URI,
    adatas=ads,
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
# Use one of the following, depending on your TileDB-SOMA version.
# TileDB-SOMA 1.16.2 and above:
rd2.prepare_experiment(EXPERIMENT_URI)
# TileDB-SOMA 1.16.1 and below:
tiledbsoma.io.resize_experiment(
    EXPERIMENT_URI,
    nobs=rd2.get_obs_shape(),
    nvars=rd2.get_var_shapes(),
)
True

Now that the datasets have all been registered, they can be ingested into the existing SOMA experiment one at a time.

Tip

This process could be parallelized by having multiple workers ingest the datasets in parallel, one worker per AnnData object, as long as the registration data are passed to each worker. A sketch of this approach appears after the loop below.

for ad in ads:
    tiledbsoma.io.from_anndata(
        experiment_uri=EXPERIMENT_URI,
        anndata=ad,
        measurement_name=MEASUREMENT_NAME,
        registration_mapping=rd2,
    )
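
As a sketch of the parallel approach mentioned in the tip above, you could replace the loop with a thread pool (run one or the other, not both). This is an illustration, not a tested recipe; it assumes each worker shares the registration mapping rd2.

from concurrent.futures import ThreadPoolExecutor

def ingest_one(ad):
    # Each worker ingests one AnnData object, reusing the shared
    # registration mapping computed above.
    return tiledbsoma.io.from_anndata(
        experiment_uri=EXPERIMENT_URI,
        anndata=ad,
        measurement_name=MEASUREMENT_NAME,
        registration_mapping=rd2,
    )

with ThreadPoolExecutor(max_workers=len(ads)) as pool:
    list(pool.map(ingest_one, ads))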

Reading back the concatenated data, you can observe 2,700 rows for each day of the week, for a total of 13,500 cells.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    obs = (
        exp.obs.read(column_names=["soma_joinid", "obs_id", "day"]).concat().to_pandas()
    )

obs["day"].value_counts()
Monday       2700
Tuesday      2700
Wednesday    2700
Thursday     2700
Friday       2700
Name: day, dtype: int64

Cleanup

To remove this dataset from your TileDB account and physically delete it from S3, call the tiledb.cloud.asset.delete() function with recursive=True.

tiledb.cloud.asset.delete(EXPERIMENT_URI, recursive=True)