Appending Data to a SOMA Experiment

Extend an existing SOMA experiment with new data.
Warning

This feature is currently limited to Python.

Overview

The ability to continuously update datasets with new information is crucial in single-cell research. Whether you’re part of a lab that regularly sequences new samples or building an atlas from independent studies, efficiently appending new data is key. In this tutorial, you will go through the process of adding new cells to an existing SOMA experiment.

Details

TileDB-SOMA supports extending an existing SOMA experiment with new observations, variables, or both, from an in-memory AnnData object or an on-disk H5AD file. The ingestor assumes the datasets have been standardized and follow the same schema as the original experiment. Specifically:

  • obs and var must contain the same set of columns as the original experiment, with identical data types.
  • X, obsm, and varm arrays must use the same data types as the original experiment.
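
Before registering a new dataset, it can help to verify these requirements up front. The following is a minimal sketch, not part of the TileDB-SOMA API; check_append_schema, ad_orig, and ad_new are illustrative names.

def check_append_schema(ad_orig, ad_new):
    # Same set of obs/var columns with identical dtypes.
    assert set(ad_new.obs.columns) == set(ad_orig.obs.columns)
    assert all(ad_new.obs.dtypes[c] == ad_orig.obs.dtypes[c] for c in ad_orig.obs.columns)
    assert set(ad_new.var.columns) == set(ad_orig.var.columns)
    assert all(ad_new.var.dtypes[c] == ad_orig.var.dtypes[c] for c in ad_orig.var.columns)
    # Same X value dtype.
    assert ad_new.X.dtype == ad_orig.X.dtype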

Prerequisites

{{< include /_includes/environment-variables.qmd >}}

Additionally, the following environment variables must be defined with your own values before running the examples below.

  • TILEDB_ACCOUNT with your TileDB account name.
  • S3_BUCKET with the name of the destination S3 bucket.
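
For example, you could set them at the top of the notebook before running the rest of the code. The values below are placeholders, not real account or bucket names.

import os

# Placeholder values; replace with your own TileDB account and S3 bucket.
os.environ.setdefault("TILEDB_ACCOUNT", "my-tiledb-account")
os.environ.setdefault("S3_BUCKET", "my-s3-bucket")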

Setup

Import tiledbsoma and a few other packages necessary for this tutorial.

import os

import scanpy as sc
import tiledb.cloud
import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging

tiledbsoma.show_package_versions()
tiledbsoma.__version__              1.15.7
TileDB core version (libtiledbsoma) 2.27.0
python version                      3.9.20.final.0
OS version                          Linux 5.10.230-223.885.amzn2.x86_64

Next, define where the SOMA experiment will be stored. For this tutorial, the experiment is registered under your TileDB account and stored in your S3 bucket; any experiment left over from a previous run is deleted first.

# Set the TileDB REST API server address to which you'll connect
tiledb_server_uri = os.environ["TILEDB_REST_API_SERVER_ADDRESS"]

tiledb.default_ctx(tiledb.Config({"rest.server_address": tiledb_server_uri}))

TILEDB_ACCOUNT = os.environ.get("TILEDB_ACCOUNT")
S3_BUCKET = os.environ.get("S3_BUCKET")
EXPERIMENT_NAME = "soma-exp-pbmc3k-append-data"
MEASUREMENT_NAME = "RNA"

EXPERIMENT_URI = f"tiledb://{TILEDB_ACCOUNT}/{S3_BUCKET}/{EXPERIMENT_NAME}"

# Remove any experiment left over from a previous run so the tutorial
# starts from a clean slate.
try:
    tiledb.cloud.asset.info(uri=EXPERIMENT_URI)
except Exception:
    print("Experiment doesn't exist. Continuing...")
else:
    tiledb.cloud.asset.delete(uri=EXPERIMENT_URI, recursive=True)
EXPERIMENT_URI
Experiment doesn't exist. Continuing...
'tiledb://tiledb-academy-ci/s3://tiledb-academy-ci/soma-exp-pbmc3k-append-data'

Create the initial SOMA experiment

To make things convenient for this self-contained demo, you will use Scanpy’s pbmc3k dataset, a small dataset of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor. The data is then processed with sc.pp.calculate_qc_metrics() to populate the obs and var dataframes with quality control metrics.

ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)
ad1
AnnData object with n_obs × n_vars = 2700 × 32738
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

To better differentiate between the initial and appended data, add a new obs column containing the day of the week the data was collected.

ad1.obs["day"] = ["Monday"] * ad1.n_obs

Use tiledbsoma’s AnnData ingestor to create the new SOMA experiment from the pbmc3k dataset.

# Enable INFO-level logging to print progress during ingestion.
tiledbsoma.logging.info()

tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI,
    measurement_name=MEASUREMENT_NAME,
    anndata=ad1,
)
'tiledb://tiledb-academy-ci/s3://tiledb-academy-ci/soma-exp-pbmc3k-append-data'

Inspect the initial SOMA experiment

Now read back the data to inspect obs, var, and X.

obs

Read the relevant attributes from the obs array within the SOMA experiment:

  • soma_joinid contains the unique identifier for each cell that indexes the rows of each X layer.
  • obs_id contains the cell barcodes, which all end with -1 in this initial dataset.
  • day is the new column added to the obs dataframe; all cells have the value Monday.

There are 2,700 total cells in the initial dataset.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(
        exp.obs.read(column_names=["soma_joinid", "obs_id", "day"])
        .concat()
        .to_pandas(),
    )
      soma_joinid            obs_id     day
0               0  AAACATACAACCAC-1  Monday
1               1  AAACATTGAGCTAC-1  Monday
2               2  AAACATTGATCAGC-1  Monday
3               3  AAACCGTGCTTCCG-1  Monday
4               4  AAACCGTGTATGCG-1  Monday
...           ...               ...     ...
2695         2695  TTTCGAACTCTCAT-1  Monday
2696         2696  TTTCTACTGAGGCA-1  Monday
2697         2697  TTTCTACTTCCTCG-1  Monday
2698         2698  TTTGCATGAGAGGC-1  Monday
2699         2699  TTTGCATGCCTCAC-1  Monday

[2700 rows x 3 columns]

var

Now examine relevant attributes from the var array:

  • soma_joinid contains the unique identifier for each feature that indexes the columns of each X layer.
  • var_id contains gene symbols.

There are 32,738 genes in the initial dataset.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(
        exp.ms[MEASUREMENT_NAME]
        .var.read(column_names=["soma_joinid", "var_id"])
        .concat()
        .to_pandas(),
    )
       soma_joinid        var_id
0                0    MIR1302-10
1                1       FAM138A
2                2         OR4F5
3                3  RP11-34P13.7
4                4  RP11-34P13.8
...            ...           ...
32733        32733    AC145205.1
32734        32734         BAGE5
32735        32735    CU459201.1
32736        32736    AC002321.2
32737        32737    AC002321.1

[32738 rows x 2 columns]

X layer

Lastly, examine the expression matrix in COO format. The rows (soma_dim_0) and columns (soma_dim_1) are indexed by the soma_joinid of the obs and var arrays, respectively.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(exp.ms["RNA"].X["data"].read().tables().concat().to_pandas())
         soma_dim_0  soma_dim_1  soma_data
0                 0          70        1.0
1                 0         166        1.0
2                 0         178        2.0
3                 0         326        1.0
4                 0         363        1.0
...             ...         ...        ...
2286879        2699       32697        1.0
2286880        2699       32698        7.0
2286881        2699       32702        1.0
2286882        2699       32705        1.0
2286883        2699       32708        3.0

[2286884 rows x 3 columns]

Create a new dataset to append

Now, to simulate a dataset from a second sequencing run, create a new AnnData object with the same schema as the original experiment by copying and modifying the original dataset.

First, increment the barcode suffix from -1 to -2 in the obs dataframe.

ad2 = ad1.copy()
ad2.obs.index = ad2.obs.index.str.replace("-1", "-2")

Update values in the day column from Monday to Tuesday.

ad2.obs["day"] = ["Tuesday"] * ad2.n_obs

Multiply values in X by 10.

ad2.X *= 10

The new dataset will have the same number of genes but a different set of cells.
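
As a quick sanity check (a sketch for illustration; the ingestor does not require it), you can confirm the cell barcodes are disjoint while the gene IDs are identical:

# New cells, same genes: barcodes must not overlap, gene IDs must match.
assert set(ad1.obs.index).isdisjoint(ad2.obs.index)
assert list(ad1.var.index) == list(ad2.var.index)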

Ingest the new dataset

Before the new dataset can be ingested into the existing SOMA experiment, a registration step is required to detect which, if any, cell and gene IDs are new.

Tip

You can also use tiledbsoma.io.register_h5ads() to register a new dataset stored in an H5AD file.
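
For example, a hypothetical on-disk file could be registered like this (new_run.h5ad is a placeholder path):

# Hypothetical example: register an H5AD file instead of an in-memory
# AnnData object ("new_run.h5ad" is a placeholder path).
rd_h5ad = tiledbsoma.io.register_h5ads(
    experiment_uri=EXPERIMENT_URI,
    h5ad_file_names=["new_run.h5ad"],
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)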

rd = tiledbsoma.io.register_anndatas(
    experiment_uri=EXPERIMENT_URI,
    adatas=[ad2],
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
# Use one of the following, depending on your TileDB-SOMA version.
# TileDB-SOMA 1.16.2 and above:
rd.prepare_experiment(EXPERIMENT_URI)
# TileDB-SOMA 1.16.1 and below:
tiledbsoma.io.resize_experiment(
    EXPERIMENT_URI,
    nobs=rd.get_obs_shape(),
    nvars=rd.get_var_shapes(),
)
True
tiledbsoma.io.show_experiment_shapes(EXPERIMENT_URI)

[DataFrame] obs 
  URI tiledb://tiledb-academy-ci/aee2cd83-579d-42e0-92be-d0d89e4d2a46
  non_empty_domain     ((0, 2699),)
  domain               ((0, 5399),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[DataFrame] ms/RNA/var 
  URI tiledb://tiledb-academy-ci/24f0684d-a693-4c17-bbdf-a860adaafad7
  non_empty_domain     ((0, 32737),)
  domain               ((0, 32737),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data 
  URI tiledb://tiledb-academy-ci/91287c64-91d6-461d-93a3-08693b8d2213
  used_shape           ((0, 2699), (0, 32732))
  shape                (5400, 32738)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True
True

The experiment shapes above show that appending the new dataset to the existing SOMA experiment will result in a total of 5,400 cells, while the number of genes will remain at 32,738.

With the registration complete, the new dataset can be ingested into the existing SOMA experiment using the same function that created the initial experiment. The only difference is that the ExperimentAmbientLabelMapping object returned by the registration step is passed to the registration_mapping argument. As of TileDB-SOMA 1.15, with the new shape feature, the experiment must first be resized (or prepared), as done above.

tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI,
    anndata=ad2,
    measurement_name=MEASUREMENT_NAME,
    registration_mapping=rd,
)
'tiledb://tiledb-academy-ci/s3://tiledb-academy-ci/soma-exp-pbmc3k-append-data'

Since the new dataset contained new cells but the same set of genes, the var array was unchanged, while the obs and X arrays grew downward with new rows.

If the dataset had contained new genes, the var array would also grow downward with new rows and the X layer would grow right with new columns.
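
To confirm the append, you can read obs back and count the cells per day, mirroring the earlier read (a sketch; at this point you would expect 2,700 cells each for Monday and Tuesday):

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    obs = exp.obs.read(column_names=["day"]).concat().to_pandas()

# Expect 2,700 cells each for Monday and Tuesday (5,400 total).
print(obs["day"].value_counts())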

Append multiple datasets to a SOMA Experiment

It’s also possible to append multiple datasets to a SOMA experiment. The process is very similar to the single-dataset case:

  1. One call to register_anndatas() (or register_h5ads()), passing all input AnnData objects (or H5AD files).
  2. One call to from_anndata() (or from_h5ad()) for each input AnnData object.

Use the make_adata() helper function to simulate multiple sequencing runs. As before, where the pbmc3k dataset was used to simulate Monday and Tuesday data, this time the helper function will simulate Wednesday, Thursday, and Friday data. It’s been a busy week!

def make_adata(day, scale, obs_id_suffix):
    ad = ad1.copy()
    ad.obs.index = ad.obs.index.str.replace("-1", obs_id_suffix)
    ad.obs["day"] = [day] * ad.n_obs
    ad.X *= scale
    return ad


ads = [
    make_adata(day, scale, f"-{idx + 3}")
    for idx, (day, scale) in enumerate(
        {
            "Wednesday": 20,
            "Thursday": 30,
            "Friday": 40,
        }.items(),
    )
]

ads
[AnnData object with n_obs × n_vars = 2700 × 32738
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'day'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
 AnnData object with n_obs × n_vars = 2700 × 32738
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'day'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
 AnnData object with n_obs × n_vars = 2700 × 32738
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'day'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts']

Register all of the new AnnData objects at once.

rd2 = tiledbsoma.io.register_anndatas(
    experiment_uri=EXPERIMENT_URI,
    adatas=ads,
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
# Use one of the following, depending on your TileDB-SOMA version.
# TileDB-SOMA 1.16.2 and above:
rd2.prepare_experiment(EXPERIMENT_URI)
# TileDB-SOMA 1.16.1 and below:
tiledbsoma.io.resize_experiment(
    EXPERIMENT_URI,
    nobs=rd2.get_obs_shape(),
    nvars=rd2.get_var_shapes(),
)
True

Now that the datasets have all been registered, they can be ingested into the existing SOMA experiment one at a time.

Tip

This process could be parallelized by having multiple workers ingest the datasets in parallel, one worker per AnnData object, as long as the registration data are passed to each worker. A sketch of this approach appears after the loop below.

for ad in ads:
    tiledbsoma.io.from_anndata(
        experiment_uri=EXPERIMENT_URI,
        anndata=ad,
        measurement_name=MEASUREMENT_NAME,
        registration_mapping=rd2,
    )
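
As a sketch of the parallel approach mentioned in the tip above, you could replace the loop with a thread pool (run one or the other, not both). This is an illustration, not a tested recipe; it assumes each worker shares the registration mapping rd2.

from concurrent.futures import ThreadPoolExecutor

def ingest_one(ad):
    # Each worker ingests one AnnData object, reusing the shared
    # registration mapping computed above.
    return tiledbsoma.io.from_anndata(
        experiment_uri=EXPERIMENT_URI,
        anndata=ad,
        measurement_name=MEASUREMENT_NAME,
        registration_mapping=rd2,
    )

with ThreadPoolExecutor(max_workers=len(ads)) as pool:
    list(pool.map(ingest_one, ads))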

Reading back the concatenated data, you can observe 2,700 rows for each day of the week, for a total of 13,500 cells.

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    obs = (
        exp.obs.read(column_names=["soma_joinid", "obs_id", "day"]).concat().to_pandas()
    )

obs["day"].value_counts()
Monday       2700
Tuesday      2700
Wednesday    2700
Thursday     2700
Friday       2700
Name: day, dtype: int64

Cleanup

To remove this dataset from your TileDB account and physically delete it from S3, call the tiledb.cloud.asset.delete() function with recursive=True.

tiledb.cloud.asset.delete(EXPERIMENT_URI, recursive=True)