The ability to continuously update datasets with new information is crucial in single-cell research. Whether you're part of a lab that regularly sequences new samples or building an atlas from independent studies, efficiently appending data is key. In this tutorial, you will go through the process of adding new cells to an existing SOMA experiment.
Details
TileDB-SOMA supports extending an existing SOMA experiment with new observations and/or variables from an in-memory AnnData object or an on-disk H5AD file. The ingestor assumes the datasets have been standardized and follow the same schema as the original experiment. Specifically:
obs and var must contain the same set of columns as the original experiment with identical data types.
X, obsm and varm arrays must use the same data type as the original experiment.
Prerequisites
{{< include /_includes/tiledb-cloud-rest-token.qmd >}}
However, this is not necessary when running on TileDB Cloud, where the REST API token is automatically generated and configured for you.
Additionally, the following environment variables must be defined with custom values before running the examples (one way to read them in Python is sketched after this list).
S3_BUCKET with the URI for the destination S3 bucket.
S3_REGION with the region of the destination S3 bucket.
TILEDB_NAMESPACE with the TileDB Cloud account name.
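For use later in the tutorial, you can read these into Python variables; a minimal sketch using the standard library, assuming the three variables above are already set in your shell:

```python
import os

# Read the required configuration from the environment.
S3_BUCKET = os.environ["S3_BUCKET"]
S3_REGION = os.environ["S3_REGION"]
TILEDB_NAMESPACE = os.environ["TILEDB_NAMESPACE"]
```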
Setup
Import tiledbsoma and a few other packages necessary for this tutorial.
```python
import os

import scanpy as sc
import tiledb.cloud
import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging

tiledbsoma.show_package_versions()
```
```
tiledbsoma.__version__                1.15.0rc4
TileDB core version (libtiledbsoma)   2.27.0
python version                        3.11.9.final.0
OS version                            Darwin 24.1.0
```
Next, define where the SOMA experiment will be stored. For the purpose of this tutorial, you will use a temporary directory.
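A minimal setup sketch; the names EXPERIMENT_URI and MEASUREMENT_NAME are reused throughout this tutorial, and the exact values here are illustrative:

```python
import tempfile

# Illustrative location for this self-contained demo; on TileDB Cloud you
# would instead build a tiledb:// URI from TILEDB_NAMESPACE and S3_BUCKET.
EXPERIMENT_URI = os.path.join(tempfile.mkdtemp(), "soma-append-tutorial")
MEASUREMENT_NAME = "RNA"
```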
To make things convenient for this self-contained demo, you will use Scanpy's pbmc3k dataset, which is a small dataset containing 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor. The data is processed to calculate various quality-control metrics that populate the obs and var dataframes.
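A sketch of the initial ingestion under those assumptions; the exact preprocessing may differ from the original notebook, and the day column is added here because later steps rely on it:

```python
# Load the demo dataset and compute basic QC metrics to populate
# obs and var (a representative choice of preprocessing).
ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)

# Label this run; later sections append Tuesday through Friday data.
ad1.obs["day"] = ["Monday"] * ad1.n_obs

# Create the SOMA experiment from the AnnData object.
tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI,
    anndata=ad1,
    measurement_name=MEASUREMENT_NAME,
)
```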
Now examine relevant attributes from the var array:
soma_joinid contains the unique identifier for each feature that indexes the columns of each X layer.
var_id contains gene symbols.
There are 32,738 genes in the initial dataset.
```python
with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(
        exp.ms[MEASUREMENT_NAME]
        .var.read(column_names=["soma_joinid", "var_id"])
        .concat()
        .to_pandas()
    )
```
Lastly, examine the expression matrix in COO format. The rows (soma_dim_0) and columns (soma_dim_1) are indexed by the soma_joinid of the obs and var arrays, respectively.
```python
with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    print(exp.ms["RNA"].X["data"].read().tables().concat().to_pandas())
```
Now, to simulate a dataset from a second sequencing run to be appended to the existing SOMA experiment, create a new AnnData object with the same schema as the original experiment by modifying the original dataset.
First, increment the barcode suffix from -1 to -2 in the obs dataframe.
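For example, assuming the original AnnData is named ad1 as in the sketch above (pbmc3k barcodes end in -1, so a simple string replacement works):

```python
# Start from a copy of the original AnnData and relabel the barcodes.
ad2 = ad1.copy()
ad2.obs.index = ad2.obs.index.str.replace("-1", "-2")
```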
Update values in the day column from Monday to Tuesday.
ad2.obs["day"] = ["Tuesday"] * ad2.n_obs
Multiply values in X by 10.
```python
ad2.X *= 10
```
The new dataset will have the same number of genes but a different set of cells.
Ingest the new dataset
Before the new dataset can be ingested into the existing SOMA experiment, a registration step is required to detect which, if any, cell and gene IDs are new.
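A sketch of the registration call, assuming obs_id and var_id as the join-key columns written by the initial ingestion:

```python
# Enable info-level logging so the registration step reports
# the resulting experiment shape.
tiledbsoma.logging.info()

rd = tiledbsoma.io.register_anndatas(
    EXPERIMENT_URI,
    [ad2],
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
```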
Tip
You can also use tiledbsoma.io.register_h5ads() to register a new dataset stored in an H5AD file.
Logs from the registration step indicate that appending the new dataset to the existing SOMA experiment will result in a total of 5,400 cells, while the number of genes will remain at 32,738.
With the registration complete, the new dataset can be ingested into the existing SOMA experiment using the same function used to create the initial experiment. The only difference is that the ExperimentAmbientLabelMapping object is passed to the registration_mapping argument.
As of TileDB-SOMA 1.15, which introduced the new shape feature, you first need to resize the experiment to fit the registered data.
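A sketch of the resize followed by the ingestion; resize_experiment and the mapping's get_obs_shape()/get_var_shapes() accessors are assumed from the 1.15 shape API:

```python
# Grow obs/var (and the X layers) to the shape implied by the
# registration before writing the new rows.
tiledbsoma.io.resize_experiment(
    EXPERIMENT_URI,
    nobs=rd.get_obs_shape(),
    nvars=rd.get_var_shapes(),
)

# Same call that created the experiment, now with the mapping attached.
tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI,
    anndata=ad2,
    measurement_name=MEASUREMENT_NAME,
    registration_mapping=rd,
)
```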
Since the new dataset contained new cells but the same set of genes, the var array was unchanged, while the obs and X arrays grew downward with new rows.
If the dataset had contained new genes, the var array would also have grown downward with new rows, and the X layers would have grown to the right with new columns.
Appending multiple datasets to a SOMA Experiment
It’s also possible to append multiple datasets to a SOMA experiment. The process is very similar to the single-dataset case:
One call to register_anndatas (or register_h5ads) passing all input AnnDatas/H5ADs
One call to from_anndata (or from_h5ad) for each input AnnData
Use the make_adata() helper function to simulate multiple sequencing runs. As before, where the pbmc3k dataset was used to simulate Monday and Tuesday data, this time the helper function will simulate Wednesday, Thursday, and Friday data. It's been a busy week!
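The helper's body isn't shown in this excerpt; a plausible sketch, mirroring the Tuesday modifications, followed by one registration call covering all three datasets (rd2 is the mapping the ingestion loop below expects):

```python
def make_adata(day, suffix):
    # Hypothetical implementation: derive a new run from the original
    # pbmc3k AnnData by relabeling barcodes and stamping a new day value.
    ad = ad1.copy()
    ad.obs.index = ad.obs.index.str.replace("-1", f"-{suffix}")
    ad.obs["day"] = [day] * ad.n_obs
    return ad

ads = [
    make_adata(day, suffix)
    for suffix, day in enumerate(["Wednesday", "Thursday", "Friday"], start=3)
]

# Register all three AnnDatas in a single call.
rd2 = tiledbsoma.io.register_anndatas(
    EXPERIMENT_URI,
    ads,
    measurement_name=MEASUREMENT_NAME,
    obs_field_name="obs_id",
    var_field_name="var_id",
)

# As with the single-dataset case, resize to the registered shape
# (1.15 shape-API names assumed).
tiledbsoma.io.resize_experiment(
    EXPERIMENT_URI,
    nobs=rd2.get_obs_shape(),
    nvars=rd2.get_var_shapes(),
)
```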
Now that the datasets have all been registered, they can be ingested into the existing SOMA experiment one at a time.
Tip
This process could be parallelized by having multiple workers ingest the datasets in parallel, one worker per AnnData object, as long as the registration data are passed to each worker.
```python
for ad in ads:
    tiledbsoma.io.from_anndata(
        experiment_uri=EXPERIMENT_URI,
        anndata=ad,
        measurement_name=MEASUREMENT_NAME,
        registration_mapping=rd2,
    )
```
Reading back the concatenated data, you can observe 2,700 rows for each day of the week, for a total of 13,500 cells.
```python
with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    obs = (
        exp.obs.read(column_names=["soma_joinid", "obs_id", "day"])
        .concat()
        .to_pandas()
    )

obs["day"].value_counts()
```
To remove this dataset from your TileDB Cloud account and physically delete it from S3, you can call the delete method provided by the tiledb.cloud package.
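A sketch, assuming the experiment was registered in your TileDB Cloud namespace under the illustrative name used above; the asset-level delete path is an assumption for recent tiledb-cloud-py releases:

```python
# Deletes the asset registration and, with recursive=True, the
# underlying group contents (assumed tiledb.cloud.asset.delete API).
tiledb.cloud.asset.delete(
    f"tiledb://{TILEDB_NAMESPACE}/soma-append-tutorial",
    recursive=True,
)
```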