Basic S3 Example with Single-Cell Data

Learn how to use TileDB-SOMA with Amazon S3.

How to run this tutorial

You can run this tutorial in two ways:

Locally on your machine.
On TileDB Cloud.

However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.

This tutorial shows how to use TileDB-SOMA with Amazon S3 as the storage backend for your single-cell data. By the end, you will be able to ingest, query, and delete SOMA experiments stored on S3.

The examples included here are similar to those in the Data Ingestion and Data Access tutorials, as the focus of this tutorial is to highlight the extra steps required to use S3.

Before you can run the examples in this tutorial, make sure you have the following prerequisites:

An AWS account.
An Amazon S3 bucket.
The AWS credentials required to access the bucket.

Note

For more details on TileDB’s support for Amazon S3, as well as information about how to use the underlying core TileDB engine with other object stores, visit the Advanced Backends section.

Setup

Load the tiledbsoma package and a few other packages to complete this tutorial.

Python
R

import os

import scanpy as sc
import tiledb
import tiledbsoma
import tiledbsoma.io

tiledbsoma.show_package_versions()

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.1
TileDB core version (libtiledbsoma) 2.23.1
python version                      3.9.19.final.0
OS version                          Linux 6.8.0-1013-aws

library(tiledb)
library(tiledbsoma)
suppressPackageStartupMessages(library(Seurat))

show_package_versions()

tiledbsoma:    1.11.4
tiledb-r:      0.27.0
tiledb core:   2.23.1
libtiledbsoma: 2.23.1
R:             R version 4.3.3 (2024-02-29)
OS:            Debian GNU/Linux 11 (bullseye)

Your starting point will be the pbmc3k dataset, which contains 2,700 peripheral blood mononuclear cells (PBMC) from a healthy donor. The raw data was generated by 10X Genomics and is available from 10X’s website. The version of the dataset you will use here was processed with this scanpy notebook.

Python
R

Download and load the pbmc3k dataset using the scanpy package.

adata = sc.datasets.pbmc3k_processed()
adata

AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Download an RDS file containing a Seurat version of the dataset described earlier, which has been made available on TileDB Cloud using the Files feature, and load it into your R environment.

rds_uri <- "tiledb://TileDB-Inc/scanpy_pbmc3k_processed_rds"
rds_path <- file.path(tempdir(), "pbmc3k_processed.rds")

if (!file.exists(rds_path)) {
  if (!tiledb_filestore_uri_export(rds_path, rds_uri)) {
    stop("Failed to export RDS file from TileDB Cloud")
  }
}

pbmc3k <- readRDS(rds_path)
pbmc3k

An object of class Seurat 
1838 features across 2638 samples within 1 assay 
Active assay: RNA (1838 features, 0 variable features)
 2 layers present: counts, data
 4 dimensional reductions calculated: umap, tsne, draw_graph_fr, pca

Authenticate

For TileDB-SOMA to access S3, you must provide the S3 bucket URI and region (e.g., us-east-1 or us-west-2), as well as your credentials.

It’s crucial to avoid storing private information such as AWS credentials directly in your notebook to protect against potential security leaks. Instead, store them securely as environment variables and access them within your code. This practice helps keep your sensitive information safe.

This tutorial reads the S3 bucket URI from the S3_BUCKET environment variable and the region from S3_REGION. The credentials come from AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Define all four variables in your environment, with your own values, before running the following code.

Python
R

# Read the AWS credentials and region from environment variables.
config = {
    "vfs.s3.aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
    "vfs.s3.aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    "vfs.s3.region": os.environ.get("S3_REGION"),
}

s3_bucket = os.environ.get("S3_BUCKET")

# Read the AWS credentials and region from environment variables.
config <- list(
  vfs.s3.aws_access_key_id = Sys.getenv("AWS_ACCESS_KEY_ID"),
  vfs.s3.aws_secret_access_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
  vfs.s3.region = Sys.getenv("S3_REGION")
)

s3_bucket <- Sys.getenv("S3_BUCKET")
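In Python, os.environ.get returns None for an unset variable, so a missing credential would only surface later, when the configuration is used. The following is a minimal sketch of a stricter variant that fails early. The helper name build_s3_config and the placeholder values are assumptions made for illustration; they are not part of TileDB-SOMA.

```python
import os
from typing import Mapping

# Environment variable names used in this tutorial, mapped to the
# TileDB config keys they populate.
ENV_TO_CONFIG_KEY = {
    "AWS_ACCESS_KEY_ID": "vfs.s3.aws_access_key_id",
    "AWS_SECRET_ACCESS_KEY": "vfs.s3.aws_secret_access_key",
    "S3_REGION": "vfs.s3.region",
}


def build_s3_config(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: build the config dict, failing early on unset variables."""
    missing = [name for name in ENV_TO_CONFIG_KEY if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for name, key in ENV_TO_CONFIG_KEY.items()}


# Demonstration with placeholder values instead of real credentials.
example_env = {
    "AWS_ACCESS_KEY_ID": "placeholder-id",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
    "S3_REGION": "us-east-1",
}
example_config = build_s3_config(example_env)
print(sorted(example_config))
```

Calling build_s3_config() with no argument reads from the real environment, and the resulting dictionary has the same keys as the config shown above.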

Pass the AWS keys and region to the TileDB-SOMA context constructor.

Python
R

ctx = tiledbsoma.SOMATileDBContext(tiledb_config=config)

ctx <- tiledbsoma::SOMATileDBContext$new(config = config)

You will need to provide the context object to any TileDB or TileDB-SOMA function that interacts with S3.

Ingest

Create a URI for the new SOMA experiment by appending an experiment name to the S3 bucket.

Python
R

EXPERIMENT_URI = f"{s3_bucket}/soma-exp-pbmc3k"
EXPERIMENT_URI

's3://tiledb-aaron/academy/soma-exp-pbmc3k'

EXPERIMENT_URI <- sprintf("%s/soma-exp-pbmc3k", s3_bucket)
EXPERIMENT_URI

's3://tiledb-aaron/academy/soma-exp-pbmc3k'
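If the value in S3_BUCKET ends with a slash, plain string formatting produces a double slash in the URI. A minimal sketch of a more defensive join follows. The helper name join_s3_uri and the bucket name example-bucket are hypothetical.

```python
S3_SCHEME = "s3://"


def join_s3_uri(base_uri: str, name: str) -> str:
    """Hypothetical helper: append `name` to `base_uri` with exactly one slash."""
    # Require the s3:// scheme and a non-empty bucket name after it.
    if not base_uri.startswith(S3_SCHEME) or not base_uri[len(S3_SCHEME):].strip("/"):
        raise ValueError(f"Expected an s3://bucket[/prefix] URI, got {base_uri!r}")
    return base_uri.rstrip("/") + "/" + name.strip("/")


# Both forms of the base URI yield the same experiment URI.
print(join_s3_uri("s3://example-bucket/prefix", "soma-exp-pbmc3k"))
print(join_s3_uri("s3://example-bucket/prefix/", "soma-exp-pbmc3k"))
```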

The ingestion process is the same as in the Data Ingestion tutorial. The only differences are the S3 URI and the context object, which contains the Amazon S3 credentials.

Python
R

vfs = tiledb.VFS(ctx=ctx.tiledb_ctx)

if vfs.is_dir(EXPERIMENT_URI):
    vfs.remove_dir(EXPERIMENT_URI)

tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI, measurement_name="RNA", anndata=adata, context=ctx
)

's3://tiledb-aaron/academy/soma-exp-pbmc3k'

vfs <- tiledb::tiledb_vfs(ctx = ctx$to_tiledb_context())
if (tiledb::tiledb_vfs_is_dir(uri = EXPERIMENT_URI, vfs = vfs)) {
  tiledb::tiledb_vfs_remove_dir(uri = EXPERIMENT_URI, vfs = vfs)
}

write_soma(pbmc3k, uri = EXPERIMENT_URI, tiledbsoma_ctx = ctx)

's3://tiledb-aaron/academy/soma-exp-pbmc3k'

The EXPERIMENT_URI now points to the new SOMA experiment on S3. You can verify this using TileDB’s VFS to list the contents of the bucket. Note that the context object must be passed to the VFS constructor to access the bucket.

Python
R

vfs.ls(EXPERIMENT_URI)

['s3://tiledb-aaron/academy/soma-exp-pbmc3k/__group',
 's3://tiledb-aaron/academy/soma-exp-pbmc3k/__meta',
 's3://tiledb-aaron/academy/soma-exp-pbmc3k/__tiledb_group.tdb',
 's3://tiledb-aaron/academy/soma-exp-pbmc3k/ms',
 's3://tiledb-aaron/academy/soma-exp-pbmc3k/obs']

tiledb::tiledb_vfs_ls(uri = EXPERIMENT_URI, vfs = vfs)

's3://tiledb-aaron/academy/soma-exp-pbmc3k/__group'
's3://tiledb-aaron/academy/soma-exp-pbmc3k/__meta'
's3://tiledb-aaron/academy/soma-exp-pbmc3k/__tiledb_group.tdb'
's3://tiledb-aaron/academy/soma-exp-pbmc3k/ms'
's3://tiledb-aaron/academy/soma-exp-pbmc3k/obs'
's3://tiledb-aaron/academy/soma-exp-pbmc3k/uns'

The listing shows the ms collection and the obs array at the root of the experiment, which follows SOMA’s data model.
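The same check can be done programmatically on the list of URIs returned by the VFS listing. This is a minimal sketch; the helper name missing_members and the example-bucket URIs are hypothetical, and the function only inspects strings, so it runs without S3 access.

```python
from typing import Iterable, List, Sequence


def missing_members(
    listing: Iterable[str],
    experiment_uri: str,
    expected: Sequence[str] = ("ms", "obs"),
) -> List[str]:
    """Hypothetical helper: return the expected top-level members absent from a listing."""
    prefix = experiment_uri.rstrip("/") + "/"
    # Reduce each listed URI to its name relative to the experiment root.
    present = {
        entry[len(prefix):].strip("/")
        for entry in listing
        if entry.startswith(prefix)
    }
    return [name for name in expected if name not in present]


# Example listing in the same shape as the output shown above.
example_listing = [
    "s3://example-bucket/soma-exp-pbmc3k/__group",
    "s3://example-bucket/soma-exp-pbmc3k/__meta",
    "s3://example-bucket/soma-exp-pbmc3k/ms",
    "s3://example-bucket/soma-exp-pbmc3k/obs",
]
print(missing_members(example_listing, "s3://example-bucket/soma-exp-pbmc3k"))
```

With the tutorial's objects, the call would be missing_members(vfs.ls(EXPERIMENT_URI), EXPERIMENT_URI); an empty list means both members are present.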

Query

You can query the SOMA experiment directly from S3. When opening the experiment, you must provide the context object.

Python
R

with tiledbsoma.Experiment.open(EXPERIMENT_URI, context=ctx) as experiment:
    with experiment.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(
            value_filter="louvain == 'B cells'",
        ),
    ) as query:
        obs = query.obs().concat().to_pandas()

obs

	soma_joinid	obs_id	n_genes	percent_mito	n_counts	louvain
0	1	AAACATTGAGCTAC-1	1352	0.037936	4903.0	B cells
1	10	AAACTTGAAAAACG-1	1116	0.026316	3914.0	B cells
2	18	AAAGGCCTGTCTAG-1	1446	0.015283	4973.0	B cells
3	19	AAAGTTTGATCACG-1	446	0.034700	1268.0	B cells
4	20	AAAGTTTGGGGTGA-1	1020	0.025907	3281.0	B cells
...	...	...	...	...	...	...
337	2628	TTTCAGTGTCACGA-1	700	0.034314	1632.0	B cells
338	2630	TTTCAGTGTGCAGT-1	637	0.018925	1321.0	B cells
339	2634	TTTCTACTGAGGCA-1	1227	0.009294	3443.0	B cells
340	2635	TTTCTACTTCCTCG-1	622	0.021971	1684.0	B cells
341	2636	TTTGCATGAGAGGC-1	454	0.020548	1022.0	B cells

342 rows × 6 columns

experiment <- SOMAExperimentOpen(EXPERIMENT_URI, tiledbsoma_ctx = ctx)

query <- experiment$axis_query(
  measurement_name = "RNA",
  obs_query = SOMAAxisQuery$new(
    value_filter = "louvain == 'B cells'"
  )
)

obs <- query$obs()$concat()$to_data_frame()
obs

A tibble: 342 x 9
soma_joinid	orig.ident	nCount_RNA	nFeature_RNA	n_genes	percent_mito	n_counts	louvain	obs_id
<int>	<fct>	<dbl>	<int>	<int>	<dbl>	<dbl>	<chr>	<chr>
1	SeuratProject	233.96095	249	1352	0.03793596	4903	B cells	AAACATTGAGCTAC-1
10	SeuratProject	191.90643	216	1116	0.02631579	3914	B cells	AAACTTGAAAAACG-1
18	SeuratProject	250.50210	277	1446	0.01528253	4973	B cells	AAAGGCCTGTCTAG-1
19	SeuratProject	73.80223	88	446	0.03470032	1268	B cells	AAAGTTTGATCACG-1
20	SeuratProject	187.42732	207	1020	0.02590674	3281	B cells	AAAGTTTGGGGTGA-1
...	...	...	...	...	...	...	...	...
2628	SeuratProject	113.45525	139	700	0.03431373	1632	B cells	TTTCAGTGTCACGA-1
2630	SeuratProject	96.41425	119	637	0.01892506	1321	B cells	TTTCAGTGTGCAGT-1
2634	SeuratProject	171.67429	193	1227	0.00929422	3443	B cells	TTTCTACTGAGGCA-1
2635	SeuratProject	92.68251	108	622	0.02197150	1684	B cells	TTTCTACTTCCTCG-1
2636	SeuratProject	77.38343	95	454	0.02054795	1022	B cells	TTTGCATGAGAGGC-1
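The value_filter in the queries above is a plain string, so it can be assembled in code when you want to select more than one cluster. The sketch below builds such a string. It assumes that value_filter accepts conditions joined with or, that a helper named equals_any is acceptable for illustration (it is not part of TileDB-SOMA), and that 'NK cells' is one of the louvain labels in this dataset.

```python
from typing import Sequence


def equals_any(column: str, labels: Sequence[str]) -> str:
    """Hypothetical helper: build "col == 'a' or col == 'b'" for use as a value_filter."""
    if not labels:
        raise ValueError("At least one label is required")
    for label in labels:
        # Quote escaping is out of scope for this sketch, so reject such labels.
        if "'" in label or '"' in label:
            raise ValueError(f"Quotes are not supported in labels: {label!r}")
    return " or ".join(f"{column} == '{label}'" for label in labels)


print(equals_any("louvain", ["B cells"]))
print(equals_any("louvain", ["B cells", "NK cells"]))
```

The resulting string would be passed as the value_filter argument in place of the literal used above.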

Cleanup

To clean up, use TileDB’s VFS to delete the experiment from S3.

Python
R

vfs.remove_dir(EXPERIMENT_URI)

tiledb::tiledb_vfs_remove_dir(uri = EXPERIMENT_URI, vfs = vfs)
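Removing a directory on S3 is recursive, so it can be worth confirming that the URI points strictly inside your bucket prefix before deleting it. A minimal sketch of such a guard follows. The helper name is_inside_bucket and the example URIs are hypothetical, and the check is purely string-based.

```python
def is_inside_bucket(uri: str, bucket_uri: str) -> bool:
    """Hypothetical guard: True only if `uri` is strictly below `bucket_uri`."""
    base = bucket_uri.rstrip("/") + "/"
    # The URI must start with the base and name something beneath it.
    return uri.startswith(base) and bool(uri[len(base):].strip("/"))


# An experiment below the prefix passes; the prefix itself does not.
print(is_inside_bucket("s3://example-bucket/academy/soma-exp-pbmc3k", "s3://example-bucket/academy"))
print(is_inside_bucket("s3://example-bucket/academy/", "s3://example-bucket/academy"))
```

In this tutorial, the guard would run as is_inside_bucket(EXPERIMENT_URI, s3_bucket) before the call that removes the directory.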

Summary

In this tutorial, you created, queried, and deleted a SOMA experiment on Amazon S3. The same pattern, an S3 URI plus a context object that carries your credentials and region, applies to any TileDB-SOMA operation that reads from or writes to S3.