Since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial introduces you to using TileDB-SOMA with Amazon S3, allowing you to leverage the scalability and flexibility of S3 as a storage backend for your single-cell data. By the end of this tutorial, you will be able to ingest, query, and manage SOMA experiments stored on S3 with ease.
The examples included here are similar to those in the Data Ingestion and Data Access tutorials, as the focus of this tutorial is to highlight the extra steps required to use S3.
Before you can run the examples in this tutorial, make sure you have the following prerequisites:
An AWS account.
An Amazon S3 bucket.
The AWS credentials required to access the bucket.
Note
For more details on TileDB’s support for Amazon S3, as well as information about how to use the underlying core TileDB engine with other object stores, visit the Advanced Backends section.
Setup
Load the tiledbsoma package and a few other packages to complete this tutorial.
import os

import scanpy as sc
import tiledb
import tiledbsoma
import tiledbsoma.io

tiledbsoma.show_package_versions()
tiledbsoma.__version__ 1.11.4
TileDB-Py version 0.29.0
TileDB core version (tiledb) 2.23.1
TileDB core version (libtiledbsoma) 2.23.1
python version 3.9.19.final.0
OS version Linux 6.8.0-1013-aws
tiledbsoma: 1.11.4
tiledb-r: 0.27.0
tiledb core: 2.23.1
libtiledbsoma: 2.23.1
R: R version 4.3.3 (2024-02-29)
OS: Debian GNU/Linux 11 (bullseye)
Your starting point will be the pbmc3k dataset, which contains 2,700 peripheral blood mononuclear cells (PBMC) from a healthy donor. The raw data was generated by 10X Genomics and is available from 10X’s website. The version of the dataset you will use here was processed with this scanpy notebook.
Download an RDS file containing a Seurat version of the dataset described earlier, which has been made available on TileDB Cloud using the Files feature, and load it into your R environment.
rds_uri <- "tiledb://TileDB-Inc/scanpy_pbmc3k_processed_rds"
rds_path <- file.path(tempdir(), "pbmc3k_processed.rds")

if (!file.exists(rds_path)) {
  if (!tiledb_filestore_uri_export(rds_path, rds_uri)) {
    stop("Failed to export RDS file from TileDB Cloud")
  }
}

pbmc3k <- readRDS(rds_path)
pbmc3k
An object of class Seurat
1838 features across 2638 samples within 1 assay
Active assay: RNA (1838 features, 0 variable features)
2 layers present: counts, data
4 dimensional reductions calculated: umap, tsne, draw_graph_fr, pca
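If you are following along in Python rather than R, one convenient option is scanpy's bundled pbmc3k_processed dataset, which comes from the same processing workflow. This is a minimal sketch under that assumption; the bundled copy may differ slightly from the RDS file exported above.

# A minimal Python sketch: load a processed AnnData copy of pbmc3k.
# scanpy ships a pre-processed version of this dataset, which is used here
# in place of the Seurat/RDS file shown for R.
adata = sc.datasets.pbmc3k_processed()
adata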
Authenticate
For TileDB-SOMA to access S3, you must provide the S3 bucket URI and region (e.g., us-east-1 or us-west-2), as well as your AWS credentials.
Avoid storing private information such as AWS credentials directly in your notebook, as this risks leaking them. Instead, store them as environment variables and read them from within your code.
In this tutorial, the URI of the S3 bucket is stored in the S3_BUCKET environment variable and its region in S3_REGION. Define these variables in your environment, with values for your own bucket, before running the following code.
# Get the keys from the environment variables.
config = {
    "vfs.s3.aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
    "vfs.s3.aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    "vfs.s3.region": os.environ.get("S3_REGION"),
}

s3_bucket = os.environ.get("S3_BUCKET")
# Get the keys from the environment variables.
config <- list(
  vfs.s3.aws_access_key_id = Sys.getenv("AWS_ACCESS_KEY_ID"),
  vfs.s3.aws_secret_access_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
  vfs.s3.region = Sys.getenv("S3_REGION")
)

s3_bucket <- Sys.getenv("S3_BUCKET")
Pass the AWS keys and region to the TileDB-SOMA context constructor.
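In Python, for example, this amounts to constructing a SOMATileDBContext from the config dictionary defined above (a minimal sketch):

# Build a SOMA context that carries the S3 credentials and region,
# so the SOMA operations below can reach the bucket.
ctx = tiledbsoma.SOMATileDBContext(tiledb_config=config)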
The ingestion process is the same as in the Data Ingestion tutorial. The only differences are the S3 URI and the context object, which carries the Amazon S3 credentials.
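As a Python sketch, assuming the AnnData object from the earlier step is in adata and using soma-exp-pbmc3k as an arbitrary experiment name inside the bucket:

# Hypothetical experiment name; any key within your bucket works.
EXPERIMENT_URI = f"{s3_bucket}/soma-exp-pbmc3k"

# Ingest the AnnData object into a new SOMA experiment on S3, passing
# the context that holds the Amazon S3 credentials.
tiledbsoma.io.from_anndata(
    EXPERIMENT_URI,
    adata,
    measurement_name="RNA",
    context=ctx,
)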
The EXPERIMENT_URI now points to the new SOMA experiment on S3. You can verify this using TileDB’s VFS to list the contents of the bucket. Note that the context object must be passed to the VFS constructor to access the bucket.
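A minimal Python sketch of this check, using the config dictionary and s3_bucket defined earlier (the S3 settings are passed here as a TileDB config; supplying the context's underlying TileDB context works as well):

# List the contents of the bucket with TileDB's virtual filesystem (VFS).
# The VFS needs the same S3 settings in order to see the bucket.
vfs = tiledb.VFS(config=tiledb.Config(config))
vfs.ls(s3_bucket)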
In this tutorial, you successfully created, accessed, and managed a SOMA experiment on Amazon S3. These skills allow you to seamlessly integrate TileDB-SOMA with cloud storage, providing a scalable and efficient solution for managing your single-cell experiments.