Data Access

Learn how to access TileDB-SOMA data in a variety of ways.

This tutorial outlines how to access single-cell data stored in a SOMA experiment. See the Data Ingestion tutorial for information on how to create a SOMA experiment from a single-cell dataset.

Prerequisites

While you can run this tutorial locally, it relies on remote TileDB Cloud resources to run correctly.

You must create a REST API token and set an environment variable named $TILEDB_REST_TOKEN to the value of your generated token.
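
For example, you can set this variable from Python before accessing any tiledb:// URIs. The following is a minimal sketch; the token value is a placeholder for the token generated in your TileDB account settings.

import os

# Placeholder value: substitute the REST API token generated in your TileDB account settings.
# Set this before any TileDB Cloud calls are made in the session.
os.environ["TILEDB_REST_TOKEN"] = "<your-rest-api-token>"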

Setup

First, load tiledbsoma as well as a few other packages used in this tutorial:

  • Python
  • R
import scanpy as sc
import tiledbsoma
import tiledbsoma.io

tiledbsoma.show_package_versions()
tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.1
TileDB core version (libtiledbsoma) 2.23.1
python version                      3.9.19.final.0
OS version                          Linux 6.8.0-1013-aws
library(tiledb)
library(tiledbsoma)
suppressPackageStartupMessages(library(Seurat))

show_package_versions()
tiledbsoma:    1.11.4
tiledb-r:      0.27.0
tiledb core:   2.23.1
libtiledbsoma: 2.23.1
R:             R version 4.3.3 (2024-02-29)
OS:            Debian GNU/Linux 11 (bullseye)

Dataset

This tutorial uses a dataset from the Tabula Sapiens consortium, which includes nearly 265,000 immune cells across various tissue types. The original H5AD file was downloaded from Figshare and converted into a SOMA experiment using the TileDB-SOMA API. The resulting SOMA experiment is hosted on TileDB Cloud (tabula-sapiens-immune).

This SOMA experiment is accessible programmatically with the following URI:

  • Python
  • R
SOMA_URI = "tiledb://TileDB-Inc/tabula-sapiens-immune"
SOMA_URI <- "tiledb://TileDB-Inc/tabula-sapiens-immune"

Access SOMA components

Open the SOMA experiment in read mode to view its structure:

  • Python
  • R
experiment = tiledbsoma.Experiment.open(SOMA_URI)
experiment
<Experiment 'tiledb://TileDB-Inc/tabula-sapiens-immune' (open for 'r') (2 items)
    'ms': 'tiledb://TileDB-Inc/e19ed185-3710-4542-be4f-a82ce8418fd6' (unopened)
    'obs': 'tiledb://TileDB-Inc/e11d2d07-ab5a-41aa-9408-378802cd4890' (unopened)>
experiment <- SOMAExperimentOpen(SOMA_URI)
experiment
<SOMAExperiment>
  uri: tiledb://TileDB-Inc/tabula-sapiens-immune 
  arrays: obs* 
  groups: ms* 

Note that opening a SOMA experiment (or any SOMA object) only returns a pointer to the object on disk. No data is actually loaded into memory until it’s requested.
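
Because an open experiment holds a live handle, it is good practice to close it when you are done. The following is a minimal Python sketch (separate from the tutorial's experiment variable) using the context-manager form, which closes the object automatically:

# Open the experiment, inspect it, and close it automatically when the block exits.
with tiledbsoma.Experiment.open(SOMA_URI) as exp:
    print(exp.obs.schema)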

The top level of the experiment contains two elements: obs, a SOMA DataFrame containing the cell annotations; and ms, a SOMA Collection of the measurements (e.g., RNA) in the experiment.

You can access the obs array directly with:

  • Python
  • R
experiment.obs
<DataFrame 'tiledb://TileDB-Inc/e11d2d07-ab5a-41aa-9408-378802cd4890' (open for 'r')>
experiment$obs
<SOMADataFrame>
  uri: tiledb://TileDB-Inc/e11d2d07-ab5a-41aa-9408-378802cd4890 
  dimensions: soma_joinid 
  attributes: cell_id, organ_tissue, method, donor, anatomical_information, n_counts_UMIs, ... 

Other elements are nested within the experiment according to the SOMA data model (see the following diagram) but are accessible using a similar syntax.

A diagram with nested rectangles representing the different objects comprising TileDB-SOMA's data model.

For example, feature-level annotations are stored in the var array, which is always located at the top level of each SOMA Measurement. This dataset contains only a single measurement, RNA, but more complex datasets may contain multiple measurements. Access the RNA measurement's var array:

  • Python
  • R
experiment.ms["RNA"].var
<DataFrame 'tiledb://TileDB-Inc/51fd7f27-3d17-49d0-abc3-04efd8fb9712' (open for 'r')>
experiment$ms$get("RNA")$var
<SOMADataFrame>
  uri: tiledb://TileDB-Inc/51fd7f27-3d17-49d0-abc3-04efd8fb9712 
  dimensions: soma_joinid 
  attributes: var_id, gene_symbol, feature_type, ensemblid, highly_variable, means, dispers... 

Similarly, assay data (e.g., RNA expression levels) is stored in SOMASparseNDArrays within the X collection. Each array within X is referred to as a layer. Access the X collection to see what layers are available.

  • Python
  • R
experiment.ms["RNA"].X
<Collection 'tiledb://TileDB-Inc/eed32c99-793e-45c8-9fc4-2d2bfbf1ea75' (open for 'r') (3 items)
    'raw_counts': 'tiledb://TileDB-Inc/c7e36602-3603-43aa-9b74-0702dfc67261' (unopened)
    'decontXcounts': 'tiledb://TileDB-Inc/9e00e1b2-3839-466c-8d84-b563bdc9ad16' (unopened)
    'data': 'tiledb://TileDB-Inc/f831aedb-ec83-4a28-87c9-ebeda0932bce' (unopened)>
experiment$ms$get("RNA")$X
<SOMACollection>
  uri: tiledb://TileDB-Inc/eed32c99-793e-45c8-9fc4-2d2bfbf1ea75 
  arrays: data*, decontXcounts*, raw_counts* 
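
You can also inspect an individual layer before reading any data. The following is a minimal Python sketch, assuming the experiment handle opened above; schema and shape are standard properties of SOMA sparse arrays:

# Inspect one X layer without reading its contents: Arrow schema and nominal shape.
x_data = experiment.ms["RNA"].X["data"]
print(x_data.schema)
print(x_data.shape)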

The next section covers reading data from these components.

Read into memory

All array-based SOMA objects provide a read method for loading data into memory. These methods are designed with large datasets in mind: they always return an iterator, so data can be loaded in chunks that TileDB sizes intelligently to fit the allocated memory, and they materialize results in Apache Arrow format, leveraging zero-copy memory sharing where possible.

In the following example, expression values from the X data layer are loaded into memory one chunk at a time.

  • Python
  • R

The .tables() method is used to materialize each chunk as an Arrow Table. For the sake of this tutorial, the operation is limited to the first 3 chunks.

chunks = []
for chunk in experiment.ms["RNA"].X["data"].read().tables():
    if len(chunks) == 3:
        break
    chunks.append(chunk)

chunks
[pyarrow.Table
 soma_dim_0: int64
 soma_dim_1: int64
 soma_data: float
 ----
 soma_dim_0: [[0,0,0,0,0,...,1941,1941,1941,1941,1941]]
 soma_dim_1: [[38,137,148,197,229,...,25576,25581,25620,25679,25714]]
 soma_data: [[2.3135314,2.017924,1.7682451,2.9146569,1.7524959,...,2.342981,2.5981085,4.8260975,2.4611263,1.7042084]],
 pyarrow.Table
 soma_dim_0: int64
 soma_dim_1: int64
 soma_data: float
 ----
 soma_dim_0: [[1941,1941,1941,1941,1941,...,1832,1832,1832,1832,1832]]
 soma_dim_1: [[25769,25801,25879,25953,25959,...,51698,51760,51772,51790,51920]]
 soma_data: [[2.325368,1.4856703,4.992154,2.4606707,2.4371808,...,4.936844,1.9636455,6.960403,5.834246,1.6239011]],
 pyarrow.Table
 soma_dim_0: int64
 soma_dim_1: int64
 soma_data: float
 ----
 soma_dim_0: [[1832,1832,1832,1832,1832,...,3432,3432,3432,3432,3432]]
 soma_dim_1: [[51925,51930,51951,51994,52036,...,21530,21597,21603,21642,21679]]
 soma_data: [[3.1878636,2.1417744,2.521798,8.600453,3.123161,...,1.8475122,1.0498972,0,2.0656362,2.8646808]]]

The $tables() method is used to materialize each chunk as an Arrow Table. For the sake of this tutorial, the operation is limited to the first 3 chunks.

chunks <- list()

x_reader <- experiment$ms$get("RNA")$X$get("data")$read()$tables()

while (!x_reader$read_complete()) {
  if (length(chunks) == 3) {
    break
  }
  chunks <- c(chunks, x_reader$read_next())
}

chunks
[[1]]
Table
2097152 rows x 3 columns
$soma_dim_0 <int64 not null>
$soma_dim_1 <int64 not null>
$soma_data <float not null>

[[2]]
Table
2097152 rows x 3 columns
$soma_dim_0 <int64 not null>
$soma_dim_1 <int64 not null>
$soma_data <float not null>

[[3]]
Table
2097152 rows x 3 columns
$soma_dim_0 <int64 not null>
$soma_dim_1 <int64 not null>
$soma_data <float not null>

This approach is particularly useful when working with large arrays that may not fit into memory all at once. For smaller arrays that comfortably fit into memory, however, the concat method automatically loads all chunks and concatenates them into a single Arrow Table.

  • Python
  • R

Use .concat() to load the entirety of the obs array as an Arrow Table, which is then converted to a pandas DataFrame.

experiment.obs.read().concat().to_pandas()
soma_joinid cell_id organ_tissue method donor anatomical_information n_counts_UMIs n_genes cell_ontology_class free_annotation manually_annotated compartment gender
0 0 AAACCCACACTCCTGT_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 7633.0 2259 macrophage Monocyte/Macrophage True immune male
1 1 AAACGAAGTACCAGAG_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 2858.0 1152 monocyte Monocyte True immune male
2 2 AAAGAACAGCCTCTTC_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 10395.0 2598 macrophage Monocyte/Macrophage True immune male
3 3 AAAGAACGTAGCACAG_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 6610.0 2125 liver dendritic cell Dendritic cell True immune male
4 4 AAAGAACGTTTCTTAC_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 9387.0 2345 macrophage Monocyte/Macrophage True immune male
... ... ... ... ... ... ... ... ... ... ... ... ... ...
264819 264819 TSP2_Vasculature_aorta_SS2_B113343_B133091_Imm... Vasculature smartseq2 TSP2 aorta 37347.0 395 macrophage macrophage True immune female
264820 264820 TSP2_Vasculature_aorta_SS2_B113343_B133091_Imm... Vasculature smartseq2 TSP2 aorta 111047.0 769 macrophage macrophage True immune female
264821 264821 TSP2_Vasculature_aorta_SS2_B113343_B133091_Imm... Vasculature smartseq2 TSP2 aorta 140634.0 2468 macrophage macrophage True immune female
264822 264822 TSP2_Vasculature_aorta_SS2_B113343_B133091_Imm... Vasculature smartseq2 TSP2 aorta 176268.0 2700 macrophage macrophage True immune female
264823 264823 TSP2_Vasculature_aorta_SS2_B113343_B133091_Imm... Vasculature smartseq2 TSP2 aorta 69025.0 982 t cell t cell True immune female

264824 rows × 13 columns

Use .concat() to load the entirety of the obs array as an Arrow Table that is then converted to a data.frame.

experiment$obs$read()$concat()$to_data_frame()
A tibble: 264824 x 13
soma_joinid cell_id organ_tissue method donor anatomical_information n_counts_UMIs n_genes cell_ontology_class free_annotation manually_annotated compartment gender
<int> <chr> <fct> <fct> <fct> <fct> <dbl> <int> <fct> <fct> <lgl> <fct> <fct>
0 AAACCCACACTCCTGT_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 7633 2259 macrophage Monocyte/Macrophage TRUE immune male
1 AAACGAAGTACCAGAG_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 2858 1152 monocyte Monocyte TRUE immune male
2 AAAGAACAGCCTCTTC_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 10395 2598 macrophage Monocyte/Macrophage TRUE immune male
3 AAAGAACGTAGCACAG_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 6610 2125 liver dendritic cell Dendritic cell TRUE immune male
4 AAAGAACGTTTCTTAC_TSP6_Liver_NA_10X_1_1 Liver 10X TSP6 nan 9387 2345 macrophage Monocyte/Macrophage TRUE immune male
... ... ... ... ... ... ... ... ... ... ... ... ...
264819 TSP2_Vasculature_aorta_SS2_B113343_B133091_Immune_P5_S365 Vasculature smartseq2 TSP2 aorta 37347 395 macrophage macrophage TRUE immune female
264820 TSP2_Vasculature_aorta_SS2_B113343_B133091_Immune_P6_S366 Vasculature smartseq2 TSP2 aorta 111047 769 macrophage macrophage TRUE immune female
264821 TSP2_Vasculature_aorta_SS2_B113343_B133091_Immune_P7_S367 Vasculature smartseq2 TSP2 aorta 140634 2468 macrophage macrophage TRUE immune female
264822 TSP2_Vasculature_aorta_SS2_B113343_B133091_Immune_P8_S368 Vasculature smartseq2 TSP2 aorta 176268 2700 macrophage macrophage TRUE immune female
264823 TSP2_Vasculature_aorta_SS2_B113343_B133091_Immune_P9_S369 Vasculature smartseq2 TSP2 aorta 69025 982 t cell t cell TRUE immune female

Select and filter data

One of the most useful features of SOMA is the ability to efficiently select and filter only the data necessary for your analysis without loading the entire dataset into memory first. The read methods offer several arguments to access a specific subset of data, depending on the type of object being read.

The most basic type of filtering is selecting a subset of records based on their coordinates using the coords argument, which is available on all array-based SOMA objects.

  • Python
  • R

This example loads the expression data for the first 100 cells and 50 genes. Because the requested slice is relatively small, the .concat() method is added to return a single Arrow Table before converting the output to a pandas DataFrame.

(
    experiment.ms["RNA"]
    .X["data"]
    .read(coords=[slice(0, 99), slice(0, 49)])
    .tables()
    .concat()
    .to_pandas()
)
soma_dim_0 soma_dim_1 soma_data
0 0 38 2.313531
1 1 36 0.000000
2 1 38 2.566711
3 2 32 0.622891
4 2 38 1.939062
... ... ... ...
174 97 38 2.798242
175 98 12 7.249046
176 98 32 0.884104
177 98 38 2.239379
178 99 38 2.808726

179 rows × 3 columns

This example loads the expression data for the first 100 cells and 50 genes. Because the requested slice is relatively small, the $concat() method is added to return a single Arrow Table before converting the output to a data.frame.

experiment$ms$get("RNA")$X$get("data")$read(
  coords = list(0:99L, 0:49L)
)$tables()$concat()$to_data_frame()
A tibble: 179 x 3
soma_dim_0 soma_dim_1 soma_data
<int> <int> <dbl>
0 38 2.3135314
1 36 0.0000000
1 38 2.5667109
2 32 0.6228909
2 38 1.9390616
... ... ...
97 38 2.7982423
98 12 7.2490458
98 32 0.8841037
98 38 2.2393794
99 38 2.8087256

For SOMADataFrame objects like obs and var, the read method provides additional arguments to filter records with value-based query conditions and to select a subset of columns to return.

Load the first 100 records from obs with more than 2,000 UMI counts, and retrieve only two columns of interest from the array.

  • Python
  • R
experiment.obs.read(
    coords=[slice(0, 99)],
    value_filter="n_counts_UMIs > 2000",
    column_names=["cell_id", "n_counts_UMIs"],
).concat().to_pandas()
cell_id n_counts_UMIs
0 AAACCCACACTCCTGT_TSP6_Liver_NA_10X_1_1 7633.0
1 AAACGAAGTACCAGAG_TSP6_Liver_NA_10X_1_1 2858.0
2 AAAGAACAGCCTCTTC_TSP6_Liver_NA_10X_1_1 10395.0
3 AAAGAACGTAGCACAG_TSP6_Liver_NA_10X_1_1 6610.0
4 AAAGAACGTTTCTTAC_TSP6_Liver_NA_10X_1_1 9387.0
... ... ...
95 ACACGCGAGCGAGTAC_TSP6_Liver_NA_10X_1_1 6029.0
96 ACACGCGAGTATGTAG_TSP6_Liver_NA_10X_1_1 8961.0
97 ACACTGAAGGTAGTAT_TSP6_Liver_NA_10X_1_1 4409.0
98 ACAGAAAAGCAATAGT_TSP6_Liver_NA_10X_1_1 6239.0
99 ACAGCCGCAGGATTCT_TSP6_Liver_NA_10X_1_1 8171.0

100 rows × 2 columns

experiment$obs$read(
  coords = 0:99L,
  value_filter = "n_counts_UMIs > 2000",
  column_names = c("cell_id", "n_counts_UMIs")
)$concat()$to_data_frame()
A tibble: 100 x 2
cell_id n_counts_UMIs
<chr> <dbl>
AAACCCACACTCCTGT_TSP6_Liver_NA_10X_1_1 7633
AAACGAAGTACCAGAG_TSP6_Liver_NA_10X_1_1 2858
AAAGAACAGCCTCTTC_TSP6_Liver_NA_10X_1_1 10395
AAAGAACGTAGCACAG_TSP6_Liver_NA_10X_1_1 6610
AAAGAACGTTTCTTAC_TSP6_Liver_NA_10X_1_1 9387
... ...
ACACGCGAGCGAGTAC_TSP6_Liver_NA_10X_1_1 6029
ACACGCGAGTATGTAG_TSP6_Liver_NA_10X_1_1 8961
ACACTGAAGGTAGTAT_TSP6_Liver_NA_10X_1_1 4409
ACAGAAAAGCAATAGT_TSP6_Liver_NA_10X_1_1 6239
ACAGCCGCAGGATTCT_TSP6_Liver_NA_10X_1_1 8171

Leverage the same filtering options on the var array to retrieve pre-calculated gene expression means and standard deviations for a set of relevant genes.

  • Python
  • R
experiment.ms["RNA"].var.read(
    value_filter="gene_symbol in ['CD19', 'CD3E', 'CD4', 'CD8A', 'CD14']",
    column_names=["ensemblid", "mean", "std", "gene_symbol"],
).concat().to_pandas()
[2024-08-15 02:21:29.735] [tiledbsoma] [Process: 246425] [Thread: 246425] [warning] [TileDB-SOMA::ManagedQuery] [51fd7f27-3d17-49d0-abc3-04efd8fb9712] Invalid column selected: feature_name
ensemblid mean std gene_symbol
0 ENSG00000153563.15 0.115993 0.462204 CD8A
1 ENSG00000170458.14 0.284372 0.698817 CD14
2 ENSG00000198851.9 0.338931 0.756713 CD3E
3 ENSG00000010610.10 0.122590 0.371681 CD4
4 ENSG00000177455.13 0.054992 0.266716 CD19
experiment$ms$get("RNA")$var$read(
  value_filter = "gene_symbol %in% c('CD19', 'CD3E', 'CD4', 'CD8A', 'CD14')",
  column_names = c("ensemblid", "gene_symbol", "mean", "std")
)$concat()$to_data_frame()
A tibble: 5 x 4
ensemblid gene_symbol mean std
<chr> <fct> <dbl> <dbl>
ENSG00000153563.15 CD8A 0.11599332 0.4622044
ENSG00000170458.14 CD14 0.28437189 0.6988171
ENSG00000198851.9 CD3E 0.33893079 0.7567127
ENSG00000010610.10 CD4 0.12258978 0.3716811
ENSG00000177455.13 CD19 0.05499229 0.2667163

For SOMASparseNDArrays such as X layers containing expression data (for this dataset) or obsm layers containing dimensionality reduction results, the read method’s filtering capabilities are limited to the coords argument.

This example loads expression data for the first 100 cells and 50 genes as a table.

  • Python
  • R
(
    experiment.ms["RNA"]
    .X["data"]
    .read(coords=[slice(0, 99), slice(0, 49)])
    .tables()
    .concat()
    .to_pandas()
)
soma_dim_0 soma_dim_1 soma_data
0 0 38 2.313531
1 1 36 0.000000
2 1 38 2.566711
3 2 32 0.622891
4 2 38 1.939062
... ... ... ...
174 97 38 2.798242
175 98 12 7.249046
176 98 32 0.884104
177 98 38 2.239379
178 99 38 2.808726

179 rows × 3 columns

experiment$ms$get("RNA")$X$get("data")$read(
  coords = list(0:99L, 0:49L)
)$tables()$concat()$to_data_frame()
A tibble: 179 x 3
soma_dim_0 soma_dim_1 soma_data
<int> <int> <dbl>
0 38 2.3135314
1 36 0.0000000
1 38 2.5667109
2 32 0.6228909
2 38 1.9390616
... ... ...
97 38 2.7982423
98 12 7.2490458
98 32 0.8841037
98 38 2.2393794
99 38 2.8087256

Experiment-level queries

The real power of the SOMA API comes from the ability to slice and filter measurement data based on the cell- and feature-level annotations stored in the experiment. For datasets containing millions of cells, this means you can easily access expression values for cells within a specific cluster or for cells that meet a certain quality threshold.

The example below shows how to filter for highly variable genes within dendritic cells.

  • Python
  • R
query = experiment.axis_query(
    measurement_name="RNA",
    obs_query=tiledbsoma.AxisQuery(
        value_filter="cell_ontology_class == 'dendritic cell'",
    ),
    var_query=tiledbsoma.AxisQuery(
        value_filter="highly_variable == True",
    ),
)
query <- experiment$axis_query(
  measurement_name = "RNA",
  obs_query = SOMAAxisQuery$new(
    value_filter = "cell_ontology_class == 'dendritic cell'"
  ),
  var_query = SOMAAxisQuery$new(
    value_filter = "highly_variable == TRUE"
  )
)

Inspect result

The returned query object allows you to inspect the query results and selectively access data. For example, you can see how many cells and genes were returned by the query:

  • Python
  • R
{"cells": query.n_obs, "genes": query.n_vars}
{'cells': 533, 'genes': 2435}
c("cells" = query$n_obs, "genes" = query$n_vars)
cells
533
genes
2435
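
The query object can also return the matching annotation records themselves. The following is a minimal Python sketch, assuming the query object created above and the query.obs() accessor from the tiledbsoma Python API:

# Load only the cell annotations that matched the query, keeping two columns of interest.
matching_obs = (
    query.obs(column_names=["cell_id", "cell_ontology_class"])
    .concat()
    .to_pandas()
)
matching_obs.head()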

Load result

Data loaded into memory from a SOMA experiment via the query object only includes records that match the specified query criteria. The following example demonstrates how to load expression values for matching cells and genes from an X layer.

  • Python
  • R

You can load the expression data for the selected cells and genes into memory as an Arrow sparse tensor.

query.X(layer_name="data").coos().concat()
<pyarrow.SparseCOOTensor>
type: float
shape: (2147483646, 2147483646)
Important

Note that the shape of the returned tensor corresponds to the capacity of the underlying TileDB array. By default, SOMA creates arrays with extra room for adding new data in the future.
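
If the nominal tensor shape is inconvenient, the same query slice can instead be materialized as an Arrow Table of coordinate/value triples, as with direct array reads. The following is a minimal Python sketch, assuming the query object from above:

# Materialize the query's X "data" layer as a single Arrow Table rather than a sparse tensor.
x_table = query.X(layer_name="data").tables().concat()
x_table.num_rows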

From here, you could use the to_sparse_matrix() method to easily load query results for any matrix-like data as a sparse dgTMatrix (from the Matrix package). At a minimum, you need to pass a collection (e.g., X, or obsm) and layer (e.g., data). You can also populate the matrix dimension names by specifying which obs column contains the values to use for row names and which var column contains the values to use for column names.

mat <- query$to_sparse_matrix(
  collection = "X",
  layer_name = "data",
  obs_index = "cell_id",
  var_index = "var_id"
)

mat[1:10, 1:5]
10 x 5 sparse Matrix of class "dgTMatrix"
                                     AL627309.6 MTCO1P12    ISG15    CCNL2 NADK
ACAAAGATCATGCAGT_TSP5_Eye_NA_10X_1_2          . .        .        .           .
ACGGAAGAGGTTGCCC_TSP5_Eye_NA_10X_1_2          . .        2.728290 .           .
AGTGACTAGCGGTAAC_TSP5_Eye_NA_10X_1_2          . .        .        .           .
ATGCCTCAGCCGAACA_TSP5_Eye_NA_10X_1_2          . .        .        .           .
CAACCAAAGTTGGCGA_TSP5_Eye_NA_10X_1_2          . .        2.816236 .           .
CATACTTTCATTGTGG_TSP5_Eye_NA_10X_1_2          . .        .        .           .
CATGCAATCGAGCTGC_TSP5_Eye_NA_10X_1_2          . .        2.300955 .           .
CATGCGGGTACTAGCT_TSP5_Eye_NA_10X_1_2          . .        .        2.834261    .
CCGAACGGTAGGGAGG_TSP5_Eye_NA_10X_1_2          . 1.926033 .        .           .
CCTGCATCACAATGAA_TSP5_Eye_NA_10X_1_2          . .        .        .           .

Toolkit interoperability

SOMA also provides support for exporting query results to various in-memory data structures used by popular single-cell analysis toolkits. As before, the results only include data that passed the specified query criteria. Unlike the query accessors shown previously, these methods must access and load multiple data elements to construct these complex objects but still offer flexibility to customize what is included in the resulting object.

  • Python
  • R

This example shows how to materialize the query results as an AnnData object, populating X with expression data from the "data" layer.

adat = query.to_anndata(X_name="data")
adat
AnnData object with n_obs × n_vars = 533 × 2435
    obs: 'soma_joinid', 'cell_id', 'organ_tissue', 'method', 'donor', 'anatomical_information', 'n_counts_UMIs', 'n_genes', 'cell_ontology_class', 'free_annotation', 'manually_annotated', 'compartment', 'gender'
    var: 'soma_joinid', 'var_id', 'gene_symbol', 'feature_type', 'ensemblid', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'

This example shows how to materialize the query results as a Seurat object, populating the RNA assay's data slot with expression data from the data layer. The obs_index and var_index arguments specify which columns in the obs and var arrays to use as column and row names, respectively.

sobj <- query$to_seurat(
  X_layers = c(data = "data"),
  obs_index = "cell_id",
  var_index = "var_id",
  obsm_layers = FALSE,
  varm_layers = FALSE
)

sobj
An object of class Seurat 
2435 features across 533 samples within 1 assay 
Active assay: RNA (2435 features, 0 variable features)
 1 layer present: data

Now that you have the data loaded in memory in a toolkit-specific format, the full suite of analysis and visualization methods provided by that toolkit is available.

  • Python
  • R

This example leverages scanpy’s plotting capabilities to visualize the distribution of the n_counts_UMIs and n_genes attributes across the cells in the query result.

sc.pl.violin(adat, ["n_counts_UMIs", "n_genes"], jitter=0.4, multi_panel=True)
Figure 1: Violin plots of n_counts_UMIs and n_genes across cells in the query result.

This example leverages Seurat’s plotting capabilities to visualize the distribution of the n_counts_UMIs and n_genes attributes across the cells in the query result.

Seurat::VlnPlot(sobj, features = c("n_counts_UMIs", "n_genes"))
Figure 2: Violin plots of n_counts_UMIs and n_genes across cells in the query result (Seurat).