Multi-Experiment Queries

Learn how to query and process multiple SOMA experiments in parallel using the SOMA Experiment Collection Mapper.
Warning

This functionality is currently limited to Python.

Overview

The SOMA Experiment Collection Mapper, part of the tiledb-cloud-py package, allows researchers to apply the same query across multiple SOMA experiments simultaneously. In this tutorial, you will explore this feature’s key capabilities:

  • Parallel processing: Execute the same query across multiple experiments in parallel.
  • Flexible inputs: Specify the experiments to query either as a dict that maps experiment names to individual experiment URIs or as a single URI pointing to a collection of experiments.
  • Data selectivity: Extract slices of data by specifying metadata queries on the obs or var axes, or both. Additionally, a subset of obs/var columns can be selected for inclusion in the result.
  • Interoperability: Results for each experiment are returned as an AnnData object, allowing for seamless integration with the Scanpy package.
  • Callback support: Optionally provide a custom function that is applied to each experiment's query result to perform additional processing.

Getting started

Warning

Running this tutorial requires tiledb-cloud version >=0.12.19, which is not yet released.

The Experiment Collection Mapper is available as part of the tiledb-cloud-py package. To install the package, run:

pip install tiledb-cloud

To use the Experiment Collection Mapper, import build_collection_mapper_workflow_graph from the tiledb.cloud.soma submodule:

import matplotlib.pyplot as plt
import scanpy as sc
import tiledb.cloud
from tiledb.cloud.soma import build_collection_mapper_workflow_graph

print(f"tiledb.cloud version: {tiledb.cloud.__version__}")
tiledb.cloud version: 0.12.19.dev1+ge058580

The tiledb.cloud.soma submodule provides two functions:

  • build_collection_mapper_workflow_graph constructs and returns a TileDB Cloud task graph object based on the input parameters, which can be inspected or modified before execution.
  • run_collection_mapper_workflow is a convenience function that constructs and executes the task graph.
Tip

Inspecting the task graph before execution can be useful for debugging or verifying the workflow for correctness before running it.

Basic usage

This tutorial uses a collection of tissue-specific datasets generated by the Tabula Sapiens consortium. Each of the 24 individual datasets has been converted into a SOMA experiment and made available on TileDB Cloud.

In this example, you will access two of the Tabula Sapiens datasets using the SOMA experiment collection mapper. The TileDB Cloud URIs for the two datasets are passed to the soma_experiment_uris argument as a dict, where the keys are the experiment names and the values are the URIs. The only other required arguments are measurement_name and X_layer_name.

graph = build_collection_mapper_workflow_graph(
    soma_experiment_uris={
        "Kidney": "tiledb://tiledb-inc/TS_Kidney",
        "Liver": "tiledb://tiledb-inc/TS_Liver",
    },
    measurement_name="RNA",
    X_layer_name="data",
)
[2024-07-18 20:02:50,911] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 20:02:50,913] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph

This returns a TileDB Cloud DAG object, which represents the task graph.

Tip

Learn more about the DAG object in the Task Graphs section.

It’s often useful to inspect the task graph before execution, which can be done using the .visualize() method.

graph.visualize()

This simple task graph consists of two nodes, one for each experiment. The nodes are horizontally aligned, indicating that they will be executed in parallel. Hovering over a node displays the experiment name and its current status.
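
As a rough local analogy for this fan-out, the same function can be mapped over a dict of named inputs with Python's standard concurrent.futures module. This is only a sketch of the pattern, not how TileDB Cloud schedules its tasks: run_query_stub and map_over_experiments are hypothetical stand-ins, and only the dict of experiment names to URIs comes from this tutorial.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query_stub(uri: str) -> str:
    """Hypothetical stand-in for the per-experiment query; it only echoes the URI."""
    return f"queried {uri}"


def map_over_experiments(func, experiment_uris: dict) -> dict:
    """Apply func to every URI concurrently and key the results by experiment name."""
    with ThreadPoolExecutor(max_workers=len(experiment_uris) or 1) as pool:
        # Submit one independent task per experiment, mirroring one node per experiment.
        futures = {name: pool.submit(func, uri) for name, uri in experiment_uris.items()}
        # Collect results in input order; .result() blocks until that task is done.
        return {name: future.result() for name, future in futures.items()}


local_results = map_over_experiments(
    run_query_stub,
    {
        "Kidney": "tiledb://tiledb-inc/TS_Kidney",
        "Liver": "tiledb://tiledb-inc/TS_Liver",
    },
)
print(local_results)
```

Because the tasks do not depend on each other, none has to wait for another to finish, which is the property the side-by-side nodes in the visualization express.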

Use the .compute() method to execute the task graph. Following up with a call to .wait() blocks the notebook cell until the task graph completes.

Note

TileDB-SOMA version 1.12.0 contains a performance improvement for exporting SOMA experiments to AnnData objects. If you are using an earlier version of TileDB-SOMA, consider upgrading to a more recent version.

graph.compute()
graph.wait()

Note that the visualization updates in real time as the task graph progresses. The color of a node changes to green when its task completes successfully. You can also monitor the progress of the task graph and inspect each node’s task in detail by navigating to the Task Graph Logs page on TileDB Cloud.

Now access the results of the task graph with the .end_results_by_name() method, which returns a dict keyed by the same experiment names as the input. Each value is an AnnData object containing the corresponding experiment’s data.

graph.end_results_by_name()
{'Kidney': AnnData object with n_obs × n_vars = 9641 × 58870,
 'Liver': AnnData object with n_obs × n_vars = 5007 × 58870}
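
Since the result is an ordinary dict, it can be summarized with a comprehension. The sketch below uses a hypothetical helper, summarize_shapes, and simple stand-in objects that carry the shapes printed above; with real results you would pass the dict of AnnData objects, which expose the same n_obs and n_vars attributes.

```python
from types import SimpleNamespace


def summarize_shapes(results: dict) -> dict:
    """Map each experiment name to its (n_obs, n_vars) tuple."""
    return {name: (adata.n_obs, adata.n_vars) for name, adata in results.items()}


# Stand-in objects carrying the shapes printed above; real results are AnnData objects.
example_results = {
    "Kidney": SimpleNamespace(n_obs=9641, n_vars=58870),
    "Liver": SimpleNamespace(n_obs=5007, n_vars=58870),
}

shapes = summarize_shapes(example_results)
total_cells = sum(n_obs for n_obs, _ in shapes.values())
print(shapes, total_cells)
```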

Specifying a collection of experiments

TileDB Cloud lets you organize your assets into groups so that you can reference multiple assets with a single URI. For example, the Tabula Sapiens tissue-specific SOMA experiments are organized into a group named soma-exps-tabula-sapiens-by-tissue.

To leverage this functionality, the SOMA experiment collection mapper also supports passing a single URI pointing to a collection of SOMA experiments, which can be more convenient than specifying each experiment individually.

SOMA_COLLECTION_URI = "tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue"

graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI, measurement_name="RNA", X_layer_name="data"
)

graph.visualize()
[2024-07-18 19:40:15,600] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:16,191] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 24 SOMA Experiment URIs
[2024-07-18 19:40:16,191] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph

The experiment_names argument allows you to specify a subset of experiments to query from the collection.

graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
)

graph.visualize()
[2024-07-18 19:40:16,481] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:16,658] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:16,659] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:16,660] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph

Metadata filtering

Now you will see how to use the SOMA experiment collection mapper to apply the same query to multiple experiments in parallel.

The collection mapper supports the same options for filtering SOMA experiments as the tiledbsoma.Experiment.axis_query() method. Experiments can be filtered based on attributes in the obs or var axes, or both.

You can determine the attribute names that are available for filtering by inspecting a SOMA experiment’s obs or var arrays’ schemas on TileDB Cloud (for example, the TS_Kidney experiment’s obs array and the var array for the RNA measurement).

In this example, you will filter each of the specified experiments to select cells annotated as macrophages and genes with highly variable expression by passing query conditions to the obs_query_string and var_query_string arguments, respectively.

This example also leverages the counts_only argument, which modifies the task graph to only return the counts of cells and genes that satisfy the query conditions. This can be especially useful for preliminary exploratory analysis and saves time and resources by avoiding the transfer of large amounts of data.

graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
    obs_query_string="cell_ontology_class == 'macrophage'",
    var_query_string="highly_variable == True",
    counts_only=True,
)

graph.visualize()
[2024-07-18 19:40:16,859] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:17,048] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:17,049] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:17,050] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
graph.compute()
graph.wait()

Now retrieve the results as before.

graph.end_results_by_name()
{'TS_Kidney': [321, 2435], 'TS_Liver': [1381, 2435]}

This shows there are 1,381 macrophages in the liver and 321 in the kidney. The number of highly variable genes is 2,435 in both experiments.
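
The counts-only result is a plain dict of two-element lists, so simple aggregates can be computed without downloading any expression data. A minimal sketch, with the values copied from the output above and the [cells, genes] ordering taken from the interpretation given in this section:

```python
# Counts-only results copied from the output above: [n_cells, n_genes] per experiment.
counts = {"TS_Kidney": [321, 2435], "TS_Liver": [1381, 2435]}

# Total number of matching cells across all experiments.
total_cells = sum(n_cells for n_cells, _ in counts.values())
# Distinct gene counts; a single value means every experiment matched the same number of genes.
gene_counts = {n_genes for _, n_genes in counts.values()}
# Experiment with the most matching cells.
largest = max(counts, key=lambda name: counts[name][0])

print(f"{total_cells} macrophages in total; most are in {largest}")
print(f"distinct highly variable gene counts: {gene_counts}")
```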

Note

Re-running this task graph with counts_only=False would return AnnData objects for each experiment containing only the cells and genes that satisfy the query conditions.
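
The query conditions in this section are ordinary Python strings, so they can be assembled programmatically when the list of values is long. The helper below, build_in_filter, is hypothetical and not part of tiledb-cloud; it produces the same `in [...]` form that appears in the callback example later in this tutorial, and it deliberately refuses values that would need escaping instead of guessing at quoting rules.

```python
def build_in_filter(column: str, values: list) -> str:
    """Build a membership condition such as: column in ['a', 'b'].

    Hypothetical helper; it does no escaping, so values containing quotes
    or backslashes are rejected rather than silently mis-quoted.
    """
    for value in values:
        if not isinstance(value, str) or "'" in value or '"' in value or "\\" in value:
            raise ValueError(f"unsupported value: {value!r}")
    quoted = ", ".join(f"'{value}'" for value in values)
    return f"{column} in [{quoted}]"


obs_query_string = build_in_filter("cell_ontology_class", ["macrophage", "nk cell"])
print(obs_query_string)
```

The resulting string can then be passed as obs_query_string in place of a hand-written condition.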

Data selection

By default, the AnnData objects returned by the SOMA experiment collection mapper will contain all attributes present in the obs and var arrays. However, you can specify a subset of columns to include in the output using the obs_attrs and var_attrs arguments.

graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
    obs_query_string="cell_ontology_class == 'macrophage'",
    var_query_string="highly_variable == True",
    obs_attrs=["cell_id", "cell_ontology_class"],
    var_attrs=["gene_symbol", "means", "highly_variable"],
)

graph.visualize()
[2024-07-18 19:40:24,399] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:24,588] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:24,590] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:24,593] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
graph.compute()
graph.wait()

Verify the resulting AnnData objects contain only the specified obs/var columns.

results = graph.end_results_by_name()
results
{'TS_Kidney': AnnData object with n_obs × n_vars = 321 × 2435
     obs: 'cell_id', 'cell_ontology_class'
     var: 'gene_symbol', 'means', 'highly_variable',
 'TS_Liver': AnnData object with n_obs × n_vars = 1381 × 2435
     obs: 'cell_id', 'cell_ontology_class'
     var: 'gene_symbol', 'means', 'highly_variable'}
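
The same check can be done programmatically rather than by eye. This sketch assumes only that each result exposes .obs.columns and .var.columns, as AnnData objects do through pandas; unexpected_columns is a hypothetical helper, and the stand-in object mirrors the columns shown in the output above.

```python
from types import SimpleNamespace


def unexpected_columns(adata, obs_attrs, var_attrs) -> dict:
    """Return columns present in obs/var that were not requested."""
    return {
        "obs": sorted(set(adata.obs.columns) - set(obs_attrs)),
        "var": sorted(set(adata.var.columns) - set(var_attrs)),
    }


# Stand-in with the columns shown in the output above; a real AnnData exposes
# the same .obs.columns and .var.columns attributes through pandas.
fake_adata = SimpleNamespace(
    obs=SimpleNamespace(columns=["cell_id", "cell_ontology_class"]),
    var=SimpleNamespace(columns=["gene_symbol", "means", "highly_variable"]),
)

extra = unexpected_columns(
    fake_adata,
    obs_attrs=["cell_id", "cell_ontology_class"],
    var_attrs=["gene_symbol", "means", "highly_variable"],
)
print(extra)
```

Empty lists for both axes mean that no columns beyond the requested ones were returned.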

Post-processing with callbacks

The callback argument in the build_collection_mapper_workflow_graph() function allows you to apply custom functions to the AnnData objects as part of the workflow. This feature provides flexibility to incorporate additional analysis steps, such as dimensionality reduction, clustering, or differential expression analysis, directly within the task graph.
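
Conceptually, a callback is a function that receives each experiment's query result and returns whatever should become that experiment's final result. The sketch below is a local illustration and not the library's implementation: apply_callback and count_items are hypothetical, and it assumes the entries of args_dict are forwarded to the callback as keyword arguments. That assumption matters for functions that modify their input in place: Scanpy's sc.pp.pca, for instance, returns the updated AnnData object only when called with copy=True.

```python
def apply_callback(adata, callback=None, args_dict=None):
    """Hypothetical sketch: run callback(adata, **args_dict) if a callback is given."""
    if callback is None:
        # Without a callback, the query result itself is the final result.
        return adata
    return callback(adata, **(args_dict or {}))


def count_items(data, scale=1):
    """Toy callback: returns the number of items, optionally scaled."""
    return len(data) * scale


no_callback = apply_callback([1, 2, 3])
with_callback = apply_callback([1, 2, 3], callback=count_items, args_dict={"scale": 2})
print(no_callback, with_callback)
```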

In this example, you will use the callback argument to perform a PCA on the selected cell types and visualize the results.

graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
    obs_query_string="cell_ontology_class in ['macrophage', 'nk cell']",
    var_query_string="highly_variable == True",
    obs_attrs=["cell_id", "cell_ontology_class"],
    var_attrs=["gene_symbol", "means", "highly_variable"],
    callback=sc.pp.pca,
    args_dict={"copy": True},
)

graph.visualize()
[2024-07-18 19:40:42,184] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:42,408] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:42,410] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:42,410] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
graph.compute()
graph.wait()

Retrieve the results and note the presence of the new PCA items in the obsm, varm, and uns attributes of the AnnData objects.

results = graph.end_results_by_name()
results
{'TS_Kidney': AnnData object with n_obs × n_vars = 452 × 2435
     obs: 'cell_id', 'cell_ontology_class'
     var: 'gene_symbol', 'means', 'highly_variable'
     uns: 'pca'
     obsm: 'X_pca'
     varm: 'PCs',
 'TS_Liver': AnnData object with n_obs × n_vars = 1626 × 2435
     obs: 'cell_id', 'cell_ontology_class'
     var: 'gene_symbol', 'means', 'highly_variable'
     uns: 'pca'
     obsm: 'X_pca'
     varm: 'PCs'}

Now you can visualize the PCA results for each experiment.

fig, axs = plt.subplots(1, len(results), figsize=(12, 6))

for i, (name, adat) in enumerate(results.items()):
    sc.pl.pca(adat, ax=axs[i], show=False, color="cell_ontology_class")
    axs[i].set_title(name)

plt.tight_layout()
plt.show()
Figure 1: PCA of the selected cells in each experiment, colored by cell_ontology_class.

Conclusion

The SOMA Experiment Collection Mapper applies a single query, and optionally a callback, to multiple SOMA experiments in parallel and returns the per-experiment results as AnnData objects or, with counts_only=True, as cell and gene counts.
