import matplotlib.pyplot as plt
import scanpy as sc
import tiledb.cloud
from tiledb.cloud.soma import build_collection_mapper_workflow_graph
print(f"tiledb.cloud version: {tiledb.cloud.__version__}")
tiledb.cloud version: 0.12.19.dev1+ge058580
This functionality is currently limited to Python.
The SOMA Experiment Collection Mapper, part of the tiledb-cloud-py package, lets researchers apply the same query across multiple SOMA experiments simultaneously. In this tutorial, you will explore this feature’s key capabilities:
Experiments can be filtered on the obs or var axes, or both. Additionally, a subset of obs/var columns can be selected for inclusion in the result.
Results are returned as AnnData objects, allowing for seamless integration with the Scanpy package.
Running this tutorial requires tiledb-cloud version >=0.12.19, which is not yet released.
The Experiment Collection Mapper is available as part of the tiledb-cloud-py package. To install the package, run pip install tiledb-cloud.
To use the Experiment Collection Mapper, import its functions from the tiledb.cloud.soma submodule, as in the import block at the top of this tutorial.
The tiledb.cloud.soma submodule provides two functions:
build_collection_mapper_workflow_graph constructs and returns a TileDB Cloud task graph object based on the input parameters, which can be inspected or modified before execution.
run_collection_mapper_workflow is a convenience function that constructs and executes the task graph.
Inspecting the task graph before execution can be useful for debugging or for verifying the workflow before running it.
This tutorial uses a collection of tissue-specific datasets generated by the Tabula Sapiens consortium. Each of the 24 individual datasets has been converted into a SOMA experiment and made available on TileDB Cloud.
In this example, you will access two of the Tabula Sapiens datasets using the SOMA experiment collection mapper. The TileDB Cloud URIs for the two datasets are passed to the soma_experiment_uris argument as a dict, where the keys are the experiment names and the values are the URIs. The only other required arguments are measurement_name and X_layer_name.
graph = build_collection_mapper_workflow_graph(
    # Keys are experiment names; values are TileDB Cloud URIs.
    soma_experiment_uris={
        "Kidney": "tiledb://tiledb-inc/TS_Kidney",
        "Liver": "tiledb://tiledb-inc/TS_Liver",
    },
    measurement_name="RNA",
    X_layer_name="data",
)
[2024-07-18 20:02:50,911] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 20:02:50,913] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
This returns a TileDB Cloud DAG object, which represents the task graph.
Learn more about the DAG object in the Task Graphs section.
It’s often useful to inspect the task graph before execution, which can be done using the .visualize() method.
This simple task graph consists of two nodes, one for each experiment. The nodes are horizontally aligned, indicating that they will be executed in parallel. Hovering over a node displays the experiment name and its current status.
Use the .compute() method to execute the task graph. Following up with a call to .wait() blocks the cell until the task graph completes.
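The execute-then-block sequence described above can be wrapped in a small helper. This is a minimal sketch rather than part of the tiledb-cloud API: run_and_wait is a hypothetical name, and it assumes only that the graph object exposes the .compute() and .wait() methods mentioned in this tutorial.

```python
def run_and_wait(graph):
    """Execute a task graph and block until it has finished.

    `graph` is assumed to be the object returned by
    build_collection_mapper_workflow_graph(); only its .compute()
    and .wait() methods are used here.
    """
    # Submit the task graph for execution.
    graph.compute()
    # Block until the task graph completes.
    graph.wait()
    # Return the same graph so results can be read from it afterwards.
    return graph
```

In a notebook, calling graph.visualize() in its own cell before run_and_wait(graph) keeps the visualization on screen while the graph runs.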
TileDB-SOMA version 1.12.0 contains a performance improvement for exporting SOMA experiments to AnnData objects. If you are using an earlier version of TileDB-SOMA, consider upgrading to a more recent version.
Note that the visualization updates in real time as the task graph progresses. The color of a node changes to green when its task completes successfully. You can also monitor the progress of the task graph and inspect each node’s task in detail by navigating to the Task Graph Logs page on TileDB Cloud.
Now access the results of the task graph, which are returned as a dict keyed by the same experiment names as the input. Each value is an AnnData object containing the corresponding experiment’s data.
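Once the results dict is in hand, a quick way to check it is to list the dimensions of each AnnData object. The sketch below is illustrative only: summarize_results is a hypothetical helper, and the commented-out retrieval line assumes the DAG object provides an end_results_by_name() method, which this tutorial does not show.

```python
def summarize_results(results):
    """Map each experiment name to the (n_obs, n_vars) shape of its result.

    `results` is assumed to be a dict of AnnData objects keyed by
    experiment name; any value with a .shape attribute works.
    """
    return {name: tuple(adata.shape) for name, adata in results.items()}


# Assumption: results are read from the finished task graph like this.
# results = graph.end_results_by_name()
# print(summarize_results(results))
```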
TileDB Cloud lets you organize your assets into groups, so you can reference multiple assets with a single URI. For example, the Tabula Sapiens tissue-specific SOMA experiments are organized into a group named soma-exps-tabula-sapiens-by-tissue.
To leverage this functionality, the SOMA experiment collection mapper also supports passing a single URI pointing to a collection of SOMA experiments, which can be more convenient than specifying each experiment individually.
SOMA_COLLECTION_URI = "tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue"
graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    measurement_name="RNA",
    X_layer_name="data",
)
graph.visualize()
[2024-07-18 19:40:15,600] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:16,191] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 24 SOMA Experiment URIs
[2024-07-18 19:40:16,191] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
The experiment_names argument allows you to specify a subset of experiments to query from the collection.
graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
)
graph.visualize()
[2024-07-18 19:40:16,481] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:16,658] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:16,659] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:16,660] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
Now you will see how to use the SOMA experiment collection mapper to apply the same query to multiple experiments in parallel.
The collection mapper supports the same options for filtering SOMA experiments as the tiledbsoma.Experiment.axis_query() method. Experiments can be filtered based on attributes in the obs or var axes, or both.
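Query conditions are passed as plain strings, so they can be assembled programmatically. As a minimal sketch, the hypothetical helpers below build such strings. They assume the "column == value" and "column in [...]" forms that appear in this tutorial's examples, and they are not part of tiledb-cloud or TileDB-SOMA.

```python
def equals_condition(column, value):
    """Build a condition that matches a single value in a column."""
    # repr() quotes strings and leaves booleans and numbers unquoted.
    return f"{column} == {value!r}"


def isin_condition(column, values):
    """Build a condition that matches any of several values in a column."""
    return f"{column} in {list(values)!r}"


print(equals_condition("cell_ontology_class", "macrophage"))
# cell_ontology_class == 'macrophage'
print(equals_condition("highly_variable", True))
# highly_variable == True
print(isin_condition("cell_ontology_class", ["macrophage", "nk cell"]))
# cell_ontology_class in ['macrophage', 'nk cell']
```

The resulting strings have the same shape as the values passed to obs_query_string and var_query_string elsewhere in this tutorial.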
You can determine the attribute names that are available for filtering by inspecting the schemas of a SOMA experiment’s obs or var arrays on TileDB Cloud (for example, the TS_Kidney experiment’s obs array and the var array for the RNA measurement).
In this example, you will filter each of the specified experiments to select cells annotated as macrophages and genes with highly variable expression by passing query conditions to the obs_query_string and var_query_string arguments, respectively.
This example also uses the counts_only argument, which modifies the task graph to return only the counts of cells and genes that satisfy the query conditions. This is especially useful for preliminary exploratory analysis, and it saves time and resources by avoiding the transfer of large amounts of data.
graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
    obs_query_string="cell_ontology_class == 'macrophage'",
    var_query_string="highly_variable == True",
    counts_only=True,
)
graph.visualize()
[2024-07-18 19:40:16,859] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:17,048] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:17,049] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:17,050] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
Now retrieve the results as before.
This shows there are 1,381 macrophages in the liver and 321 in the kidney. The number of highly variable genes is 2,435 in both experiments.
Re-running this task graph with counts_only=False would return AnnData objects for each experiment containing only the cells and genes that satisfy the query conditions.
By default, the AnnData objects returned by the SOMA experiment collection mapper contain all attributes present in the obs and var arrays. However, you can specify a subset of columns to include in the output using the obs_attrs and var_attrs arguments.
graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
    obs_query_string="cell_ontology_class == 'macrophage'",
    var_query_string="highly_variable == True",
    obs_attrs=["cell_id", "cell_ontology_class"],
    var_attrs=["gene_symbol", "means", "highly_variable"],
)
graph.visualize()
[2024-07-18 19:40:24,399] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:24,588] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:24,590] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:24,593] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
Verify that the resulting AnnData objects contain only the specified obs/var columns.
{'TS_Kidney': AnnData object with n_obs × n_vars = 321 × 2435
obs: 'cell_id', 'cell_ontology_class'
var: 'gene_symbol', 'means', 'highly_variable',
'TS_Liver': AnnData object with n_obs × n_vars = 1381 × 2435
obs: 'cell_id', 'cell_ontology_class'
var: 'gene_symbol', 'means', 'highly_variable'}
The callback argument in the build_collection_mapper_workflow_graph() function allows you to apply custom functions to the AnnData objects as part of the workflow. This feature provides flexibility to incorporate additional analysis steps, such as dimensionality reduction, clustering, or differential expression analysis, directly within the task graph.
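A callback does not have to come from Scanpy. As a minimal illustration, the hypothetical function below counts cells per annotation label. It assumes the mapper calls the callback with the AnnData object as the first argument and passes the entries of args_dict as keyword arguments, which is what pairing callback=sc.pp.pca with args_dict={"copy": True} in this tutorial suggests but does not confirm.

```python
from collections import Counter


def count_cells_by_label(adata, column="cell_ontology_class"):
    """Return {label: number of cells} for one obs column.

    `adata` is expected to be an AnnData object; only adata.obs[column]
    is read, so any object with a compatible .obs works.
    """
    # Iterating over the column yields one label per cell.
    return dict(Counter(adata.obs[column]))
```

Under the same assumption, it could be supplied as callback=count_cells_by_label with args_dict={"column": "cell_ontology_class"}, in which case each experiment's result would presumably be the returned dict rather than an AnnData object.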
In this example, you will use the callback argument to perform a PCA on the selected cell types and visualize the results.
graph = build_collection_mapper_workflow_graph(
    soma_collection_uri=SOMA_COLLECTION_URI,
    experiment_names=["TS_Kidney", "TS_Liver"],
    measurement_name="RNA",
    X_layer_name="data",
    obs_query_string="cell_ontology_class in ['macrophage', 'nk cell']",
    var_query_string="highly_variable == True",
    obs_attrs=["cell_id", "cell_ontology_class"],
    var_attrs=["gene_symbol", "means", "highly_variable"],
    callback=sc.pp.pca,
    args_dict={"copy": True},
)
graph.visualize()
[2024-07-18 19:40:42,184] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieving SOMA Experiment URIs from SOMACollection tiledb://tiledb-inc/soma-exps-tabula-sapiens-by-tissue
[2024-07-18 19:40:42,408] [mapper] [build_collection_mapper_workflow_graph] [INFO] Filtering SOMA Experiment URIs for specified names
[2024-07-18 19:40:42,410] [mapper] [build_collection_mapper_workflow_graph] [INFO] Retrieved 2 SOMA Experiment URIs
[2024-07-18 19:40:42,410] [mapper] [build_collection_mapper_workflow_graph] [INFO] Constructing task graph
Retrieve the results and note the presence of the new PCA items in the obsm, varm, and uns attributes of the AnnData objects.
{'TS_Kidney': AnnData object with n_obs × n_vars = 452 × 2435
obs: 'cell_id', 'cell_ontology_class'
var: 'gene_symbol', 'means', 'highly_variable'
uns: 'pca'
obsm: 'X_pca'
varm: 'PCs',
'TS_Liver': AnnData object with n_obs × n_vars = 1626 × 2435
obs: 'cell_id', 'cell_ontology_class'
var: 'gene_symbol', 'means', 'highly_variable'
uns: 'pca'
obsm: 'X_pca'
varm: 'PCs'}
Now you can visualize the PCA results for each experiment.
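One way to do this is to loop over the results dict and draw one plot per experiment. The following is a sketch under stated assumptions: plot_pca_results is a hypothetical helper, results is assumed to be the dict of AnnData objects retrieved above, and the plot_fn parameter exists only so the plotting function can be swapped out. By default it uses Scanpy's sc.pl.pca, coloring points by the cell_ontology_class column.

```python
def plot_pca_results(results, color="cell_ontology_class", plot_fn=None):
    """Draw one PCA scatter plot per experiment and return the names plotted."""
    if plot_fn is None:
        # Imported lazily so the helper can be defined without Scanpy installed.
        import scanpy as sc

        plot_fn = sc.pl.pca
    plotted = []
    for name, adata in results.items():
        # Color points by cell type and use the experiment name as the title.
        plot_fn(adata, color=color, title=name)
        plotted.append(name)
    return plotted
```

Calling plot_pca_results(results) in a notebook cell should then produce one figure for TS_Kidney and one for TS_Liver.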
The SOMA Experiment Collection Mapper UDF is a versatile and powerful tool that enables efficient and scalable data processing across multiple SOMA experiments.