SQL Queries

life sciences

single cell (soma)

tutorials

python

sql

Learn how to use SQL to query TileDB-SOMA experiments.

Note

TileDB’s SQL query support is now provided by TileDB Tables. The TileDB storage engine for MariaDB, imported as tiledb.sql, and also known as MyTile or TileDB-MariaDB, is now deprecated.

How to run this tutorial

You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installations or deployment.

TileDB Cloud’s serverless SQL capability supports running SQL queries directly on TileDB arrays. This tutorial shows how to leverage this feature to interact with SOMA experiments. You will learn how to perform queries that filter, aggregate, and transform the data in a way that should be familiar to any SQL afficionados.

Setup

Start by importing the necessary packages.

Python

import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tiledb
import tiledb.cloud
import tiledbsoma
import tiledbsoma.io

tiledbsoma.show_package_versions()

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.1
TileDB core version (libtiledbsoma) 2.23.1
python version                      3.9.19.final.0
OS version                          Linux 6.8.0-1013-aws

Dataset

This tutorial uses a dataset from the Tabula Sapiens consortium, which includes nearly 265,000 immune cells across various tissue types. The original H5AD file was downloaded from Figshare and converted into a SOMA experiment using the TileDB-SOMA API. The resulting SOMA experiment is hosted on TileDB Cloud (tabula-sapiens-immune).

This SOMA experiment is accessible programmatically using the following URI:

Python

SOMA_URI = "tiledb://TileDB-Inc/soma-exp-tabula-sapiens-immune"

Basic SQL query

Querying any of the underlying arrays within a SOMA experiment requires providing a URI, which you can retrieve with the tiledbsoma package. Start by opening the SOMA experiment:

Python

experiment = tiledbsoma.Experiment.open(SOMA_URI)
experiment

<Experiment 'tiledb://TileDB-Inc/soma-exp-tabula-sapiens-immune' (open for 'r') (2 items)
    'ms': 'tiledb://TileDB-Inc/f2f4d2dc-a557-490a-b923-25ca47404d67' (unopened)
    'obs': 'tiledb://TileDB-Inc/4b53d109-27aa-4e17-b02a-d61710719b7b' (unopened)>

Then retrieve the URI for the var array within the "RNA" measurement:

Python

var_uri = experiment.ms["RNA"].var.uri
var_uri

'tiledb://TileDB-Inc/f9763cfe-f4dd-4a23-a798-4bbee515cadb'

The tiledb.sql.execute() function enables executing SQL queries on the array without managing any underlying infrastructure. The function always returns a pandas DataFrame containing the query results.

Python

query = f"SELECT * FROM `{var_uri}`"

df_var = tiledb.cloud.sql.exec(query)
df_var

	soma_joinid	feature_name	std	mean	feature_reference	feature_biotype	highly_variable	dispersions_norm	ensemblid	dispersions	feature_type	feature_is_filtered	means	gene_id
0	0	DDX11L1	0.005574	0.000039	NCBITaxon:9606	gene	0	-0.573947	ENSG00000223972.5	0.835044	Gene Expression	0	6.398244e-05	ENSG00000223972
1	1	WASH7P	0.031731	0.001080	NCBITaxon:9606	gene	0	0.533203	ENSG00000227232.5	2.442280	Gene Expression	0	2.274395e-03	ENSG00000227232
2	2	MIR6859-1	0.005634	0.000033	NCBITaxon:9606	gene	0	-0.256874	ENSG00000278267.1	1.295335	Gene Expression	0	6.175251e-05	ENSG00000278267
3	3	MIR1302-2HG	0.008041	0.000048	NCBITaxon:9606	gene	0	0.680668	ENSG00000243485.5	2.656352	Gene Expression	0	1.372886e-04	ENSG00000243485
4	4	MIR1302-2	1.000000	0.000000	NCBITaxon:9606	gene	0	0.000000	ENSG00000284332.1	0.000000	Gene Expression	0	1.000000e-12	ENSG00000284332
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
58554	58554	MT-ND6	0.741395	0.590065	NCBITaxon:9606	gene	0	0.154140	ENSG00000198695.2	2.466404	Gene Expression	0	9.634841e-01	ENSG00000198695
58555	58555	MT-TE	0.301820	0.083929	NCBITaxon:9606	gene	0	-0.044396	ENSG00000210194.1	1.603787	Gene Expression	0	1.600667e-01	ENSG00000210194
58556	58556	MT-CYB	1.104192	3.874830	NCBITaxon:9606	gene	0	-0.499747	ENSG00000198727.2	4.765751	Gene Expression	0	4.367693e+00	ENSG00000198727
58557	58557	MT-TT	0.186848	0.040580	NCBITaxon:9606	gene	0	-0.719108	ENSG00000210195.2	0.624316	Gene Expression	0	6.573967e-02	ENSG00000210195
58558	58558	MT-TP	0.501265	0.204806	NCBITaxon:9606	gene	1	1.115482	ENSG00000210196.2	2.705558	Gene Expression	0	4.385102e-01	ENSG00000210196

58559 rows × 14 columns

Filter

Filter the data using the WHERE clause in the SQL query. For instance, the following snippet filters the obs array to retrieve cell type annotations for all cells collected from Asian males.

Python

query = f"""
SELECT soma_joinid, cell_type
FROM `{experiment.obs.uri}`
WHERE sex = 'male' AND ethnicity = 'Asian'
"""

df_obs = tiledb.cloud.sql.exec(query)
df_obs

	soma_joinid	cell_type
0	63240	monocyte
1	63241	monocyte
2	63242	plasma cell
3	63243	monocyte
4	63244	hematopoietic stem cell
...	...	...
2483	70996	hematopoietic stem cell
2484	70997	hematopoietic stem cell
2485	70998	plasma cell
2486	70999	hematopoietic stem cell
2487	71000	plasma cell

2488 rows × 2 columns

Full ANSI SQL is supported, so you can use complex queries to filter the data as needed.

Aggregate

Perform data aggregations using the GROUP BY clause in the SQL query.

This example calculates the mean and standard deviation of gene expression values across the subset of cells defined by the previous obs query. A HAVING clause filters out genes with fewer than 10 contributing cells to improve the robustness of the summaries.

Python

query = f"""
SELECT
    soma_dim_1,
    AVG(`soma_data`) AS avg,
    STD(`soma_data`) AS sd,
    COUNT(`soma_data`) AS n_cells
FROM
    `{experiment.ms["RNA"].X["data"].uri}`
WHERE
    soma_dim_0 IN ({",".join(["?"] * len(df_obs))})
GROUP BY
    soma_dim_1
HAVING
    COUNT(`soma_dim_1`) >= 100
"""

result = tiledb.cloud.sql.exec(query, parameters=df_obs.soma_joinid.to_list())

# Join with var to get feature names
result.set_index("soma_dim_1").join(df_var.set_index("soma_joinid")[["feature_name"]])

	avg	sd	n_cells	feature_name
soma_dim_1
1	2.961967	3.619777	130	WASH7P
19	4.578913	4.211768	155	WASH9P
32	0.144629	0.282825	391	MTND1P23
33	1.099287	1.039835	704	MTND2P28
34	0.558262	0.631345	725	MTCO1P12
...	...	...	...	...
58553	2.525545	0.886761	2274	MT-ND5
58554	2.138958	0.944408	1960	MT-ND6
58555	1.095660	1.450418	403	MT-TE
58556	3.147116	0.919982	2261	MT-CYB
58558	1.076467	1.181263	446	MT-TP

11448 rows × 4 columns

Distributed SQL queries

So far, the serverless SQL examples have all been executed on a single node. However, TileDB Cloud also supports distributed SQL queries, which scale operations by executing them across multiple nodes in parallel, speeding up processing for large datasets.

Construct a task graph composed of multiple nodes (i.e., SQL queries) that can be executed in parallel.

Tip

For more information on constructing and executing serverless pipelines, visit the Task Graphs section.

To see how this works, extend the previous aggregation example to calculate the gene-wise summary statistics within each cell type. Use a task graph to distribute the computations across each of the n cell types. Start by splitting the cell soma_joinid values into groups based on the cell_type column.

Python

obs_ids_by_cell_type = (
    df_obs.groupby(by=["cell_type"])["soma_joinid"].apply(list).to_dict()
)

obs_ids_by_cell_type

Next, create a new task graph. This initializes an empty structure that you can populate with SQL queries.

Python

graph = tiledb.cloud.dag.DAG(name="SOMA Distributed SQL")

Construct a SQL query for each cell type and add it to the task graph. The task graph visualization depicts one node for each task in the graph.

Python

for cell_type, obs_ids in obs_ids_by_cell_type.items():
    query = f"""
    SELECT
        soma_dim_1,
        AVG(`soma_data`) AS avg,
        STD(`soma_data`) AS sd,
        COUNT(`soma_data`) AS n_cells
    FROM
        `{experiment.ms["RNA"].X["data"].uri}`
    WHERE
        soma_dim_0 IN ({",".join(["?"] * len(obs_ids))})
    GROUP BY
        soma_dim_1
    HAVING
        COUNT(`soma_dim_1`) >= 20
    """
    graph.submit_sql(query, name=cell_type, parameters=obs_ids)

# Visualize the task graph
graph.visualize()

Execute the SQL queries across all nodes by calling the task graph’s compute() method.

Python

graph.compute()
graph.wait()

Hover over a node to reveal its name and status. Monitor the task graph’s progress on the TileDB Cloud console at https://cloud.tiledb.com/monitor/logs/taskgraphs.

Once all nodes complete, retrieve the results. The output of the task graph is a Python dictionary where the keys are the cell type names and the values are the corresponding DataFrame objects.

Python

results = graph.end_results_by_name()
results

{'CD4-positive, alpha-beta T cell':        soma_dim_1       avg        sd  n_cells
 0              32  0.133643  0.284189       26
 1              33  1.382537  1.058097      100
 2              34  0.528415  0.707839      106
 3              36  1.687168  2.084468       86
 4              38  1.136198  0.426818      400
 ...           ...       ...       ...      ...
 10314       58549  2.062484  0.669752      589
 10315       58553  2.610540  0.675011      600
 10316       58554  2.095343  0.776244      523
 10317       58556  3.343383  0.557026      595
 10318       58558  0.839910  0.799526       31
 
 [10319 rows x 4 columns],
 'CD8-positive, alpha-beta T cell':       soma_dim_1       avg        sd  n_cells
 0             19  8.239135  3.302543       20
 1             33  1.243538  1.036316      111
 2             34  0.577822  0.472923      128
 3             36  1.802457  1.796604      103
 4             38  1.213721  0.416080      290
 ...          ...       ...       ...      ...
 8226       58553  2.813216  0.760483      395
 8227       58554  2.114883  0.866776      345
 8228       58555  0.443804  0.420544       24
 8229       58556  3.560591  0.641512      397
 8230       58558  0.973196  0.876201       44
 
 [8231 rows x 4 columns],
 'common myeloid progenitor': Empty DataFrame
 Columns: [soma_dim_1, avg, sd, n_cells]
 Index: [],
 'erythroid progenitor cell':       soma_dim_1       avg        sd  n_cells
 0             19  3.472178  3.574505       25
 1             32  0.081280  0.098184       27
 2             33  1.122225  0.785690       48
 3             34  0.774514  0.422862       36
 4             36  3.911781  3.052800       32
 ...          ...       ...       ...      ...
 9807       58553  2.806982  0.573276      111
 9808       58554  2.607191  0.874146      109
 9809       58555  0.523289  0.711695       23
 9810       58556  3.572571  0.700155      112
 9811       58558  0.953760  0.791196       26
 
 [9812 rows x 4 columns],
 'granulocyte':       soma_dim_1       avg        sd  n_cells
 0             33  1.119820  1.309047       28
 1             38  1.031394  0.644518       77
 2             60  1.998267  1.130845       22
 3             85  1.671773  0.765915       37
 4             89  2.430884  1.169923       23
 ...          ...       ...       ...      ...
 4634       58553  1.894526  0.830283       97
 4635       58554  1.805720  1.026236       71
 4636       58555  1.904346  1.256736       31
 4637       58556  2.200663  0.965624       96
 4638       58558  2.111483  1.682203       26
 
 [4639 rows x 4 columns],
 'hematopoietic stem cell':       soma_dim_1       avg        sd  n_cells
 0              1  1.949936  2.389550       22
 1             32  0.104822  0.160233       50
 2             33  0.932473  0.716192       57
 3             34  0.834259  0.397988       56
 4             36  4.204934  2.590890       50
 ...          ...       ...       ...      ...
 7606       58553  2.346910  0.701245       97
 7607       58554  2.096076  0.832442       95
 7608       58555  0.911631  0.620238       49
 7609       58556  3.384588  0.797257       97
 7610       58558  1.028685  0.666576       48
 
 [7611 rows x 4 columns],
 'macrophage': Empty DataFrame
 Columns: [soma_dim_1, avg, sd, n_cells]
 Index: [],
 'mature NK T cell':       soma_dim_1       avg        sd  n_cells
 0             33  1.268136  1.134177       26
 1             34  0.734393  0.385386       35
 2             36  3.382323  2.975969       32
 3             38  1.374860  0.554642       78
 4             39  1.552466  1.382432       31
 ...          ...       ...       ...      ...
 4180       58548  2.424732  0.870553      112
 4181       58549  2.167723  0.731285      116
 4182       58553  2.728925  0.598465      116
 4183       58554  2.271593  0.729967       90
 4184       58556  3.297442  0.603839      119
 
 [4185 rows x 4 columns],
 'memory B cell':      soma_dim_1       avg        sd  n_cells
 0            38  1.243343  0.280968       23
 1           164  2.873089  0.833162       25
 2           229  2.478975  0.558780       37
 3           273  1.697664  0.393687       24
 4           297  1.474779  0.418201       20
 ..          ...       ...       ...      ...
 624       58548  2.735012  0.793998       39
 625       58549  2.383036  0.568788       38
 626       58553  2.775759  0.700229       39
 627       58554  2.230461  0.741468       34
 628       58556  3.563540  0.440269       39
 
 [629 rows x 4 columns],
 'monocyte':       soma_dim_1       avg        sd  n_cells
 0              1  6.696287  3.850051       22
 1             12  7.203373  2.584734       38
 2             32  0.464909  0.452464       45
 3             33  1.804697  1.838855       60
 4             34  1.090922  1.322580       65
 ...          ...       ...       ...      ...
 8502       58553  2.070877  1.122416      308
 8503       58554  2.102582  1.153940      222
 8504       58555  2.385278  2.290614       65
 8505       58556  2.562940  1.138733      311
 8506       58558  1.614884  1.600510       62
 
 [8507 rows x 4 columns],
 'naive B cell':      soma_dim_1       avg        sd  n_cells
 0            38  1.435278  0.590953       31
 1           229  2.280316  0.623179       32
 2           297  1.373284  0.374756       21
 3           623  1.985670  0.718431       23
 4           702  2.079123  0.816144       24
 ..          ...       ...       ...      ...
 587       58548  2.754192  0.916477       39
 588       58549  2.300668  0.744150       41
 589       58553  2.838845  0.811267       41
 590       58554  2.465030  1.056854       37
 591       58556  3.694568  0.587324       41
 
 [592 rows x 4 columns],
 'neutrophil':       soma_dim_1       avg        sd  n_cells
 0             11  9.607113  1.344685       26
 1             12  9.559652  1.649547       30
 2             34  0.600170  0.439815       23
 3             38  1.117853  0.797916       34
 4            136  4.326175  3.049126       41
 ...          ...       ...       ...      ...
 1464       58553  1.264619  0.983895       82
 1465       58554  2.472665  1.695481       48
 1466       58555  2.616217  1.278954       33
 1467       58556  1.251300  0.806523       65
 1468       58558  2.790017  1.757130       23
 
 [1469 rows x 4 columns],
 'plasma cell':        soma_dim_1       avg        sd  n_cells
 0               1  0.453420  0.576711       32
 1              19  1.286371  2.287474       37
 2              32  0.024147  0.028048      165
 3              33  0.719459  0.549580      235
 4              34  0.255642  0.185376      233
 ...           ...       ...       ...      ...
 10468       58553  2.731793  0.848978      371
 10469       58554  2.055919  0.980311      368
 10470       58555  0.293131  0.305294      136
 10471       58556  3.074683  0.838316      372
 10472       58558  0.503710  0.460197      153
 
 [10473 rows x 4 columns],
 'plasmablast': Empty DataFrame
 Columns: [soma_dim_1, avg, sd, n_cells]
 Index: []}

Note

When aggregating data, empty DataFrames may appear for certain cell types if no cells meet the specified query criteria.

Concatenate the results into a single DataFrame and annotate the soma_dim_1 column with the corresponding gene names.

Python

result = (
    pd.concat([df.assign(cell_type=name) for name, df in results.items()])
    .set_index("soma_dim_1")
    .join(df_var.set_index("soma_joinid")[["feature_name"]])
)

result

	avg	sd	n_cells	cell_type	feature_name
soma_dim_1
32	0.133643	0.284189	26	CD4-positive, alpha-beta T cell	MTND1P23
33	1.382537	1.058097	100	CD4-positive, alpha-beta T cell	MTND2P28
34	0.528415	0.707839	106	CD4-positive, alpha-beta T cell	MTCO1P12
36	1.687168	2.084468	86	CD4-positive, alpha-beta T cell	MTCO2P12
38	1.136198	0.426818	400	CD4-positive, alpha-beta T cell	MTATP6P1
...	...	...	...	...	...
58553	2.731793	0.848978	371	plasma cell	MT-ND5
58554	2.055919	0.980311	368	plasma cell	MT-ND6
58555	0.293131	0.305294	136	plasma cell	MT-TE
58556	3.074683	0.838316	372	plasma cell	MT-CYB
58558	0.503710	0.460197	153	plasma cell	MT-TP

66467 rows × 5 columns

Visualize the expression profile for any given gene across different cell types to compare how the gene behaves in various cellular contexts.

Python

GENE_SYMBOL = "MYL6"

plt.figure(figsize=(5, 6))

# Create the bar plot
sns.barplot(
    data=result.query(f"feature_name == '{GENE_SYMBOL}'"),
    x="avg",
    y="cell_type",
    hue="avg",
    palette="viridis",
    legend=False,
)

plt.tight_layout()

# Show the plot
plt.show()

Summary

This tutorial covered how to run SQL queries on TileDB-SOMA experiments using TileDB Cloud and how to scale these operations efficiently with Task Graphs. For additional examples and information on Task Graphs, explore the SOMA Distributed Compute tutorial.