1. Structure
  2. Life Sciences
  3. Single-cell
  4. Tutorials
  5. SQL Queries
  • Home
  • What is TileDB?
  • Get Started
  • Explore Content
  • Accounts
    • Individual Accounts
      • Apply for the Free Tier
      • Profile
        • Overview
        • Cloud Credentials
        • Storage Paths
        • REST API Tokens
        • Credits
    • Organization Admins
      • Create an Organization
      • Profile
        • Overview
        • Members
        • Cloud Credentials
        • Storage Paths
        • Billing
      • API Tokens
    • Organization Members
      • Organization Invitations
      • Profile
        • Overview
        • Members
        • Cloud Credentials
        • Storage Paths
        • Billing
      • API Tokens
  • Catalog
    • Introduction
    • Data
      • Arrays
      • Tables
      • Single-Cell (SOMA)
      • Genomics (VCF)
      • Biomedical Imaging
      • Vector Search
      • Files
    • Code
      • Notebooks
      • Dashboards
      • User-Defined Functions
      • Task Graphs
      • ML Models
    • Groups
    • Marketplace
    • Search
  • Collaborate
    • Introduction
    • Organizations
    • Access Control
      • Introduction
      • Share Assets
      • Asset Permissions
      • Public Assets
    • Logging
    • Marketplace
  • Analyze
    • Introduction
    • Slice Data
    • Multi-Region Redirection
    • Notebooks
      • Launch a Notebook
      • Usage
      • Widgets
      • Notebook Image Dependencies
    • Dashboards
      • Dashboards
      • Streamlit
    • Preview
    • User-Defined Functions
    • Task Graphs
    • Serverless SQL
    • Monitor
      • Task Log
      • Task Graph Log
  • Scale
    • Introduction
    • Task Graphs
    • API Usage
  • Structure
    • Why Structure Is Important
    • Arrays
      • Introduction
      • Quickstart
      • Foundation
        • Array Data Model
        • Key Concepts
          • Storage
            • Arrays
            • Dimensions
            • Attributes
            • Cells
            • Domain
            • Tiles
            • Data Layout
            • Compression
            • Encryption
            • Tile Filters
            • Array Schema
            • Schema Evolution
            • Fragments
            • Fragment Metadata
            • Commits
            • Indexing
            • Array Metadata
            • Datetimes
            • Groups
            • Object Stores
          • Compute
            • Writes
            • Deletions
            • Consolidation
            • Vacuuming
            • Time Traveling
            • Reads
            • Query Conditions
            • Aggregates
            • User-Defined Functions
            • Distributed Compute
            • Concurrency
            • Parallelism
        • Storage Format Spec
      • Tutorials
        • Basics
          • Basic Dense Array
          • Basic Sparse Array
          • Array Metadata
          • Compression
          • Encryption
          • Data Layout
          • Tile Filters
          • Datetimes
          • Multiple Attributes
          • Variable-Length Attributes
          • String Dimensions
          • Nullable Attributes
          • Multi-Range Reads
          • Query Conditions
          • Aggregates
          • Deletions
          • Catching Errors
          • Configuration
          • Basic S3 Example
          • Basic TileDB Cloud
          • fromDataFrame
          • Palmer Penguins
        • Advanced
          • Schema Evolution
          • Advanced Writes
            • Write at a Timestamp
            • Get Fragment Info
            • Consolidation
              • Fragments
              • Fragment List
              • Consolidation Plan
              • Commits
              • Fragment Metadata
              • Array Metadata
            • Vacuuming
              • Fragments
              • Commits
              • Fragment Metadata
              • Array Metadata
          • Advanced Reads
            • Get Fragment Info
            • Time Traveling
              • Introduction
              • Fragments
              • Array Metadata
              • Schema Evolution
          • Array Upgrade
          • Backends
            • Amazon S3
            • Azure Blob Storage
            • Google Cloud Storage
            • MinIO
            • Lustre
          • Virtual Filesystem
          • User-Defined Functions
          • Distributed Compute
          • Result Estimation
          • Incomplete Queries
        • Management
          • Array Schema
          • Groups
          • Object Management
        • Performance
          • Summary of Factors
          • Dense vs. Sparse
          • Dimensions vs. Attributes
          • Compression
          • Tiling and Data Layout
          • Tuning Writes
          • Tuning Reads
      • API Reference
    • Tables
      • Introduction
      • Quickstart
      • Foundation
        • Data Model
        • Key Concepts
          • Indexes
          • Columnar Storage
          • Compression
          • Data Manipulation
          • Optimize Tables
          • ACID
          • Serverless SQL
          • SQL Connectors
          • Dataframes
          • CSV Ingestion
      • Tutorials
        • Basics
          • Ingestion with SQL
          • CSV Ingestion
          • Basic S3 Example
          • Running Locally
        • Advanced
          • Scalable Ingestion
          • Scalable Queries
      • API Reference
    • AI & ML
      • Vector Search
        • Introduction
        • Quickstart
        • Foundation
          • Data Model
          • Key Concepts
            • Vector Search
            • Vector Databases
            • Algorithms
            • Distance Metrics
            • Updates
            • Deployment Methods
            • Architecture
            • Distributed Compute
          • Storage Format Spec
        • Tutorials
          • Basics
            • Ingestion & Querying
            • Updates
            • Deletions
            • Basic S3 Example
            • Running Locally
          • Advanced
            • Versioning
            • Time Traveling
            • Consolidation
            • Distributed Compute
            • RAG LLM
            • LLM Memory
            • File Search
            • Image Search
            • Protein Search
          • Performance
        • API Reference
      • ML Models
        • Introduction
        • Quickstart
        • Foundation
          • Basics
          • Storage
          • Cloud Execution
          • Why TileDB for Machine Learning
        • Tutorials
          • Ingestion
            • Data Ingestion
              • Dense Datasets
              • Sparse Datasets
            • ML Model Ingestion
          • Management
            • Array Schema
            • Machine Learning: Groups
            • Time Traveling
    • Life Sciences
      • Single-cell
        • Introduction
        • Quickstart
        • Foundation
          • Data Model
          • Key Concepts
            • Data Structures
            • Use of Apache Arrow
            • Join IDs
            • State Management
            • TileDB Cloud URIs
          • SOMA API Specification
        • Tutorials
          • Data Ingestion
          • Bulk Ingestion Tutorial
          • Data Access
          • Distributed Compute
          • Basic S3 Example
          • Multi-Experiment Queries
          • Appending Data to a SOMA Experiment
          • Add New Measurements
          • SQL Queries
          • Running Locally
          • Shapes in TileDB-SOMA
          • Drug Discovery App
        • Spatial
          • Introduction
          • Foundation
            • Spatial Data Model
            • Data Structures
          • Tutorials
            • Spatial Data Ingestion
            • Access Spatial Data
            • Manage Coordinate Spaces
        • API Reference
      • Population Genomics
        • Introduction
        • Quickstart
        • Foundation
          • Data Model
          • Key Concepts
            • The N+1 Problem
            • Architecture
            • Arrays
            • Ingestion
            • Reads
            • Variant Statistics
            • Annotations
            • User-Defined Functions
            • Tables and SQL
            • Distributed Compute
          • Storage Format Spec
        • Tutorials
          • Basics
            • Basic Ingestion
            • Basic Queries
            • Export to VCF
            • Add New Samples
            • Deleting Samples
            • Basic S3 Example
            • Basic TileDB Cloud
          • Advanced
            • Scalable Ingestion
            • Scalable Queries
            • Query Transforms
            • Handling Large Queries
            • Annotations
              • Finding Annotations
              • Embedded Annotations
              • External Annotations
              • Annotation VCFs
              • Ingesting Annotations
            • Variant Statistics
            • Tables and SQL
            • User-Defined Functions
            • Sample Metadata
            • Split VCF
          • Performance
        • API Reference
          • Command Line Interface
          • Python API
          • Cloud API
      • Biomedical Imaging
        • Introduction
        • Foundation
          • Data Model
          • Key Concepts
            • Arrays
            • Ingestion
            • Reads
            • User Defined Functions
          • Storage Format Spec
        • Quickstart
        • Tutorials
          • Basics
            • Ingestion
            • Read
              • OpenSlide
              • TileDB-Py
          • Advanced
            • Batched Ingestion
            • Chunked Ingestion
            • Machine Learning
              • PyTorch
            • Napari
    • Files
  • API Reference
  • Self-Hosting
    • Installation
    • Upgrades
    • Administrative Tasks
    • Image Customization
      • Customize User-Defined Function Images
      • AWS ECR Container Registry
      • Customize Jupyter Notebook Images
    • Single Sign-On
      • Configure Single Sign-On
      • OpenID Connect
      • Okta SCIM
      • Microsoft Entra
  • Glossary

On this page

  • Setup
  • Dataset
  • Basic SQL query
  • Filter
  • Aggregate
  • Distributed SQL queries
  • Summary
  1. Structure
  2. Life Sciences
  3. Single-cell
  4. Tutorials
  5. SQL Queries

SQL Queries

life sciences
single cell (soma)
tutorials
python
sql
Learn how to use SQL to query TileDB-SOMA experiments.
Note

TileDB’s SQL query support is now provided by TileDB Tables. The TileDB storage engine for MariaDB, imported as tiledb.sql, and also known as MyTile or TileDB-MariaDB, is now deprecated.

How to run this tutorial

You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installations or deployment.

TileDB Cloud’s serverless SQL capability supports running SQL queries directly on TileDB arrays. This tutorial shows how to leverage this feature to interact with SOMA experiments. You will learn how to perform queries that filter, aggregate, and transform the data in a way that should be familiar to any SQL afficionados.

Setup

Start by importing the necessary packages.

  • Python
import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tiledb
import tiledb.cloud
import tiledbsoma
import tiledbsoma.io

tiledbsoma.show_package_versions()
tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.1
TileDB core version (libtiledbsoma) 2.23.1
python version                      3.9.19.final.0
OS version                          Linux 6.8.0-1013-aws

Dataset

This tutorial uses a dataset from the Tabula Sapiens consortium, which includes nearly 265,000 immune cells across various tissue types. The original H5AD file was downloaded from Figshare and converted into a SOMA experiment using the TileDB-SOMA API. The resulting SOMA experiment is hosted on TileDB Cloud (tabula-sapiens-immune).

This SOMA experiment is accessible programmatically using the following URI:

  • Python
SOMA_URI = "tiledb://TileDB-Inc/soma-exp-tabula-sapiens-immune"

Basic SQL query

Querying any of the underlying arrays within a SOMA experiment requires providing a URI, which you can retrieve with the tiledbsoma package. Start by opening the SOMA experiment:

  • Python
experiment = tiledbsoma.Experiment.open(SOMA_URI)
experiment
<Experiment 'tiledb://TileDB-Inc/soma-exp-tabula-sapiens-immune' (open for 'r') (2 items)
    'ms': 'tiledb://TileDB-Inc/f2f4d2dc-a557-490a-b923-25ca47404d67' (unopened)
    'obs': 'tiledb://TileDB-Inc/4b53d109-27aa-4e17-b02a-d61710719b7b' (unopened)>

Then retrieve the URI for the var array within the "RNA" measurement:

  • Python
var_uri = experiment.ms["RNA"].var.uri
var_uri
'tiledb://TileDB-Inc/f9763cfe-f4dd-4a23-a798-4bbee515cadb'

The tiledb.sql.execute() function enables executing SQL queries on the array without managing any underlying infrastructure. The function always returns a pandas DataFrame containing the query results.

  • Python
query = f"SELECT * FROM `{var_uri}`"

df_var = tiledb.cloud.sql.exec(query)
df_var
soma_joinid feature_name std mean feature_reference feature_biotype highly_variable dispersions_norm ensemblid dispersions feature_type feature_is_filtered means gene_id
0 0 DDX11L1 0.005574 0.000039 NCBITaxon:9606 gene 0 -0.573947 ENSG00000223972.5 0.835044 Gene Expression 0 6.398244e-05 ENSG00000223972
1 1 WASH7P 0.031731 0.001080 NCBITaxon:9606 gene 0 0.533203 ENSG00000227232.5 2.442280 Gene Expression 0 2.274395e-03 ENSG00000227232
2 2 MIR6859-1 0.005634 0.000033 NCBITaxon:9606 gene 0 -0.256874 ENSG00000278267.1 1.295335 Gene Expression 0 6.175251e-05 ENSG00000278267
3 3 MIR1302-2HG 0.008041 0.000048 NCBITaxon:9606 gene 0 0.680668 ENSG00000243485.5 2.656352 Gene Expression 0 1.372886e-04 ENSG00000243485
4 4 MIR1302-2 1.000000 0.000000 NCBITaxon:9606 gene 0 0.000000 ENSG00000284332.1 0.000000 Gene Expression 0 1.000000e-12 ENSG00000284332
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58554 58554 MT-ND6 0.741395 0.590065 NCBITaxon:9606 gene 0 0.154140 ENSG00000198695.2 2.466404 Gene Expression 0 9.634841e-01 ENSG00000198695
58555 58555 MT-TE 0.301820 0.083929 NCBITaxon:9606 gene 0 -0.044396 ENSG00000210194.1 1.603787 Gene Expression 0 1.600667e-01 ENSG00000210194
58556 58556 MT-CYB 1.104192 3.874830 NCBITaxon:9606 gene 0 -0.499747 ENSG00000198727.2 4.765751 Gene Expression 0 4.367693e+00 ENSG00000198727
58557 58557 MT-TT 0.186848 0.040580 NCBITaxon:9606 gene 0 -0.719108 ENSG00000210195.2 0.624316 Gene Expression 0 6.573967e-02 ENSG00000210195
58558 58558 MT-TP 0.501265 0.204806 NCBITaxon:9606 gene 1 1.115482 ENSG00000210196.2 2.705558 Gene Expression 0 4.385102e-01 ENSG00000210196

58559 rows × 14 columns

Filter

Filter the data using the WHERE clause in the SQL query. For instance, the following snippet filters the obs array to retrieve cell type annotations for all cells collected from Asian males.

  • Python
query = f"""
SELECT soma_joinid, cell_type
FROM `{experiment.obs.uri}`
WHERE sex = 'male' AND ethnicity = 'Asian'
"""

df_obs = tiledb.cloud.sql.exec(query)
df_obs
soma_joinid cell_type
0 63240 monocyte
1 63241 monocyte
2 63242 plasma cell
3 63243 monocyte
4 63244 hematopoietic stem cell
... ... ...
2483 70996 hematopoietic stem cell
2484 70997 hematopoietic stem cell
2485 70998 plasma cell
2486 70999 hematopoietic stem cell
2487 71000 plasma cell

2488 rows × 2 columns

Full ANSI SQL is supported, so you can use complex queries to filter the data as needed.

Aggregate

Perform data aggregations using the GROUP BY clause in the SQL query.

This example calculates the mean and standard deviation of gene expression values across the subset of cells defined by the previous obs query. A HAVING clause filters out genes with fewer than 10 contributing cells to improve the robustness of the summaries.

  • Python
query = f"""
SELECT
    soma_dim_1,
    AVG(`soma_data`) AS avg,
    STD(`soma_data`) AS sd,
    COUNT(`soma_data`) AS n_cells
FROM
    `{experiment.ms["RNA"].X["data"].uri}`
WHERE
    soma_dim_0 IN ({",".join(["?"] * len(df_obs))})
GROUP BY
    soma_dim_1
HAVING
    COUNT(`soma_dim_1`) >= 100
"""

result = tiledb.cloud.sql.exec(query, parameters=df_obs.soma_joinid.to_list())

# Join with var to get feature names
result.set_index("soma_dim_1").join(df_var.set_index("soma_joinid")[["feature_name"]])
avg sd n_cells feature_name
soma_dim_1
1 2.961967 3.619777 130 WASH7P
19 4.578913 4.211768 155 WASH9P
32 0.144629 0.282825 391 MTND1P23
33 1.099287 1.039835 704 MTND2P28
34 0.558262 0.631345 725 MTCO1P12
... ... ... ... ...
58553 2.525545 0.886761 2274 MT-ND5
58554 2.138958 0.944408 1960 MT-ND6
58555 1.095660 1.450418 403 MT-TE
58556 3.147116 0.919982 2261 MT-CYB
58558 1.076467 1.181263 446 MT-TP

11448 rows × 4 columns

Distributed SQL queries

So far, the serverless SQL examples have all been executed on a single node. However, TileDB Cloud also supports distributed SQL queries, which scale operations by executing them across multiple nodes in parallel, speeding up processing for large datasets.

Construct a task graph composed of multiple nodes (i.e., SQL queries) that can be executed in parallel.

Tip

For more information on constructing and executing serverless pipelines, visit the Task Graphs section.

To see how this works, extend the previous aggregation example to calculate the gene-wise summary statistics within each cell type. Use a task graph to distribute the computations across each of the n cell types. Start by splitting the cell soma_joinid values into groups based on the cell_type column.

  • Python
obs_ids_by_cell_type = (
    df_obs.groupby(by=["cell_type"])["soma_joinid"].apply(list).to_dict()
)

obs_ids_by_cell_type

Next, create a new task graph. This initializes an empty structure that you can populate with SQL queries.

  • Python
graph = tiledb.cloud.dag.DAG(name="SOMA Distributed SQL")

Construct a SQL query for each cell type and add it to the task graph. The task graph visualization depicts one node for each task in the graph.

  • Python
for cell_type, obs_ids in obs_ids_by_cell_type.items():
    query = f"""
    SELECT
        soma_dim_1,
        AVG(`soma_data`) AS avg,
        STD(`soma_data`) AS sd,
        COUNT(`soma_data`) AS n_cells
    FROM
        `{experiment.ms["RNA"].X["data"].uri}`
    WHERE
        soma_dim_0 IN ({",".join(["?"] * len(obs_ids))})
    GROUP BY
        soma_dim_1
    HAVING
        COUNT(`soma_dim_1`) >= 20
    """
    graph.submit_sql(query, name=cell_type, parameters=obs_ids)

# Visualize the task graph
graph.visualize()

Execute the SQL queries across all nodes by calling the task graph’s compute() method.

  • Python
graph.compute()
graph.wait()

Hover over a node to reveal its name and status. Monitor the task graph’s progress on the TileDB Cloud console at https://cloud.tiledb.com/monitor/logs/taskgraphs.

Once all nodes complete, retrieve the results. The output of the task graph is a Python dictionary where the keys are the cell type names and the values are the corresponding DataFrame objects.

  • Python
results = graph.end_results_by_name()
results
{'CD4-positive, alpha-beta T cell':        soma_dim_1       avg        sd  n_cells
 0              32  0.133643  0.284189       26
 1              33  1.382537  1.058097      100
 2              34  0.528415  0.707839      106
 3              36  1.687168  2.084468       86
 4              38  1.136198  0.426818      400
 ...           ...       ...       ...      ...
 10314       58549  2.062484  0.669752      589
 10315       58553  2.610540  0.675011      600
 10316       58554  2.095343  0.776244      523
 10317       58556  3.343383  0.557026      595
 10318       58558  0.839910  0.799526       31
 
 [10319 rows x 4 columns],
 'CD8-positive, alpha-beta T cell':       soma_dim_1       avg        sd  n_cells
 0             19  8.239135  3.302543       20
 1             33  1.243538  1.036316      111
 2             34  0.577822  0.472923      128
 3             36  1.802457  1.796604      103
 4             38  1.213721  0.416080      290
 ...          ...       ...       ...      ...
 8226       58553  2.813216  0.760483      395
 8227       58554  2.114883  0.866776      345
 8228       58555  0.443804  0.420544       24
 8229       58556  3.560591  0.641512      397
 8230       58558  0.973196  0.876201       44
 
 [8231 rows x 4 columns],
 'common myeloid progenitor': Empty DataFrame
 Columns: [soma_dim_1, avg, sd, n_cells]
 Index: [],
 'erythroid progenitor cell':       soma_dim_1       avg        sd  n_cells
 0             19  3.472178  3.574505       25
 1             32  0.081280  0.098184       27
 2             33  1.122225  0.785690       48
 3             34  0.774514  0.422862       36
 4             36  3.911781  3.052800       32
 ...          ...       ...       ...      ...
 9807       58553  2.806982  0.573276      111
 9808       58554  2.607191  0.874146      109
 9809       58555  0.523289  0.711695       23
 9810       58556  3.572571  0.700155      112
 9811       58558  0.953760  0.791196       26
 
 [9812 rows x 4 columns],
 'granulocyte':       soma_dim_1       avg        sd  n_cells
 0             33  1.119820  1.309047       28
 1             38  1.031394  0.644518       77
 2             60  1.998267  1.130845       22
 3             85  1.671773  0.765915       37
 4             89  2.430884  1.169923       23
 ...          ...       ...       ...      ...
 4634       58553  1.894526  0.830283       97
 4635       58554  1.805720  1.026236       71
 4636       58555  1.904346  1.256736       31
 4637       58556  2.200663  0.965624       96
 4638       58558  2.111483  1.682203       26
 
 [4639 rows x 4 columns],
 'hematopoietic stem cell':       soma_dim_1       avg        sd  n_cells
 0              1  1.949936  2.389550       22
 1             32  0.104822  0.160233       50
 2             33  0.932473  0.716192       57
 3             34  0.834259  0.397988       56
 4             36  4.204934  2.590890       50
 ...          ...       ...       ...      ...
 7606       58553  2.346910  0.701245       97
 7607       58554  2.096076  0.832442       95
 7608       58555  0.911631  0.620238       49
 7609       58556  3.384588  0.797257       97
 7610       58558  1.028685  0.666576       48
 
 [7611 rows x 4 columns],
 'macrophage': Empty DataFrame
 Columns: [soma_dim_1, avg, sd, n_cells]
 Index: [],
 'mature NK T cell':       soma_dim_1       avg        sd  n_cells
 0             33  1.268136  1.134177       26
 1             34  0.734393  0.385386       35
 2             36  3.382323  2.975969       32
 3             38  1.374860  0.554642       78
 4             39  1.552466  1.382432       31
 ...          ...       ...       ...      ...
 4180       58548  2.424732  0.870553      112
 4181       58549  2.167723  0.731285      116
 4182       58553  2.728925  0.598465      116
 4183       58554  2.271593  0.729967       90
 4184       58556  3.297442  0.603839      119
 
 [4185 rows x 4 columns],
 'memory B cell':      soma_dim_1       avg        sd  n_cells
 0            38  1.243343  0.280968       23
 1           164  2.873089  0.833162       25
 2           229  2.478975  0.558780       37
 3           273  1.697664  0.393687       24
 4           297  1.474779  0.418201       20
 ..          ...       ...       ...      ...
 624       58548  2.735012  0.793998       39
 625       58549  2.383036  0.568788       38
 626       58553  2.775759  0.700229       39
 627       58554  2.230461  0.741468       34
 628       58556  3.563540  0.440269       39
 
 [629 rows x 4 columns],
 'monocyte':       soma_dim_1       avg        sd  n_cells
 0              1  6.696287  3.850051       22
 1             12  7.203373  2.584734       38
 2             32  0.464909  0.452464       45
 3             33  1.804697  1.838855       60
 4             34  1.090922  1.322580       65
 ...          ...       ...       ...      ...
 8502       58553  2.070877  1.122416      308
 8503       58554  2.102582  1.153940      222
 8504       58555  2.385278  2.290614       65
 8505       58556  2.562940  1.138733      311
 8506       58558  1.614884  1.600510       62
 
 [8507 rows x 4 columns],
 'naive B cell':      soma_dim_1       avg        sd  n_cells
 0            38  1.435278  0.590953       31
 1           229  2.280316  0.623179       32
 2           297  1.373284  0.374756       21
 3           623  1.985670  0.718431       23
 4           702  2.079123  0.816144       24
 ..          ...       ...       ...      ...
 587       58548  2.754192  0.916477       39
 588       58549  2.300668  0.744150       41
 589       58553  2.838845  0.811267       41
 590       58554  2.465030  1.056854       37
 591       58556  3.694568  0.587324       41
 
 [592 rows x 4 columns],
 'neutrophil':       soma_dim_1       avg        sd  n_cells
 0             11  9.607113  1.344685       26
 1             12  9.559652  1.649547       30
 2             34  0.600170  0.439815       23
 3             38  1.117853  0.797916       34
 4            136  4.326175  3.049126       41
 ...          ...       ...       ...      ...
 1464       58553  1.264619  0.983895       82
 1465       58554  2.472665  1.695481       48
 1466       58555  2.616217  1.278954       33
 1467       58556  1.251300  0.806523       65
 1468       58558  2.790017  1.757130       23
 
 [1469 rows x 4 columns],
 'plasma cell':        soma_dim_1       avg        sd  n_cells
 0               1  0.453420  0.576711       32
 1              19  1.286371  2.287474       37
 2              32  0.024147  0.028048      165
 3              33  0.719459  0.549580      235
 4              34  0.255642  0.185376      233
 ...           ...       ...       ...      ...
 10468       58553  2.731793  0.848978      371
 10469       58554  2.055919  0.980311      368
 10470       58555  0.293131  0.305294      136
 10471       58556  3.074683  0.838316      372
 10472       58558  0.503710  0.460197      153
 
 [10473 rows x 4 columns],
 'plasmablast': Empty DataFrame
 Columns: [soma_dim_1, avg, sd, n_cells]
 Index: []}
Note

When aggregating data, empty DataFrames may appear for certain cell types if no cells meet the specified query criteria.

Concatenate the results into a single DataFrame and annotate the soma_dim_1 column with the corresponding gene names.

  • Python
result = (
    pd.concat([df.assign(cell_type=name) for name, df in results.items()])
    .set_index("soma_dim_1")
    .join(df_var.set_index("soma_joinid")[["feature_name"]])
)

result
avg sd n_cells cell_type feature_name
soma_dim_1
32 0.133643 0.284189 26 CD4-positive, alpha-beta T cell MTND1P23
33 1.382537 1.058097 100 CD4-positive, alpha-beta T cell MTND2P28
34 0.528415 0.707839 106 CD4-positive, alpha-beta T cell MTCO1P12
36 1.687168 2.084468 86 CD4-positive, alpha-beta T cell MTCO2P12
38 1.136198 0.426818 400 CD4-positive, alpha-beta T cell MTATP6P1
... ... ... ... ... ...
58553 2.731793 0.848978 371 plasma cell MT-ND5
58554 2.055919 0.980311 368 plasma cell MT-ND6
58555 0.293131 0.305294 136 plasma cell MT-TE
58556 3.074683 0.838316 372 plasma cell MT-CYB
58558 0.503710 0.460197 153 plasma cell MT-TP

66467 rows × 5 columns

Visualize the expression profile for any given gene across different cell types to compare how the gene behaves in various cellular contexts.

  • Python
GENE_SYMBOL = "MYL6"

plt.figure(figsize=(5, 6))

# Create the bar plot
sns.barplot(
    data=result.query(f"feature_name == '{GENE_SYMBOL}'"),
    x="avg",
    y="cell_type",
    hue="avg",
    palette="viridis",
    legend=False,
)

plt.tight_layout()

# Show the plot
plt.show()

Summary

This tutorial covered how to run SQL queries on TileDB-SOMA experiments using TileDB Cloud and how to scale these operations efficiently with Task Graphs. For additional examples and information on Task Graphs, explore the SOMA Distributed Compute tutorial.

Add New Measurements
Running Locally