

Shapes in TileDB-SOMA

Array shapes in the TileDB-SOMA data model.

The TileDB-SOMA team is proud to support an intuitive and extensible notion of shape with the release of TileDB-SOMA 1.15.

In this notebook, you will learn how to use shapes for the dataframes and arrays within your SOMA experiments, when and how you can resize, and options for experiments created in TileDB-SOMA versions before 1.15.

The dataset used is from Peripheral Blood Mononuclear Cells (PBMC3K), which is freely available from 10X Genomics.

The shape feature

As in other tutorials in this series, you’ll see that the SOMA data model carries over many familiar concepts from AnnData. This includes the ability to ask component dataframes and arrays what their shapes are.

First, import the necessary libraries and open an example experiment.

This is data ingested to TileDB-SOMA from PBMC3K.

  • Python
  • R
import tiledbsoma

uri = "tiledb://TileDB-Inc/shapes-example-processed"
exp = tiledbsoma.Experiment.open(uri)
library(tiledbsoma)

uri <- "tiledb://TileDB-Inc/shapes-example-processed"
exp <- SOMAExperimentOpen(uri)

The obs dataframe has a domain, which is a soft limit on what values you may write to it. You’ll get an exception like Query: A range was set outside of the current domain if you try to read or write soma_joinid values outside this range. This is an important data-integrity reassurance.

The domain seen here matches with the data populated inside of it. This will usually be the case, unless you created the dataframe but haven’t written any data to it yet. In that case, it’s empty, but it still has a domain.

If you have more data (more cells) to add to the experiment later, you will be able to resize obs, up to the maxdomain, which is a hard limit.

  • Python
  • R
exp.obs.domain
((0, 2637),)
exp$obs$domain()
$soma_joinid =
  1. 0
  2. 2637
  • Python
  • R
exp.obs.maxdomain
((0, 9223372036854773758),)
exp$obs$maxdomain()
$soma_joinid
integer64
[1] 0                   9223372036854773758
  • Python
  • R
exp.obs.read().concat().to_pandas()
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ... ... ... ...
2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

2638 rows × 6 columns

as.data.frame(exp$obs$read()$concat())
A data.frame: 2638 × 6
soma_joinid obs_id n_genes percent_mito n_counts louvain
<int> <chr> <int> <dbl> <dbl> <fct>
0 AAACATACAACCAC-1 781 0.030177759 2419 CD4 T cells
1 AAACATTGAGCTAC-1 1352 0.037935957 4903 B cells
2 AAACATTGATCAGC-1 1131 0.008897362 3147 CD4 T cells
3 AAACCGTGCTTCCG-1 960 0.017430846 2639 CD14+ Monocytes
4 AAACCGTGTATGCG-1 522 0.012244898 980 NK cells
⋮ ⋮ ⋮ ⋮ ⋮ ⋮
2633 TTTCGAACTCTCAT-1 1155 0.021104366 3459 CD14+ Monocytes
2634 TTTCTACTGAGGCA-1 1227 0.009294220 3443 B cells
2635 TTTCTACTTCCTCG-1 622 0.021971496 1684 B cells
2636 TTTGCATGAGAGGC-1 454 0.020547945 1022 B cells
2637 TTTGCATGCCTCAC-1 724 0.008064516 1984 CD4 T cells
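
To make the difference between the soft limit (domain) and the hard limit (maxdomain) concrete, here is a minimal sketch in plain Python. It makes no TileDB-SOMA calls: `fits_within_maxdomain` is a hypothetical helper, and the resize targets are illustrative values, not taken from the dataset.

```python
from typing import Tuple

Bounds = Tuple[int, int]


def fits_within_maxdomain(requested: Bounds, maxdomain: Bounds) -> bool:
    """Return True if the requested inclusive bounds stay inside the hard limit."""
    req_lower, req_upper = requested
    max_lower, max_upper = maxdomain
    return max_lower <= req_lower <= req_upper <= max_upper


# maxdomain for obs as reported above.
obs_maxdomain = (0, 9223372036854773758)

# Hypothetical resize targets, chosen only for illustration.
print(fits_within_maxdomain((0, 9_999), obs_maxdomain))      # True
print(fits_within_maxdomain((0, 2**63 - 1), obs_maxdomain))  # False
```

Any new domain you request through a resize has to pass a check of this kind against maxdomain; the current domain can sit anywhere below that ceiling.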

You’ll learn more about experiment-level resizes throughout this tutorial, as well as in the tutorial on TileDB-SOMA’s append mode.

The var dataframe’s domain is similar:

  • Python
  • R
var = exp.ms["RNA"].var
var.domain
((0, 1837),)
var <- exp$ms$get("RNA")$var
var$domain()
$soma_joinid =
  1. 0
  2. 1837
  • Python
  • R
var.maxdomain
((0, 9223372036854773968),)
var$maxdomain()
$soma_joinid
integer64
[1] 0                   9223372036854773968

Likewise, the N-dimensional arrays within the experiment have their shapes as well.

An important difference: while the dataframe domain gives you the inclusive lower and upper bounds for soma_joinid writes, the shape for the N-dimensional arrays is the upper bound plus 1.

Since there are 2638 cells and 1838 genes here, X’s shape reflects that.

  • Python
  • R
exp.obs.domain
((0, 2637),)
exp$obs$domain()
$soma_joinid =
  1. 0
  2. 2637
  • Python
  • R
exp.ms["RNA"].var.domain
((0, 1837),)
exp$ms$get("RNA")$var$domain()
$soma_joinid =
  1. 0
  2. 1837
  • Python
  • R
exp.ms["RNA"].X["data"].shape
(2638, 1838)
exp$ms$get("RNA")$X$get("data")$shape()
integer64
[1] 2638 1838
  • Python
  • R
exp.ms["RNA"].X["data"].maxshape
(9223372036854773759, 9223372036854773759)
exp$ms$get("RNA")$X$get("data")$maxshape()
integer64
[1] 9223372036854773759 9223372036854773759
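
The relationship between dataframe domains (inclusive bounds) and array shapes (upper bound plus 1) can be sketched in a few lines of plain Python. The helper name `shape_from_domains` is hypothetical and isn't part of the TileDB-SOMA API; the input values are the obs and var domains shown above.

```python
from typing import Tuple

Bounds = Tuple[int, int]


def shape_from_domains(*domains: Bounds) -> Tuple[int, ...]:
    """Convert inclusive (lower, upper) bounds, one pair per dimension,
    into a shape: the upper bound plus 1 for each dimension."""
    return tuple(upper + 1 for _lower, upper in domains)


# Values copied from the obs and var domains shown above.
obs_domain = (0, 2637)
var_domain = (0, 1837)

x_shape = shape_from_domains(obs_domain, var_domain)
print(x_shape)  # (2638, 1838)
```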

The other N-dimensional arrays are similar:

  • Python
  • R
obsm = exp.ms["RNA"].obsm
list(obsm.keys())
['X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap']
obsm <- exp$ms$get("RNA")$obsm
obsm$names()
  1. 'X_draw_graph_fr'
  2. 'X_pca'
  3. 'X_tsne'
  4. 'X_umap'
  • Python
  • R
obsp = exp.ms["RNA"].obsp
list(obsp.keys())
['connectivities', 'distances']
obsp <- exp$ms$get("RNA")$obsp
obsp$names()
  1. 'connectivities'
  2. 'distances'
  • Python
  • R
[
    obsm["X_pca"].shape,
    obsm["X_pca"].maxshape,
]
[(2638, 50), (9223372036854773759, 9223372036854773759)]
list(
  obsm$get("X_pca")$shape(),
  obsm$get("X_pca")$maxshape()
)
[[1]]
integer64
[1] 2638 50  

[[2]]
integer64
[1] 9223372036854773759 9223372036854773759
  • Python
  • R
[
    obsp["distances"].shape,
    obsp["distances"].maxshape,
]
[(2638, 2638), (9223372036854773759, 9223372036854773759)]
list(
  obsp$get("distances")$shape(),
  obsp$get("distances")$maxshape()
)
[[1]]
integer64
[1] 2638 2638

[[2]]
integer64
[1] 9223372036854773759 9223372036854773759

In particular, the X array in this experiment — and in most experiments — is sparse. That means the matrix doesn’t need a number in every row or cell. Still, the shape serves as a soft limit for reads and writes: you’ll get an exception trying to read or write outside of these bounds. (Specifically, the message you’ll see is Query: A range was set outside of the current domain.)
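
As a rough model of that soft limit, the following plain-Python sketch checks a cell coordinate against a shape. It is only an illustration: `check_coords_within_shape` is a hypothetical helper, and the real library raises its own exception with the message quoted above rather than a `ValueError`.

```python
from typing import Sequence, Tuple


def check_coords_within_shape(
    coords: Sequence[int], shape: Tuple[int, ...]
) -> None:
    """Raise ValueError if any coordinate falls outside 0..shape[i] - 1."""
    if len(coords) != len(shape):
        raise ValueError("coordinate rank does not match array rank")
    for dim, (coord, extent) in enumerate(zip(coords, shape)):
        if not 0 <= coord < extent:
            raise ValueError(
                f"soma_dim_{dim}: {coord} is outside the valid range 0..{extent - 1}"
            )


x_shape = (2638, 1838)  # shape of X["data"] shown above

check_coords_within_shape((2637, 1837), x_shape)  # last valid cell: no error

try:
    check_coords_within_shape((2638, 0), x_shape)  # one row past the end
except ValueError as err:
    print(err)
```

A sparse array may leave most cells inside these bounds empty; the check only concerns where a cell is allowed to be, not whether it holds a value.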

As a convenience, you can see all the experiment’s objects’ shapes at once as follows:

  • Python
  • R
import tiledbsoma.io

tiledbsoma.io.show_experiment_shapes(exp.uri)
[DataFrame] obs
  URI tiledb://TileDB-Inc/4e63acce-71cc-4d42-96b8-0815bf7fc497
  non_empty_domain     ((0, 2637),)
  domain               ((0, 2637),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[DataFrame] ms/RNA/var
  URI tiledb://TileDB-Inc/95998d1a-82f9-4555-adc9-dfdee2f057f0
  non_empty_domain     ((0, 1837),)
  domain               ((0, 1837),)
  maxdomain            ((0, 9223372036854773968),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data
  URI tiledb://TileDB-Inc/68acd3b3-fb31-4089-8242-f72f35288ab6
  used_shape           ((0, 2637), (0, 1837))
  shape                (2638, 1838)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True

...

[SparseNDArray] ms/RNA/obsm/X_pca
  URI tiledb://TileDB-Inc/e147bdff-4066-45ca-90d3-e0041ee4259b
  used_shape           ((0, 2637), (0, 49))
  shape                (2638, 50)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True

  ...

[SparseNDArray] ms/RNA/obsp/distances
  URI tiledb://TileDB-Inc/b37fb332-6e31-4a08-8138-272f196081d9
  used_shape           ((0, 2637), (0, 2637))
  shape                (2638, 2638)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True

[SparseNDArray] ms/RNA/varm/PCs
  URI tiledb://TileDB-Inc/7b2849bb-5804-469c-95e1-c5bf52aa6266
  used_shape           ((0, 1837), (0, 49))
  shape                (1838, 50)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True

  ...

(Not currently implemented in R.)

As with AnnData, as a general rule you’ll see the following:

  • An X array’s shape is nobs x nvar.
  • An obsm array’s shape is nobs x some number, maybe 50.
  • An obsp array’s shape is nobs x nobs.
  • A varm array’s shape is nvar x some number, maybe 50.
  • A varp array’s shape is nvar x nvar.
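
The rules in this list can be written down as a small lookup. This is a sketch in plain Python with a hypothetical helper name, `expected_shapes`; the number of components (50) matches the X_pca and PCs arrays shown earlier but will differ for other embeddings.

```python
from typing import Dict, Tuple


def expected_shapes(
    nobs: int, nvar: int, n_components: int
) -> Dict[str, Tuple[int, int]]:
    """Expected shapes of the arrays in a measurement, following the rules above."""
    return {
        "X": (nobs, nvar),
        "obsm": (nobs, n_components),
        "obsp": (nobs, nobs),
        "varm": (nvar, n_components),
        "varp": (nvar, nvar),
    }


# nobs and nvar for the processed PBMC3K experiment used in this tutorial.
shapes = expected_shapes(nobs=2638, nvar=1838, n_components=50)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```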

When and how to resize at the experiment level

The primary reason you’d resize a dataframe or an array within an experiment is to append more data. For example, say you have an experiment with the results of Monday’s lab run on a sample of 100,000 cells. Then maybe on Tuesday, you’ll want to add that day’s lab run of another 70,000 cells to the same experiment, for a new total of 170,000 cells. It’s also possible that Tuesday’s data might include some infrequently expressed genes that didn’t appear in Monday’s data.

Because the shapes are soft limits, and reading or writing beyond them results in an exception, you’d need to resize the experiment so that its dataframes and arrays accommodate the new nobs = 170,000.

Visit the append-mode tutorial for information on how to resize experiments by using tiledbsoma.io.register_anndatas and tiledbsoma.io.resize_experiment.

While you can resize each dataframe and array in the experiment one at a time (refer to Advanced usage), the most common case is tiledbsoma.io.resize_experiment, which exists to make this quick and convenient.

Note

resize_experiment is available only in Python, because the append-mode feature currently exists only in Python.
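
The arithmetic behind an experiment-level resize can be sketched without touching any data. The following plain-Python example uses the Monday and Tuesday cell counts from the scenario above; the gene counts (20,000 existing, 150 new) are made-up values for illustration, and `shapes_after_append` is a hypothetical helper, not a TileDB-SOMA function.

```python
from typing import Dict, Tuple


def shapes_after_append(
    nobs: int, added_cells: int, nvar: int, added_genes: int = 0
) -> Dict[str, Tuple[int, ...]]:
    """Compute the domains and shapes an experiment needs after an append."""
    new_nobs = nobs + added_cells
    new_nvar = nvar + added_genes
    return {
        "obs_domain": (0, new_nobs - 1),  # inclusive bounds for soma_joinid
        "var_domain": (0, new_nvar - 1),
        "X_shape": (new_nobs, new_nvar),  # upper bound plus 1 per dimension
        "obsp_shape": (new_nobs, new_nobs),
    }


# Monday: 100,000 cells. Tuesday adds 70,000 more.
# The gene counts are hypothetical.
target = shapes_after_append(
    nobs=100_000, added_cells=70_000, nvar=20_000, added_genes=150
)
print(target["obs_domain"])  # (0, 169999)
print(target["X_shape"])     # (170000, 20150)
```

An experiment-level resize applies new sizes like these to every dataframe and array at once, which is why it is usually more convenient than resizing each object individually.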

How to upgrade older experiments

Experiments created by TileDB-SOMA 1.15 and later will look as shown previously. The following code block shows an experiment created using TileDB-SOMA 1.14.5. This is the same PBMC3K dataset as before, except it’s the unprocessed version: this has fewer component arrays, which keeps the display here more compact.

Note

Experiment-level upgrade is applicable only to the TileDB-SOMA Python API. This is because TileDB-SOMA experiments created in R before TileDB-SOMA 1.15 have their array shape already equal to maxshape, so these can’t be expanded further.

  • Python
  • R
import tiledbsoma.io

uri = "tiledb://TileDB-Inc/shapes-example-pre-1.15-not-upgraded"
pre_115_exp = tiledbsoma.Experiment.open(uri)
uri <- "tiledb://TileDB-Inc/shapes-example-pre-1.15-not-upgraded"
pre_115_exp <- SOMAExperimentOpen(uri)

Compare the shapes from before TileDB-SOMA 1.15 to TileDB-SOMA 1.15:

  • Python
  • R
pre_115_exp.obs.domain
((0, 2147483646),)
pre_115_exp$obs$domain()
$soma_joinid =
  1. 0
  2. 2147483646
  • Python
  • R
pre_115_exp.obs.maxdomain
((0, 2147483646),)
pre_115_exp$obs$maxdomain()
$soma_joinid =
  1. 0
  2. 2147483646
  • Python
  • R
pre_115_exp.obs.tiledbsoma_has_upgraded_domain
False
pre_115_exp$obs$tiledbsoma_has_upgraded_domain()
FALSE
  • Python
  • R
[
    pre_115_exp.ms["RNA"].X["data"].shape,
    pre_115_exp.ms["RNA"].X["data"].maxshape,
    pre_115_exp.ms["RNA"].X["data"].tiledbsoma_has_upgraded_shape,
]
[(2147483646, 2147483646), (2147483646, 2147483646), False]
X <- pre_115_exp$ms$get("RNA")$X$get("data")
list(
  X$shape(),
  X$maxshape(),
  X$tiledbsoma_has_upgraded_shape()
)
[[1]]
integer64
[1] 2147483646 2147483646

[[2]]
integer64
[1] 2147483646 2147483646

[[3]]
[1] FALSE

Note that for the pre-1.15 experiment, the shape is as large as the maxshape, and tiledbsoma_has_upgraded_domain and tiledbsoma_has_upgraded_shape both report False.

To make the old experiment look like the new experiment, call upgrade_experiment_shapes, and reopen.

For purposes of this document, we show the results of having done that.

Note that show_experiment_shapes and upgrade_experiment_shapes are currently only implemented in Python.

Before upgrading:

tiledbsoma.io.show_experiment_shapes(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded"
)
[DataFrame] obs
  URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
  non_empty_domain     ((0, 2699),)
  domain               ((0, 2147483646),)
  maxdomain            ((0, 2147483646),)
  upgraded             False

[DataFrame] ms/RNA/var
  URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
  non_empty_domain     ((0, 13713),)
  domain               ((0, 2147483646),)
  maxdomain            ((0, 2147483646),)
  upgraded             False

[SparseNDArray] ms/RNA/X/data
  URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
  used_shape           ((0, 2699), (0, 13713))
  shape                (2147483646, 2147483646)
  maxshape             (2147483646, 2147483646)
  upgraded             False
True

Applying the upgrade:

tiledbsoma.io.upgrade_experiment_shapes(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", verbose=True
)
[DataFrame] obs
  URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
  Applying tiledbsoma_upgrade_soma_joinid_shape(2700)

[DataFrame] ms/RNA/var
  URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
  Applying tiledbsoma_upgrade_soma_joinid_shape(13714)

[SparseNDArray] ms/RNA/X/data
  URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
  Applying tiledbsoma_upgrade_shape((2700, 13714))
True

After the upgrade:

tiledbsoma.io.show_experiment_shapes("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
[DataFrame] obs
  URI tiledb://TileDB-Inc/85bdf23b-e0fe-4494-9012-c9102fc6be90
  non_empty_domain     ((0, 2699),)
  domain               ((0, 2699),)
  maxdomain            ((0, 2147483646),)
  upgraded             True

[DataFrame] ms/RNA/var
  URI tiledb://TileDB-Inc/45bd8385-dd82-40f6-a428-2a85c8626afe
  non_empty_domain     ((0, 13713),)
  domain               ((0, 13713),)
  maxdomain            ((0, 2147483646),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data
  URI tiledb://TileDB-Inc/b714d8f6-9283-4191-8e2d-9b41c4007ee1
  used_shape           ((0, 2699), (0, 13713))
  shape                (2700, 13714)
  maxshape             (2147483646, 2147483646)
  upgraded             True
  • Python
  • R
pre_115_exp = tiledbsoma.open("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
pre_115_exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
  • Python
  • R
[
    pre_115_exp.ms["RNA"].X["data"].shape,
    pre_115_exp.ms["RNA"].X["data"].maxshape,
    pre_115_exp.ms["RNA"].X["data"].tiledbsoma_has_upgraded_shape,
]
[(2700, 13714), (2147483646, 2147483646), True]
X <- pre_115_exp$ms$get("RNA")$X$get("data")
list(
  X$shape(),
  X$maxshape(),
  X$tiledbsoma_has_upgraded_shape()
)
[[1]]
integer64
[1] 2700  13714

[[2]]
integer64
[1] 2147483646 2147483646

[[3]]
[1] TRUE

To run a pre-check, you can do the following:

  • Python
  • R
tiledbsoma.io.upgrade_experiment_shapes(the_uri, check_only=True)

(Not currently implemented in R.)

This won’t change anything. It’ll only tell you if the operation will be possible.
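
To illustrate what the upgrade changes and what a check-only run leaves alone, here is a plain-Python model of a single array’s shape report. It is a sketch, not the library’s implementation: `upgrade_record` is a hypothetical function, and the feasibility rule (not yet upgraded, and the new shape fits within maxshape) is an assumption made for this example.

```python
from typing import Any, Dict, Tuple


def upgrade_record(
    record: Dict[str, Any], check_only: bool = False
) -> Tuple[bool, Dict[str, Any]]:
    """Model the upgrade of one array's shape report.

    Returns (possible, resulting_record). With check_only=True the
    record is returned unchanged.
    """
    # New shape: upper bound of the used region plus 1, per dimension.
    new_shape = tuple(upper + 1 for _lower, upper in record["used_shape"])
    # Assumed feasibility rule for this sketch.
    possible = (not record["upgraded"]) and all(
        new <= limit for new, limit in zip(new_shape, record["maxshape"])
    )
    if check_only or not possible:
        return possible, record
    return True, {**record, "shape": new_shape, "upgraded": True}


# Values copied from the "before upgrading" report for ms/RNA/X/data.
before = {
    "used_shape": ((0, 2699), (0, 13713)),
    "shape": (2147483646, 2147483646),
    "maxshape": (2147483646, 2147483646),
    "upgraded": False,
}

possible, unchanged = upgrade_record(before, check_only=True)
print(possible, unchanged["shape"])  # True (2147483646, 2147483646)

_, after = upgrade_record(before)
print(after["shape"], after["upgraded"])  # (2700, 13714) True
```

The resulting shape matches the "after the upgrade" report above, while maxshape stays as it was.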

Advanced usage

Dataframes with non-standard index columns

In the SOMA data model, the SparseNDArray and DenseNDArray objects always have int64 dimensions named soma_dim_0, soma_dim_1, and up, and they have a numeric soma_data attribute for the contents of the array.

  • Python
  • R
exp.ms["RNA"].X["data"].schema
soma_dim_0: int64 not null
soma_dim_1: int64 not null
soma_data: float not null
X$schema()
Schema
soma_dim_0: int64 not null
soma_dim_1: int64 not null
soma_data: double not null

For dataframes, though, while there must be a soma_joinid column of type int64, you can have additional index columns, or soma_joinid may be a non-index column.

In the most common case, where soma_joinid is the only index column, you can think of a dataframe as having a shape just as the N-dimensional arrays do.

  • Python
  • R
exp.obs.schema
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
exp$obs$schema()
Schema
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32>
  • Python
  • R
exp.obs.index_column_names
('soma_joinid',)
exp$obs$index_column_names()
'soma_joinid'

That being said, dataframes are capable of more than that, via the index-column names you specify at creation time.

Create some dataframes, with the same data, but different choices of index-column names.

  • Python
  • R
import tempfile

sdfuri1 = tempfile.mktemp()
sdfuri2 = tempfile.mktemp()
sdfuri1 <- tempfile()
sdfuri2 <- tempfile()
  • Python
  • R
import pyarrow as pa

schema = pa.schema(
    [
        ("soma_joinid", pa.int64()),
        ("mystring", pa.string()),
        ("myint", pa.int32()),
        ("myfloat", pa.float32()),
    ]
)

data = pa.Table.from_pydict(
    {
        "soma_joinid": [0, 1],
        "mystring": ["hello", "world"],
        "myint": [33, 44],
        "myfloat": [4.5, 5.5],
    }
)
library(arrow)

schema <- arrow::schema(
  arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
  arrow::field("mystring", arrow::large_utf8(), nullable = FALSE),
  arrow::field("myint", arrow::int32(), nullable = FALSE),
  arrow::field("myfloat", arrow::float32(), nullable = FALSE)
)

data <- arrow::arrow_table(
  soma_joinid = c(0, 1),
  mystring = c("hello", "world"),
  myint = c(33, 44),
  myfloat = c(4.5, 5.5)
)
  • Python
  • R
with tiledbsoma.DataFrame.create(
    sdfuri1,
    schema=schema,
    index_column_names=["soma_joinid", "mystring"],
    domain=[(0, 9), None],
) as sdf1:
    sdf1.write(data)
sdf1 <- SOMADataFrameCreate(
  sdfuri1,
  schema = schema,
  index_column_names = c("soma_joinid", "mystring"),
  domain = list(soma_joinid = c(0, 9), mystring = NULL)
)
sdf1$write(data)
sdf1$close()

Now inspect the domain and maxdomain for these dataframes.

  • Python
  • R
sdf1 = tiledbsoma.DataFrame.open(sdfuri1)
sdf1 <- SOMADataFrameOpen(sdfuri1)
  • Python
  • R
sdf1.index_column_names
('soma_joinid', 'mystring')
sdf1$index_column_names()
  1. 'soma_joinid'
  2. 'mystring'

Notice the soma_joinid slot of the dataframe’s domain is as requested.

Another point is that the domain can’t be restricted for string-typed index columns.

At creation time, you can fill in their domain slot in one of two ways:

  • Python
  • R
domain = [(0, 9), None]
# or
domain = [(0, 9), ("", "")]
    domain=list(soma_joinid=c(0, 9), mystring=NULL),
    # or
    domain=list(soma_joinid=c(0, 9), mystring=c('', '')),

In either case, the domain slot for a string-typed index column will read back as a pair of empty strings:

  • Python
  • R
sdf1.domain
((0, 9), ('', ''))
sdf1$domain()
$soma_joinid
  1. 0
  2. 9
$mystring
  1. ''
  2. ''
  • Python
  • R
sdf1.maxdomain
((0, 9223372036854775796), ('', ''))
sdf1$maxdomain()
$soma_joinid
integer64
[1] 0                   9223372036854773759

$mystring
[1] "" ""

Now create and inspect the other dataframe. Here, soma_joinid isn’t an index column at all. This is fine, as long as the index-column values in the data you write uniquely identify each row.

  • Python
  • R
with tiledbsoma.DataFrame.create(
    sdfuri2,
    schema=schema,
    index_column_names=["myfloat", "myint"],
    domain=[(0, 999), (-1000, 1000)],
) as sdf2:
    sdf2.write(data)
sdf2 <- SOMADataFrameCreate(
  sdfuri2,
  schema = schema,
  index_column_names = c("myfloat", "myint"),
  domain = list(myfloat = c(0, 999), myint = c(-1000, 1000))
)
sdf2$write(data)
sdf2$close()
  • Python
  • R
sdf2 = tiledbsoma.DataFrame.open(sdfuri2)
sdf2 <- SOMADataFrameOpen(sdfuri2)
  • Python
  • R
sdf2.index_column_names
('myfloat', 'myint')
sdf2$index_column_names()
  1. 'myfloat'
  2. 'myint'

The domain reads back as written.

  • Python
  • R
sdf2.domain
((0.0, 999.0), (-1000, 1000))
sdf2$domain()
$myfloat
  1. 0
  2. 999
$myint
  1. -1000
  2. 1000
  • Python
  • R
sdf2.maxdomain
((-3.4028234663852886e+38, 3.4028234663852886e+38), (-2147483648, 2147481645))
sdf2$maxdomain()
$myfloat
  1. -3.40282346638529e+38
  2. 3.40282346638529e+38
$myint
  1. -2147483647
  2. 2147481599
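
Two conditions from this section lend themselves to a quick pre-write check: the index-column values must uniquely identify each row, and they must fall within the domain. The sketch below does this in plain Python for numeric index columns only. `check_rows` is a hypothetical helper and isn’t something TileDB-SOMA provides; the data and domain are the ones used for the second dataframe above.

```python
from typing import Dict, List, Sequence, Tuple


def check_rows(
    data: Dict[str, List],
    index_column_names: Sequence[str],
    domain: Sequence[Tuple[float, float]],
) -> None:
    """Raise ValueError if index values repeat or fall outside the domain."""
    columns = [data[name] for name in index_column_names]
    keys = list(zip(*columns))  # one tuple of index values per row
    if len(set(keys)) != len(keys):
        raise ValueError("index-column values do not uniquely identify each row")
    for name, column, (lower, upper) in zip(index_column_names, columns, domain):
        for value in column:
            if not lower <= value <= upper:
                raise ValueError(
                    f"{name}={value} is outside the domain ({lower}, {upper})"
                )


data = {
    "soma_joinid": [0, 1],
    "mystring": ["hello", "world"],
    "myint": [33, 44],
    "myfloat": [4.5, 5.5],
}

# Same index columns and domain as the second dataframe: passes silently.
check_rows(data, ["myfloat", "myint"], [(0, 999), (-1000, 1000)])
```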

Use resize at the dataframe/array level with the SOMA API

Earlier in this tutorial, you learned a fast and convenient way to resize all the dataframes and arrays within an experiment.

However, if you prefer, you can apply resizes one dataframe or array at a time.

For N-dimensional arrays, which call to use depends on whether the array already has a new-style shape:

  • If the array’s tiledbsoma_has_upgraded_shape reports False (the array was created before TileDB-SOMA 1.15 and hasn’t been upgraded), invoke the tiledbsoma_upgrade_shape method.
  • Otherwise (the array has been upgraded, or was created using TileDB-SOMA 1.15 or later), invoke the resize method.
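
The decision in this list can be wrapped in a small dispatcher. In the sketch below, `apply_new_shape` is a hypothetical helper that relies only on the attribute and method names used in this tutorial (tiledbsoma_has_upgraded_shape, tiledbsoma_upgrade_shape, resize, and their check_only argument). `FakeArray` is a stand-in so the example runs without a real array, and the shapes passed to it are illustrative.

```python
from typing import List, Sequence, Tuple


def apply_new_shape(array, new_shape: Sequence[int], check_only: bool = True):
    """Dispatch to the upgrade or resize call, following the list above."""
    if array.tiledbsoma_has_upgraded_shape:
        return array.resize(new_shape, check_only=check_only)
    return array.tiledbsoma_upgrade_shape(new_shape, check_only=check_only)


class FakeArray:
    """Stand-in for an array opened for writing; records which call was made."""

    def __init__(self, upgraded: bool) -> None:
        self.tiledbsoma_has_upgraded_shape = upgraded
        self.calls: List[Tuple[str, Tuple[int, ...], bool]] = []

    def resize(self, new_shape, check_only=False):
        self.calls.append(("resize", tuple(new_shape), check_only))

    def tiledbsoma_upgrade_shape(self, new_shape, check_only=False):
        self.calls.append(("tiledbsoma_upgrade_shape", tuple(new_shape), check_only))


old_style = FakeArray(upgraded=False)
apply_new_shape(old_style, [2700, 13714])
print(old_style.calls)  # [('tiledbsoma_upgrade_shape', (2700, 13714), True)]

new_style = FakeArray(upgraded=True)
apply_new_shape(new_style, [7200, 13714])
print(new_style.calls)  # [('resize', (7200, 13714), True)]
```

Defaulting check_only to True in the helper is a deliberately cautious design choice for this sketch: nothing is modified unless the caller opts in.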

Note: for purposes of this document, two experiments are shown: a before and an after. For your purposes, you would use a single experiment, and operate on only that.

Unpack a pre-1.15 experiment:

  • Python
  • R
pre_115_exp = tiledbsoma.Experiment.open(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-not-upgraded"
)
X = pre_115_exp.ms["RNA"].X["data"]
pre_115_exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-not-upgraded")
X <- pre_115_exp$ms$get("RNA")$X$get("data")

Notice that the X array has not been upgraded, and that its shape reports the same as maxshape:

  • Python
  • R
X.tiledbsoma_has_upgraded_shape
False
X$tiledbsoma_has_upgraded_shape()
FALSE
  • Python
  • R
X.shape
(2147483646, 2147483646)
X$shape()
integer64
[1] 2147483646 2147483646

Now give the X array the new-style shape. First, consult its non-empty domain to get a report of what data has already been written there:

  • Python
  • R
X.non_empty_domain()
((0, 2699), (0, 13713))
X$non_empty_domain()
$soma_dim_0
  1. 0
  2. 2699
$soma_dim_1
  1. 0
  2. 13713
  • Python
  • R
with tiledbsoma.Experiment.open(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", "w"
) as exp:
    exp.ms["RNA"].X["data"].tiledbsoma_upgrade_shape(
        [X.non_empty_domain()[0][1] + 1, X.non_empty_domain()[1][1] + 1],
        check_only=True,  # Omit this when operating on live data
    )
ned <- X$non_empty_domain(max_only = TRUE)
exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", "WRITE")
exp$ms$get("RNA")$X$get("data")$tiledbsoma_upgrade_shape(
  c(ned[[1]] + 1, ned[[2]] + 1), # Shape is one more than the largest zero-based index
  check_only = TRUE # Omit this when operating on live data
)
exp$close()
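
The shape passed to the upgrade is one more than the upper bound of each dimension's non-empty domain, because the bounds are inclusive and zero-based. As a minimal sketch of that arithmetic (`shape_from_non_empty_domain` is a hypothetical helper, not a TileDB-SOMA function):

```python
from typing import Sequence, Tuple


def shape_from_non_empty_domain(ned: Sequence[Tuple[int, int]]) -> Tuple[int, ...]:
    """Return the smallest shape covering an inclusive (low, high) range per dimension."""
    return tuple(high + 1 for _low, high in ned)


# Using the non-empty domain reported for X above
print(shape_from_non_empty_domain(((0, 2699), (0, 13713))))  # (2700, 13714)
```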

Next, reopen the experiment to find out what happened:

  • Python
  • R
pre_115_exp = tiledbsoma.Experiment.open(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded"
)
X = pre_115_exp.ms["RNA"].X["data"]
pre_115_exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
X <- pre_115_exp$ms$get("RNA")$X$get("data")
  • Python
  • R
X.tiledbsoma_has_upgraded_shape
True
X$tiledbsoma_has_upgraded_shape()
TRUE
  • Python
  • R
X.shape
(2700, 13714)
X$shape()
integer64
[1] 2700  13714
  • Python
  • R
X.maxshape
(2147483646, 2147483646)
X$maxshape()
integer64
[1] 2147483646 2147483646

If you want, you can resize it further:

  • Python
  • R
with tiledbsoma.Experiment.open(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", "w"
) as exp:
    # Omit check_only=True when operating on live data
    exp.ms["RNA"].X["data"].resize([7200, 1848], check_only=True)
exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded", "WRITE")
exp$ms$get("RNA")$X$get("data")$resize(
  c(7200, 1848),
  check_only = TRUE # Omit this when operating on live data
)
exp$close()

For dataframes, the process is similar. If you want to expand only the soft limits for soma_joinid, you can use these methods instead:

  • If the dataframe’s tiledbsoma_has_upgraded_domain reports False, invoke the .tiledbsoma_upgrade_domain method.
  • Otherwise, invoke the .change_domain method.
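
The dataframe case can be sketched the same way. This is an illustration under assumptions, not library code: `set_dataframe_domain` is a hypothetical name, the three dataframe members are used as they appear in this tutorial, and `_StubDataFrame` stands in for a real dataframe so the sketch runs on its own. Note that domain bounds are inclusive pairs, unlike shapes.

```python
from typing import Any, Sequence


def set_dataframe_domain(
    df: Any, new_domain: Sequence[Sequence[int]], check_only: bool = True
) -> Any:
    """Hypothetical helper: upgrade the domain if needed, otherwise change it."""
    if not df.tiledbsoma_has_upgraded_domain:
        return df.tiledbsoma_upgrade_domain(new_domain, check_only=check_only)
    return df.change_domain(new_domain, check_only=check_only)


class _StubDataFrame:
    """Stand-in for a SOMA dataframe that reports which method was called."""

    def __init__(self, upgraded: bool) -> None:
        self.tiledbsoma_has_upgraded_domain = upgraded

    def tiledbsoma_upgrade_domain(self, newdomain, check_only=False):
        return ("tiledbsoma_upgrade_domain", newdomain, check_only)

    def change_domain(self, newdomain, check_only=False):
        return ("change_domain", newdomain, check_only)


# One inclusive (low, high) pair per index column; here only soma_joinid
before = set_dataframe_domain(_StubDataFrame(upgraded=False), [[0, 2699]])
after = set_dataframe_domain(_StubDataFrame(upgraded=True), [[0, 9999]])
```
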
  • Python
  • R
pre_115_exp = tiledbsoma.Experiment.open(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-not-upgraded"
)
pre_115_exp.obs.tiledbsoma_has_upgraded_domain
False
pre_115_exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-not-upgraded")
pre_115_exp$obs$tiledbsoma_has_upgraded_domain()
FALSE
  • Python
  • R
pre_115_exp.obs.domain
((0, 2147483646),)
pre_115_exp$obs$domain()
$soma_joinid =
  1. 0
  2. 2147483646
  • Python
  • R
pre_115_exp.obs.maxdomain
((0, 2147483646),)
pre_115_exp$obs$maxdomain()
$soma_joinid =
  1. 0
  2. 2147483646
  • Python
  • R
pre_115_exp.obs.non_empty_domain()
((0, 2699),)
pre_115_exp$obs$non_empty_domain()
$soma_joinid =
  1. 0
  2. 2699
  • Python
with tiledbsoma.Experiment.open(pre_115_exp.uri, "w") as exp:
    exp.obs.tiledbsoma_upgrade_domain(
        [[0, pre_115_exp.obs.non_empty_domain()[0][1]]],  # Domain bounds are inclusive
        check_only=True,  # Omit check_only=True when operating on live data
    )
  • Python
  • R
pre_115_exp = tiledbsoma.Experiment.open(
    "tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded"
)
pre_115_exp <- SOMAExperimentOpen("tiledb://TileDB-Inc/shapes-example-pre-1.15-upgraded")
  • Python
  • R
pre_115_exp.obs.tiledbsoma_has_upgraded_domain
True
pre_115_exp$obs$tiledbsoma_has_upgraded_domain()
TRUE
  • Python
  • R
pre_115_exp.obs.domain
((0, 2699),)
pre_115_exp$obs$domain()
$soma_joinid =
  1. 0
  2. 2699
  • Python
  • R
pre_115_exp.obs.maxdomain
((0, 2147483646),)
pre_115_exp$obs$maxdomain()
$soma_joinid =
  1. 0
  2. 2147483646

TileDB-SOMA shape and domain in comparison to other TileDB terminology

TileDB-SOMA uses TileDB to implement the SOMA specification, so you may encounter terminology from both TileDB and SOMA. This document uses SOMA terminology only. If you are familiar with broader TileDB concepts, here are the mappings.

  • Core domain:
    • This has always existed.
    • This is immutable: it cannot be made larger or smaller once a dataframe or array has been created.
    • A SOMA DataFrame’s maxdomain is implemented by core domain.
    • A SOMA SparseNDArray or DenseNDArray’s maxshape is implemented by core domain.
    • It’s a runtime error to read or write data outside these boundaries.
    • This is a hard limit, in that it can’t be increased.
  • Core current_domain:
    • This was introduced in 2024 as of version 2.26 of the open-source core of TileDB, and is available in TileDB-SOMA as of version 1.15.
    • This is mutable: it can’t be made smaller after dataframe or array creation, but you can make it larger, up to the core domain (SOMA maxdomain/maxshape).
    • A SOMA DataFrame’s domain is implemented by core current_domain.
    • A SOMA SparseNDArray or DenseNDArray’s shape is implemented by core current_domain.
    • TileDB-SOMA will throw a runtime error if you try to read or write data outside these boundaries: you will see the error message "A range was set outside of the current domain."
    • This is a soft limit, in that it may be increased up to the hard limit.
  • Dataframes/arrays created by TileDB-SOMA 1.14 or lower:
    • These will necessarily have the core domain (SOMA maxdomain and maxshape, respectively).
    • These won’t have the core current_domain.
    • When you ask for a SOMA dataset’s domain or shape, you get the same value as maxdomain or maxshape.
    • Their tiledbsoma_has_upgraded_domain() and tiledbsoma_has_upgraded_shape() methods return False.
    • Using the upgrade feature mentioned previously, you can apply a core current_domain.
  • Dataframes and arrays created by TileDB-SOMA 1.15 and later, or that have been upgraded:
    • These will necessarily have the core domain (SOMA maxdomain and maxshape, respectively).
    • These will also have the core current_domain (SOMA domain and shape, respectively).
    • Their tiledbsoma_has_upgraded_domain() and tiledbsoma_has_upgraded_shape() methods return True.
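
The soft-limit and hard-limit rules above can be modeled in a few lines. The following is a plain-Python sketch of those rules only: `check_resize` is a hypothetical function, it does not call TileDB-SOMA, and its messages are not the library's own.

```python
from typing import Sequence


def check_resize(current: Sequence[int], new: Sequence[int], maximum: Sequence[int]) -> str:
    """Return "ok" if `new` is a legal shape, otherwise a short reason."""
    if not (len(current) == len(new) == len(maximum)):
        return "dimension count mismatch"
    for i, (cur, want, cap) in enumerate(zip(current, new, maximum)):
        if want < cur:
            # The soft limit (shape) can grow but never shrink
            return f"dimension {i}: cannot shrink from {cur} to {want}"
        if want > cap:
            # The hard limit (maxshape) is fixed at creation time
            return f"dimension {i}: {want} exceeds the hard limit {cap}"
    return "ok"


shape = (2700, 13714)
maxshape = (2147483646, 2147483646)
print(check_resize(shape, (7200, 13714), maxshape))        # ok
print(check_resize(shape, (2000, 13714), maxshape))        # dimension 0: cannot shrink from 2700 to 2000
print(check_resize(shape, (2147483647, 13714), maxshape))  # dimension 0: 2147483647 exceeds the hard limit 2147483646
```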