Use of Apache Arrow

Learn about the Apache Arrow data types used in SOMA and how to work with them.

SOMA uses Apache Arrow for its in-memory type system. This page explains why Arrow is used and what this means for working with SOMA data structures.

Why Arrow?

Arrow is a widely adopted open standard for in-memory data representation. It provides a rich set of data types and is designed for high performance and interoperability across languages. By building on Arrow, TileDB-SOMA gains these benefits and represents data consistently across different systems and tools.

Practical implications

While SOMA data structures are typically created automatically by converting data from other formats (e.g., Seurat or AnnData), it’s also possible to create them manually, in which case Arrow data types must be specified.

DataFrame example

To demonstrate this, you will create a SOMADataFrame with a user-defined schema, which must be specified as an Arrow schema.

Start by importing the necessary libraries:

Python

import tempfile

import pyarrow as pa
import tiledbsoma

R

library(arrow)
library(tiledbsoma)

Define a URI at which to store the data (this tutorial uses tempfile to create a temporary directory):

Python

URI = tempfile.mkdtemp(prefix="soma-dataframe-")
URI

R

URI <- tempfile(pattern = "soma-dataframe-")
URI

Define the schema

Create the Arrow schema, which defines the Arrow data types for each column.

Python

schema = pa.schema(
    [
        ("soma_joinid", pa.int64()),
        ("foo", pa.float64()),
        ("bar", pa.large_string()),
        ("baz", pa.bool_()),
    ]
)

schema

soma_joinid: int64
foo: double
bar: large_string
baz: bool

R

schema <- schema(
  soma_joinid = int64(),
  foo = float64(),
  bar = large_utf8(),
  baz = bool()
)

schema

Schema
soma_joinid: int64
foo: double
bar: large_string
baz: bool

Now, use this schema to create a new SOMADataFrame:

Python

sdf = tiledbsoma.DataFrame.create(
    uri=URI, schema=schema, index_column_names=["soma_joinid"]
)

R

sdf <- tiledbsoma::SOMADataFrameCreate(
  uri = URI,
  schema = schema,
  index_column_names = "soma_joinid"
)

sdf

<SOMADataFrame>
  uri: /tmp/RtmplquLEM/soma-dataframe-abe75f7db8e0 
  dimensions: soma_joinid 
  attributes: foo, bar, baz

This produced an empty SOMADataFrame with a TileDB schema that matches the provided Arrow schema.

TileDB-SOMA is strongly typed: a request that involves a given Arrow type is either fulfilled with that type or raises an error. This keeps the API self-consistent and predictable. For example, as you’ve seen, SOMA creation operations require an Arrow schema, so the schema accessor also returns an Arrow schema.

Python

sdf.schema

soma_joinid: int64 not null
foo: double
bar: large_string
baz: bool

R

sdf$schema()

Schema
soma_joinid: int64 not null
foo: double
bar: string
baz: bool

Perform a write

Similarly, data written to a SOMA object must be provided with the correct Arrow types. In this case, you will create a synthetic Arrow Table with the same schema used to create the SOMADataFrame.

Python

tbl = pa.Table.from_pydict(
    {
        "soma_joinid": [0, 1, 2, 3, 4],
        "foo": [4.1, 5.2, 6.3, 7.4, 8.5],
        "bar": ["apple", "ball", "cat", "dog", "egg"],
        "baz": [True, False, False, True, False],
    },
    schema=schema,
)

tbl

pyarrow.Table
soma_joinid: int64
foo: double
bar: large_string
baz: bool
----
soma_joinid: [[0,1,2,3,4]]
foo: [[4.1,5.2,6.3,7.4,8.5]]
bar: [["apple","ball","cat","dog","egg"]]
baz: [[true,false,false,true,false]]

R

tbl <- Table$create(
  soma_joinid = c(0, 1, 2, 3, 4),
  foo = c(4.1, 5.2, 6.3, 7.4, 8.5),
  bar = c("apple", "ball", "cat", "dog", "egg"),
  baz = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  schema = schema
)

tbl

Table
5 rows x 4 columns
$soma_joinid <int64>
$foo <double>
$bar <large_string>
$baz <bool>

This table can now be written to the SOMADataFrame:

Python

sdf.write(tbl)

<DataFrame '/var/folders/nr/1dsl0n155wj7wv083km8t1540000gn/T/soma-dataframe-ml14qysp' (open for 'w')>

R

sdf$write(tbl)

Remember to close the SOMADataFrame when you are done with it:

Python

sdf.close()

R

sdf$close()

Additional resources

Refer to the SOMA API specification for more technical details about SOMA’s use of Arrow.