Vacuum Consolidated Fragment Metadata

Learn how to vacuum fragment metadata after fragment metadata consolidation.

How to run this tutorial

You can run this tutorial in two ways:

Locally on your machine.
On TileDB Cloud.

However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.

This tutorial demonstrates how to vacuum fragment metadata of arrays, after consolidation takes place. Before running this tutorial, it is recommended that you read the following sections:

First, import the necessary libraries, set the array URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created arrays with the same name.

Python
R

# Import necessary libraries
import os.path
import shutil

import numpy as np
import tiledb

# Set array URI
array_uri = os.path.expanduser("~/vacuuming_fragment_meta")

# Delete array if it already exists
if os.path.exists(array_uri):
    shutil.rmtree(array_uri)

# Import necessary libraries
library(tiledb)

# Set array URI
array_uri <- path.expand("~/vacuuming_fragment_meta_r")

# Delete array if it already exists
if (file.exists(array_uri)) {
  unlink(array_uri, recursive = TRUE)
}

Next, create the array by specifying its schema. This example uses a sparse array, but the functionality is similar for dense arrays.

Python
R

# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema with `sparse=True`.
sch = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[a])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_uri, sch)

# Create the two dimensions
d1 <- tiledb_dim("d1", c(0L, 3L), 2L, "INT32")
d2 <- tiledb_dim("d2", c(0L, 3L), 2L, "INT32")

# Create a domain using the two dimensions
dom <- tiledb_domain(dims = c(d1, d2))

# Create an attribute
a <- tiledb_attr("a", type = "INT32")

# Create the array schema, setting `sparse = TRUE`
sch <- tiledb_array_schema(dom, a, sparse = TRUE)

# Create the array on disk (it will initially be empty)
arr <- tiledb_array_create(array_uri, sch)

Write some data to the array.

Python
R

# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 3], dtype=np.int32)
d2_data = np.array([0, 1, 1], dtype=np.int32)
a_data = np.array([4, 1, 6], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(2L, 0L, 3L)
d2_data <- c(0L, 1L, 1L)
a_data <- c(4L, 1L, 6L)

# Open the array for writing and write data to the array
arr <- tiledb_array(
  uri = array_uri,
  query_type = "WRITE",
  return_as = "data.frame"
)
arr[d1_data, d2_data] <- a_data

# Close the array
arr <- tiledb_array_close(arr)

Perform a second write, so that two fragments are generated.

Python
R

# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 1], dtype=np.int32)
d2_data = np.array([2, 3, 3], dtype=np.int32)
a_data = np.array([5, 2, 3], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(2L, 0L, 3L)
d2_data <- c(2L, 3L, 3L)
a_data <- c(5L, 2L, 3L)

# Open the array for writing and write data to the array
arr <- tiledb_array_open(
  arr,
  type = "WRITE"
)
arr[d1_data, d2_data] <- a_data

# Close the array
arr <- tiledb_array_close(arr)

Now consolidate the array, passing fragment_meta as the value of configuration parameter sm.consolidation.mode.

Python
R

# Consolidate
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate(array_uri, config=config)

# Consolidate
cfg <- tiledb_config(
  config = c(
    "sm.consolidation.mode" = "fragment_meta"
  )
)

array_consolidate(array_uri, cfg = cfg)

Perform a third write, as you need to consolidate a second time to demonstrate the vacuuming functionality for fragment metadata.

Python
R

# Prepare some data in numpy arrays
d1_data = np.array([0], dtype=np.int32)
d2_data = np.array([0], dtype=np.int32)
a_data = np.array([0], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(0L)
d2_data <- c(0L)
a_data <- c(0L)

# Open the array for writing and write data to the array
arr <- tiledb_array_open(
  arr,
  type = "WRITE"
)
arr[d1_data, d2_data] <- a_data

# Close the array
arr <- tiledb_array_close(arr)

Consolidate the fragment metadata once again.

Python
R

# Consolidate
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate(array_uri, config=config)

# Consolidate
cfg <- tiledb_config(
  config = c(
    "sm.consolidation.mode" = "fragment_meta"
  )
)

array_consolidate(array_uri, cfg = cfg)

Inspect the file hierarchy of the array and observe the two files inside the __fragment_meta directory: one from the first consolidation operation and one from the second.

/Users/stavrospapadopoulos/vacuuming_fragment_meta
├── __commits
│   ├── __1716157715780_1716157715780_46056e69bc7252f6465a33807b5f5cbc_21.wrt
│   ├── __1716157715788_1716157715788_50cf09972a02607b967c911a37f26991_21.wrt
│   └── __1716157715803_1716157715803_1bcd9a8607ba94fd01084384e8848fe0_21.wrt
├── __fragment_meta
│   ├── __1716157715780_1716157715788_185d305eba80aadf0c97cdc8c3f4947c_21.meta
│   └── __1716157715780_1716157715803_15d36422e125b66eeb0391c601188222_21.meta
├── __fragments
│   ├── __1716157715780_1716157715780_46056e69bc7252f6465a33807b5f5cbc_21
│   │   ├── __fragment_metadata.tdb
│   │   ├── a0.tdb
│   │   ├── d0.tdb
│   │   └── d1.tdb
│   ├── __1716157715788_1716157715788_50cf09972a02607b967c911a37f26991_21
│   │   ├── __fragment_metadata.tdb
│   │   ├── a0.tdb
│   │   ├── d0.tdb
│   │   └── d1.tdb
│   └── __1716157715803_1716157715803_1bcd9a8607ba94fd01084384e8848fe0_21
│       ├── __fragment_metadata.tdb
│       ├── a0.tdb
│       ├── d0.tdb
│       └── d1.tdb
├── __labels
├── __meta
└── __schema
    ├── __1716157715776_1716157715776_00000002693ee06535fe291535511403
    └── __enumerations

11 directories, 18 files
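The relationship between the two files in __fragment_meta can be read off their names. The sketch below assumes, based only on the listing above, that each name has the form `__<start>_<end>_<id>_<version>.meta`, where the first two numbers are the timestamp range the file covers. The helper names `parse_meta_name` and `covers` are hypothetical and not part of the TileDB API.

```python
from typing import NamedTuple


class MetaFileName(NamedTuple):
    start: int    # first number in the name (assumed: start timestamp)
    end: int      # second number in the name (assumed: end timestamp)
    ident: str    # hexadecimal identifier
    version: int  # trailing number (assumed: format version)


def parse_meta_name(name: str) -> MetaFileName:
    """Split '__<start>_<end>_<id>_<version>.meta' into its parts."""
    stem = name[: -len(".meta")] if name.endswith(".meta") else name
    start, end, ident, version = stem.lstrip("_").split("_")
    return MetaFileName(int(start), int(end), ident, int(version))


def covers(outer: MetaFileName, inner: MetaFileName) -> bool:
    """Return True if outer's range fully contains inner's range."""
    return outer.start <= inner.start and inner.end <= outer.end


# File names copied from the listing above
first = parse_meta_name(
    "__1716157715780_1716157715788_185d305eba80aadf0c97cdc8c3f4947c_21.meta"
)
second = parse_meta_name(
    "__1716157715780_1716157715803_15d36422e125b66eeb0391c601188222_21.meta"
)

print(covers(second, first))  # True
print(covers(first, second))  # False
```

Under that reading, the file from the second consolidation spans the range of the first, so the first file no longer adds information.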

Read the array and observe that all the data from the three writes are in the result, as expected.

Python
R

# Read array
with tiledb.open(array_uri, "r") as A:
    print(A[:])

OrderedDict({'a': array([0, 1, 2, 3, 4, 6, 5], dtype=int32), 'd1': array([0, 0, 0, 1, 2, 3, 2], dtype=int32), 'd2': array([0, 1, 3, 3, 0, 1, 2], dtype=int32)})

# Open the array in read mode
arr <- tiledb_array_open(arr, type = "READ")

# Show the entire array
cat("Entire array:\n")
print(arr[])

arr <- tiledb_array_close(arr)

Entire array:
  d1 d2 a
1  0  0 0
2  0  1 1
3  2  0 4
4  3  1 6
5  0  3 2
6  2  2 5
7  3  3 3

Now perform a vacuuming operation, by passing fragment_meta as the value of configuration parameter sm.vacuum.mode.

Python
R

# Vacuum
config = tiledb.Config({"sm.vacuum.mode": "fragment_meta"})
tiledb.vacuum(array_uri, config=config)

# Vacuum
cfg <- tiledb_config(
  config = c(
    "sm.vacuum.mode" = "fragment_meta"
  )
)

array_vacuum(array_uri, cfg = cfg)

Inspect the file hierarchy of the array again and observe the single file in the __fragment_meta directory, which contains the fragment metadata from all three writes performed above.

/Users/stavrospapadopoulos/vacuuming_fragment_meta
├── __commits
│   ├── __1716157715780_1716157715780_46056e69bc7252f6465a33807b5f5cbc_21.wrt
│   ├── __1716157715788_1716157715788_50cf09972a02607b967c911a37f26991_21.wrt
│   └── __1716157715803_1716157715803_1bcd9a8607ba94fd01084384e8848fe0_21.wrt
├── __fragment_meta
│   └── __1716157715780_1716157715803_15d36422e125b66eeb0391c601188222_21.meta
├── __fragments
│   ├── __1716157715780_1716157715780_46056e69bc7252f6465a33807b5f5cbc_21
│   │   ├── __fragment_metadata.tdb
│   │   ├── a0.tdb
│   │   ├── d0.tdb
│   │   └── d1.tdb
│   ├── __1716157715788_1716157715788_50cf09972a02607b967c911a37f26991_21
│   │   ├── __fragment_metadata.tdb
│   │   ├── a0.tdb
│   │   ├── d0.tdb
│   │   └── d1.tdb
│   └── __1716157715803_1716157715803_1bcd9a8607ba94fd01084384e8848fe0_21
│       ├── __fragment_metadata.tdb
│       ├── a0.tdb
│       ├── d0.tdb
│       └── d1.tdb
├── __labels
├── __meta
└── __schema
    ├── __1716157715776_1716157715776_00000002693ee06535fe291535511403
    └── __enumerations

11 directories, 17 files
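If a tree utility is not available, the contents of the __fragment_meta directory can also be checked from Python. The following is a minimal sketch using only the standard library. The helper name `list_fragment_meta_files` is hypothetical, and the path in the usage line assumes the local array location used in this tutorial.

```python
from pathlib import Path


def list_fragment_meta_files(array_dir):
    """Return the sorted names of the .meta files under <array_dir>/__fragment_meta."""
    meta_dir = Path(array_dir).expanduser() / "__fragment_meta"
    if not meta_dir.is_dir():
        # The array does not exist locally or has no consolidated fragment metadata
        return []
    return sorted(p.name for p in meta_dir.glob("*.meta"))


# Usage: prints an empty list if the array does not exist at this path
print(list_fragment_meta_files("~/vacuuming_fragment_meta"))
```

Running this before and after the vacuuming step should show the list shrinking from two entries to one, matching the listings above.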

Reading the array again returns all the data from the three writes, as expected. Vacuuming, in combination with consolidation, helps significantly boost performance in the presence of multiple write operations. For more details, visit the Performance: Tuning Writes section.

Python
R

# Read array
with tiledb.open(array_uri, "r") as A:
    print(A[:])

OrderedDict({'a': array([0, 1, 2, 3, 4, 6, 5], dtype=int32), 'd1': array([0, 0, 0, 1, 2, 3, 2], dtype=int32), 'd2': array([0, 1, 3, 3, 0, 1, 2], dtype=int32)})

# Open the array in read mode
arr <- tiledb_array_open(arr, type = "READ")

# Show the entire array
cat("Entire array:\n")
print(arr[])

arr <- tiledb_array_close(arr)

Entire array:
  d1 d2 a
1  0  0 0
2  0  1 1
3  2  0 4
4  3  1 6
5  0  3 2
6  2  2 5
7  3  3 3
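One way to confirm that vacuuming left the data unchanged is to compare the two read results as sets of cells, independent of the order in which the cells are returned. The sketch below uses plain Python lists with the values copied from the Python output shown in this tutorial. The helper name `as_cells` is hypothetical.

```python
def as_cells(result):
    """Turn a dict of equal-length columns into a sorted list of (d1, d2, a) tuples."""
    return sorted(zip(result["d1"], result["d2"], result["a"]))


# Values copied from the read before vacuuming
before = {
    "a": [0, 1, 2, 3, 4, 6, 5],
    "d1": [0, 0, 0, 1, 2, 3, 2],
    "d2": [0, 1, 3, 3, 0, 1, 2],
}

# Values copied from the read after vacuuming
after = {
    "a": [0, 1, 2, 3, 4, 6, 5],
    "d1": [0, 0, 0, 1, 2, 3, 2],
    "d2": [0, 1, 3, 3, 0, 1, 2],
}

print(as_cells(before) == as_cells(after))  # True
print(len(as_cells(after)))  # 7
```

The seven cells correspond to the three cells from each of the first two writes plus the single cell from the third.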

Finally, clean up by deleting the array.

Python
R

# Delete the array
if os.path.exists(array_uri):
    shutil.rmtree(array_uri)

# Delete the array
if (file.exists(array_uri)) {
  unlink(array_uri, recursive = TRUE)
}