Learn how to consolidate fragment metadata on your arrays.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial demonstrates how to consolidate fragment metadata of arrays. Before running this tutorial, it is recommended that you read the following sections:
First, import the necessary libraries, set the array URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created arrays with the same name.
# Import necessary libraries
import tiledb
import numpy as np
import shutil
import os.path

# Set array URI
array_uri = os.path.expanduser("~/consolidation_fragment_meta")

# Delete array if it already exists
if os.path.exists(array_uri):
    shutil.rmtree(array_uri)

# Import necessary libraries
library(tiledb)

# Set array URI
array_uri <- path.expand("~/consolidation_fragment_meta_r")

# Delete array if it already exists
if (file.exists(array_uri)) {
  unlink(array_uri, recursive = TRUE)
}
Next, create the array by specifying its schema. This example uses a sparse array, but the functionality is similar for dense arrays. Some differences in consolidation exist between sparse and dense arrays, and other sections of Academy cover those differences.
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema with `sparse=True`.
sch = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[a])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_uri, sch)

# Create the two dimensions
d1 <- tiledb_dim("d1", c(0L, 3L), 2L, "INT32")
d2 <- tiledb_dim("d2", c(0L, 3L), 2L, "INT32")

# Create a domain using the two dimensions
dom <- tiledb_domain(dims = c(d1, d2))

# Create an attribute
a <- tiledb_attr("a", type = "INT32")

# Create the array schema, setting `sparse = TRUE`
sch <- tiledb_array_schema(dom, a, sparse = TRUE)

# Create the array on disk (it will initially be empty)
arr <- tiledb_array_create(array_uri, sch)
Populate the array with a first write, which generates the first fragment.
# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 3], dtype=np.int32)
d2_data = np.array([0, 1, 1], dtype=np.int32)
a_data = np.array([4, 1, 6], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(2L, 0L, 3L)
d2_data <- c(0L, 1L, 1L)
a_data <- c(4L, 1L, 6L)

# Open the array for writing and write data to the array
arr <- tiledb_array(
  uri = array_uri,
  query_type = "WRITE",
  return_as = "data.frame"
)
arr[d1_data, d2_data] <- a_data

# Close the array
arr <- tiledb_array_close(arr)
Perform a second write, so that two fragments are generated.
# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 1], dtype=np.int32)
d2_data = np.array([2, 3, 3], dtype=np.int32)
a_data = np.array([5, 2, 3], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(2L, 0L, 3L)
d2_data <- c(2L, 3L, 3L)
a_data <- c(5L, 2L, 3L)

# Open the array for writing and write data to the array
arr <- tiledb_array_open(arr, type = "WRITE")
arr[d1_data, d2_data] <- a_data

# Close the array
arr <- tiledb_array_close(arr)
The array is a folder at the path specified in array_uri. Other sections of the Academy explain its contents in detail, but notice that there are two fragment directories in the __fragments directory and two commit files in the __commits directory. Their names are prefixed by _t1_t1 and _t2_t2, where t1 and t2 are the timestamps at which the two write operations above created those fragments.
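One way to check the layout described above is to list the relevant subdirectories of the array folder. The following is a minimal sketch that uses only the Python standard library; the helper name list_array_layout is hypothetical and not part of TileDB, and the sketch assumes the array lives on local storage, as in this tutorial.

```python
import os


def list_array_layout(array_uri, subdirs=("__fragments", "__commits", "__fragment_meta")):
    """Return a dict mapping each subdirectory name to its sorted entries.

    `list_array_layout` is a hypothetical helper for illustration only.
    """
    layout = {}
    for sub in subdirs:
        path = os.path.join(array_uri, sub)
        # A subdirectory that does not exist is reported as empty
        layout[sub] = sorted(os.listdir(path)) if os.path.isdir(path) else []
    return layout


# Print the entries of each subdirectory for the tutorial's array path
array_uri = os.path.expanduser("~/consolidation_fragment_meta")
for sub, entries in list_array_layout(array_uri).items():
    print(sub)
    for name in entries:
        print("  ", name)
```

After the two writes, the __fragments and __commits entries should each show two names. Running the same helper again later makes it easy to compare the layout before and after consolidation.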
# Open the array in read mode
arr <- tiledb_array_open(arr, type = "READ")

# Show the entire array
cat("Entire array:\n")
print(arr[])
arr <- tiledb_array_close(arr)
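The consolidation step itself can be sketched in Python as follows. This is a hedged sketch rather than the tutorial's own code: it assumes TileDB-Py's tiledb.consolidate function together with the sm.consolidation.mode configuration parameter set to fragment_meta, and the wrapper name consolidate_fragment_metadata is hypothetical.

```python
# Configuration that selects fragment-metadata consolidation
FRAGMENT_META_CONFIG = {"sm.consolidation.mode": "fragment_meta"}


def consolidate_fragment_metadata(array_uri, settings=FRAGMENT_META_CONFIG):
    """Consolidate the fragment metadata of the array at `array_uri`.

    Hypothetical wrapper for illustration; it only forwards the
    configuration to tiledb.consolidate.
    """
    # Imported inside the function so that defining the helper
    # does not itself require TileDB-Py to be installed
    import tiledb

    config = tiledb.Config(dict(settings))
    tiledb.consolidate(array_uri, config=config)


# Usage, with the array_uri defined earlier in this tutorial:
#   consolidate_fragment_metadata(array_uri)
```

Passing the mode through a Config object keeps the call explicit about which kind of consolidation runs, since other modes consolidate different parts of the array.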
Inspecting the file hierarchy of the array again, observe that TileDB created a new file inside the __fragment_meta directory, prefixed by _t1_t2 and with the suffix .meta. This file holds the metadata of both fragments present in the array.
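To make the _t1_t2 prefix concrete, the sketch below derives the timestamp range that a consolidated file would span from the names of the fragments it covers. It assumes names laid out as leading underscores, a start timestamp, an underscore, an end timestamp, and then further underscore-separated parts; that layout, the helper names, and the sample names are all illustrative assumptions, not output from a real array.

```python
def timestamp_range(name):
    """Extract (start, end) timestamps from a name shaped like __<start>_<end>_<rest>.

    The name layout is an assumption made for this illustration.
    """
    parts = name.lstrip("_").split("_")
    return int(parts[0]), int(parts[1])


def consolidated_range(fragment_names):
    """Return the (earliest start, latest end) range spanned by the given fragments."""
    ranges = [timestamp_range(n) for n in fragment_names]
    return min(start for start, _ in ranges), max(end for _, end in ranges)


# Made-up names standing in for the two fragments written at t1=100 and t2=200
fragments = ["__100_100_aaaa_22", "__200_200_bbbb_22"]
print(consolidated_range(fragments))
```

With one fragment written at t1 and another at t2, the range runs from t1 to t2, which matches the prefix of the .meta file described above.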
Read the array again and observe that the results are unchanged: consolidating fragment metadata does not alter the data in the array. Consolidation, in combination with vacuuming, significantly boosts performance when an array has accumulated many write operations. For more details, visit the Performance: Tuning Writes section.
# Open the array in read mode
arr <- tiledb_array_open(arr, type = "READ")

# Show the entire array
cat("Entire array:\n")
print(arr[])
arr <- tiledb_array_close(arr)