Learn how to vacuum fragment metadata after fragment metadata consolidation.
How to run this tutorial
You can run this tutorial in two ways: locally on your machine, or on TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial demonstrates how to vacuum the fragment metadata of an array after consolidation takes place. Before running this tutorial, it is recommended that you read the following sections:
First, import the necessary libraries, set the array URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created arrays with the same name.
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)
# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)
# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)
# Create the array schema with `sparse=True`.
sch = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[a])
# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_uri, sch)

# Create the two dimensions
d1 <- tiledb_dim("d1", c(0L, 3L), 2L, "INT32")
d2 <- tiledb_dim("d2", c(0L, 3L), 2L, "INT32")
# Create a domain using the two dimensions
dom <- tiledb_domain(dims = c(d1, d2))
# Create an attribute
a <- tiledb_attr("a", type = "INT32")
# Create the array schema, setting `sparse = TRUE`
sch <- tiledb_array_schema(dom, a, sparse = TRUE)
# Create the array on disk (it will initially be empty)
arr <- tiledb_array_create(array_uri, sch)
# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 3], dtype=np.int32)
d2_data = np.array([0, 1, 1], dtype=np.int32)
a_data = np.array([4, 1, 6], dtype=np.int32)
# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(2L, 0L, 3L)
d2_data <- c(0L, 1L, 1L)
a_data <- c(4L, 1L, 6L)
# Open the array for writing and write data to the array
arr <- tiledb_array(
  uri = array_uri,
  query_type = "WRITE",
  return_as = "data.frame"
)
arr[d1_data, d2_data] <- a_data
# Close the array
arr <- tiledb_array_close(arr)
Perform a second write, so that two fragments are generated.
# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 1], dtype=np.int32)
d2_data = np.array([2, 3, 3], dtype=np.int32)
a_data = np.array([5, 2, 3], dtype=np.int32)
# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(2L, 0L, 3L)
d2_data <- c(2L, 3L, 3L)
a_data <- c(5L, 2L, 3L)
# Open the array for writing and write data to the array
arr <- tiledb_array_open(arr, type = "WRITE")
arr[d1_data, d2_data] <- a_data
# Close the array
arr <- tiledb_array_close(arr)
Now consolidate the fragment metadata of the array by passing fragment_meta as the value of the configuration parameter sm.consolidation.mode.
Next, perform a third write, so that a third fragment is generated, and consolidate the fragment metadata once more.
# Prepare some data in numpy arrays
d1_data = np.array([0], dtype=np.int32)
d2_data = np.array([0], dtype=np.int32)
a_data = np.array([0], dtype=np.int32)
# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in an array
d1_data <- c(0L)
d2_data <- c(0L)
a_data <- c(0L)
# Open the array for writing and write data to the array
arr <- tiledb_array_open(arr, type = "WRITE")
arr[d1_data, d2_data] <- a_data
# Close the array
arr <- tiledb_array_close(arr)
Inspecting the file hierarchy of the array, observe the two files inside the __fragment_meta directory: one from the first consolidation operation and one from the second.
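One way to inspect that directory from Python is sketched below, assuming the array lives on local storage so that it can be browsed as an ordinary directory. The helper name `list_fragment_meta_files` is hypothetical.

```python
import os


def list_fragment_meta_files(array_dir: str) -> list[str]:
    """Return the sorted file names inside `<array_dir>/__fragment_meta`.

    Hypothetical helper for illustration; returns an empty list when the
    directory does not exist.
    """
    meta_dir = os.path.join(array_dir, "__fragment_meta")
    if not os.path.isdir(meta_dir):
        return []
    return sorted(os.listdir(meta_dir))
```

Printing `list_fragment_meta_files(array_uri)` before and after vacuuming makes the change in the number of consolidated files easy to see.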
# Open the array in read mode
arr <- tiledb_array_open(arr, type = "READ")
# Show the entire array
cat("Entire array:\n")
print(arr[])
arr <- tiledb_array_close(arr)
After vacuuming the fragment metadata, inspect the file hierarchy of the array again and observe the single file in the __fragment_meta directory, which contains the fragment metadata from all three writes performed above.
Reading the array again returns all the data from the three writes, as expected. Vacuuming, in combination with consolidation, helps significantly boost performance in the presence of multiple write operations. For more details, visit the Performance: Tuning Writes section.
# Open the array in read mode
arr <- tiledb_array_open(arr, type = "READ")
# Show the entire array
cat("Entire array:\n")
print(arr[])
arr <- tiledb_array_close(arr)