You can specify exactly which fragments participate in consolidation.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial demonstrates how to consolidate an array's fragments by explicitly specifying which fragments participate in consolidation. The example code below creates an array with three fragments and consolidates the last two, leaving the first one intact.
First, import the necessary libraries, set the array URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created arrays with the same name.
# Import necessary libraries
import tiledb
import numpy as np
import shutil
import os.path

# Set array URI
array_uri = os.path.expanduser("~/consolidation_list")

# Delete array if it already exists
if os.path.exists(array_uri):
    shutil.rmtree(array_uri)
Next, create the array by specifying its schema. This example uses a sparse array, but the functionality is similar for dense arrays. Some differences in consolidation exist between sparse and dense arrays, and other sections of Academy cover those differences.
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema with `sparse=True`.
sch = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[a])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_uri, sch)
Populate the array with three separate writes, each of which creates one fragment. Start with the first write.
# Prepare some data in numpy arrays
d1_data = np.array([2, 0], dtype=np.int32)
d2_data = np.array([0, 1], dtype=np.int32)
a_data = np.array([4, 1], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data
Perform the second write, and save the name of the generated fragment.
# Prepare some data in numpy arrays
d1_data = np.array([1], dtype=np.int32)
d2_data = np.array([3], dtype=np.int32)
a_data = np.array([3], dtype=np.int32)

# This will hold the names of the fragments to consolidate
fragment_uris = []

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Keep only the last path component (the fragment name) of each written fragment URI
for key in A.last_write_info.keys():
    fragment_uris.append(key.rsplit("/", 1)[-1])
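The `rsplit("/", 1)[-1]` expression used when saving the fragment keeps only the text after the last `/` of the fragment URI, which is the fragment's name. The following minimal sketch isolates that step in a standalone helper. The helper name `fragment_name` and the example URI are hypothetical and only illustrate the string handling; the real keys of `last_write_info` depend on where your array is stored.

```python
def fragment_name(fragment_uri: str) -> str:
    """Return the last path component of a fragment URI."""
    # Drop any trailing slash first so the last component is never empty
    return fragment_uri.rstrip("/").rsplit("/", 1)[-1]


# Hypothetical URI, shaped like a fragment path on local storage
example_uri = (
    "file:///home/user/consolidation_list/__fragments/"
    "__1718448408109_1718448408109_6742b7a70711889a3dc9bac0b80297b8_21"
)
print(fragment_name(example_uri))
```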
Perform the third write, and save the name of the generated fragment as well.
# Prepare some data in numpy arrays
d1_data = np.array([2], dtype=np.int32)
d2_data = np.array([2], dtype=np.int32)
a_data = np.array([5], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

for key in A.last_write_info.keys():
    fragment_uris.append(key.rsplit("/", 1)[-1])
Inspect the array folder to confirm that three fragments have been created.
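One way to inspect the array folder from Python is sketched below using only the standard library. It assumes that the array lives on local storage and that fragments are stored as subdirectories of a `__fragments` folder inside the array directory, which is how recent TileDB format versions lay out arrays; the helper name `list_fragment_names` is hypothetical. If your TileDB-Py version provides `tiledb.array_fragments`, that is likely a more portable way to get the same information.

```python
import os


def list_fragment_names(array_uri: str) -> list[str]:
    """Return the sorted names of the fragment directories of a local array."""
    # Assumption: fragments live under `<array>/__fragments`
    fragments_dir = os.path.join(array_uri, "__fragments")
    if not os.path.isdir(fragments_dir):
        return []
    # Each fragment is a directory; ignore any stray files
    return sorted(
        entry.name for entry in os.scandir(fragments_dir) if entry.is_dir()
    )


array_uri = os.path.expanduser("~/consolidation_list")
for name in list_fragment_names(array_uri):
    print(name)
```

After the three writes, this should print three fragment names, one per write.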
Now run consolidation, passing an explicit list of the fragments to consolidate (the last two written fragments). Then vacuum the array, so that the two original fragments that participated in consolidation are deleted.
# Print fragments to consolidate (the 2nd and 3rd fragment)
print("Fragments to consolidate: ", fragment_uris)

# Consolidate only the explicitly listed fragments
tiledb.consolidate(
    array_uri,
    config=tiledb.Config({"sm.consolidation.mode": "fragments"}),
    fragment_uris=fragment_uris,
)

# Vacuum
tiledb.vacuum(array_uri, config=tiledb.Config({"sm.vacuum.mode": "fragments"}))
Fragments to consolidate: ['__1718448408109_1718448408109_6742b7a70711889a3dc9bac0b80297b8_21', '__1718448408116_1718448408116_32e385beba39ac403f4eb4abdf20b5b9_21']
Inspect the array's file hierarchy again and observe that only two fragments remain: the fragment from the first write, which is intact, and the fragment produced by consolidating the fragments from the second and third writes.
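To tell the two remaining fragments apart, it can help to look at the timestamps embedded in their names. The sketch below assumes the naming layout seen in the printed output, `__<start>_<end>_<uuid>_<format version>`, and further assumes that a consolidated fragment carries the start timestamp of its earliest input and the end timestamp of its latest input. Both are assumptions inferred from the sample names, not guarantees for every format version, and the helper names are hypothetical. Two writes that land on the same timestamp would also not be distinguishable this way.

```python
from typing import NamedTuple


class FragmentName(NamedTuple):
    start: int    # first timestamp in the name
    end: int      # second timestamp in the name
    uuid: str     # unique identifier
    version: int  # trailing format version number


def parse_fragment_name(name: str) -> FragmentName:
    """Split a fragment name of the assumed form __<start>_<end>_<uuid>_<version>."""
    start, end, uuid, version = name.lstrip("_").split("_")
    return FragmentName(int(start), int(end), uuid, int(version))


def spans_multiple_writes(name: str) -> bool:
    """Return True when the two timestamps in the name differ."""
    parsed = parse_fragment_name(name)
    return parsed.start != parsed.end


# Name taken from the printed output of the second write
single = "__1718448408109_1718448408109_6742b7a70711889a3dc9bac0b80297b8_21"
# Hypothetical name for a fragment consolidated from the second and third writes
merged = "__1718448408109_1718448408116_0123456789abcdef0123456789abcdef_21"

print(spans_multiple_writes(single))
print(spans_multiple_writes(merged))
```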