Learn how to generate a consolidation plan and use it for consolidation.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial uses a sparse array; note that consolidation plans are not applicable to dense arrays. Before running this tutorial, it is recommended that you read the following sections:
First, import the necessary libraries, set the array URI (i.e., its path, which in this tutorial will be on local storage), and delete any previously created arrays with the same name.
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema with `sparse=True`.
sch = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[a])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_uri, sch)
# Prepare some data in numpy arrays
d1_data = np.array([2, 0], dtype=np.int32)
d2_data = np.array([0, 1], dtype=np.int32)
a_data = np.array([4, 1], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in numpy arrays
d1_data = np.array([1], dtype=np.int32)
d2_data = np.array([3], dtype=np.int32)
a_data = np.array([3], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data

# Prepare some data in numpy arrays
d1_data = np.array([2], dtype=np.int32)
d2_data = np.array([2], dtype=np.int32)
a_data = np.array([5], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open(array_uri, "w") as A:
    A[d1_data, d2_data] = a_data
Inspect the array folder to confirm that three fragments have been created.
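One way to inspect the array folder from Python is to list the array's `__fragments` subdirectory, where recent TileDB array format versions keep one entry per fragment. This is a minimal sketch using only the standard library; `list_fragments` is a hypothetical helper, and the `__fragments` location is an assumption about the on-disk layout that may not hold for older format versions.

```python
import os


def list_fragments(array_uri):
    """Return the sorted names of the entries under <array_uri>/__fragments."""
    frag_dir = os.path.join(array_uri, "__fragments")
    # Assumption: fragments live in this subdirectory; return nothing if absent
    if not os.path.isdir(frag_dir):
        return []
    return sorted(os.listdir(frag_dir))
```

After the three writes above, `len(list_fragments(array_uri))` would be expected to equal 3.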
Generate the consolidation plan. This produces a single set of fragments (i.e., a single “node”, representing the set of fragments that should be submitted to a single worker for consolidation).
# Get consolidation plan
A = tiledb.open(array_uri, "r")
fragment_size = 10000  # desirable fragment size in bytes
cons_plan = tiledb.ConsolidationPlan(tiledb.default_ctx(), A, fragment_size)

# Output information about the consolidation plan
print(cons_plan.dump())
print(cons_plan)
print("\n")
print("Number of nodes: ", cons_plan.num_nodes)
print("Number of fragments in node #1: ", cons_plan[0]["num_fragments"])
print("Fragments URIs in node #1", cons_plan[0]["fragment_uris"])

# Close the array
A.close()
{
  "nodes": [
    {
      "uris" : [
        {
          "uri" : "__1718414474463_1718414474463_0c1f5a89c19deaa17071686bcc64d4a8_21"
        },
        {
          "uri" : "__1718414474478_1718414474478_236806f61ff15d9ffecca1b38d920513_21"
        }
      ]
    }
  ]
}
{'fragments': {'node_0': {'fragment_uris': ['__1718414474463_1718414474463_0c1f5a89c19deaa17071686bcc64d4a8_21',
                                            '__1718414474478_1718414474478_236806f61ff15d9ffecca1b38d920513_21'],
                          'num_fragments': 2}},
 'num_nodes': 1}
Number of nodes: 1
Number of fragments in node #1: 2
Fragments URIs in node #1 ['__1718414474463_1718414474463_0c1f5a89c19deaa17071686bcc64d4a8_21', '__1718414474478_1718414474478_236806f61ff15d9ffecca1b38d920513_21']
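Because the dumped plan printed above is JSON, the fragment URIs of each node can also be pulled out with the standard `json` module, for example to hand each node to a separate worker. This is a minimal sketch, assuming `dump()` returns a string that always follows the `nodes` / `uris` / `uri` layout shown in the output above; `fragment_uris_per_node` is a hypothetical helper, and the sample string copies that output.

```python
import json


def fragment_uris_per_node(plan_json):
    """Return one list of fragment URIs per node of a dumped consolidation plan."""
    plan = json.loads(plan_json)
    return [[entry["uri"] for entry in node["uris"]] for node in plan["nodes"]]


# Sample dump copied from the tutorial output above
sample_dump = """
{
  "nodes": [
    {
      "uris": [
        {"uri": "__1718414474463_1718414474463_0c1f5a89c19deaa17071686bcc64d4a8_21"},
        {"uri": "__1718414474478_1718414474478_236806f61ff15d9ffecca1b38d920513_21"}
      ]
    }
  ]
}
"""

nodes = fragment_uris_per_node(sample_dump)
print(len(nodes))     # 1 node
print(len(nodes[0]))  # 2 fragments in that node
```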
Now run consolidation, passing the fragment URIs from the consolidation plan. Also perform vacuuming, so that the two original fragments that participated in consolidation get deleted.
# Consolidate using the fragments from the consolidation plan
tiledb.consolidate(
    array_uri,
    config=tiledb.Config({"sm.consolidation.mode": "fragments"}),
    fragment_uris=cons_plan[0]["fragment_uris"],
)

# Vacuum
tiledb.vacuum(array_uri, config=tiledb.Config({"sm.vacuum.mode": "fragments"}))
Inspecting the file hierarchy of the array again, observe that only two fragments exist: the fragment from the first write (which is intact), and the fragment generated by consolidating the fragments determined by the consolidation plan.