Data Manipulation
Inserts, updates, upserts, and deletes are the four most common data manipulation methods. TileDB supports inserts, upserts, and deletes.
Inserts
Inserts are the most basic form of data manipulation. This typically involves a SQL statement, such as an INSERT INTO
clause or a CSV load operation. TileDB has support for inserts in many APIs, depending on which API you use, such as the SQL
or dataframe
APIs. In most tabular databases, insert operations have four distinct behaviors, depending on the presence of a primary key and auto increment settings:
- When a primary key exists, throw an error if a duplicate is detected.
- When a primary key exists, don’t throw an error, and ignore the new record if a duplicate exists.
- If no primary key exists, insert records without checking for duplicates.
- If an auto increment key exists, insert the new record with incremental value for primary key.
TileDB supports cases 2, 3, and 4. TileDB currently doesn’t support erroring when a duplicate record exists. We designed TileDB for fast, efficient, and concurrent write operations, to minimize the checks required during a write, and to avoid locking. This results in fast support for inserting new records and then either returning all records or deduplicating upon the read operations.
Upserts
An upsert, also know as an on duplicate insert, is a technique to insert data into a table if it doesn’t exist and update it if it does. TileDB uses the primary key to identify existing records. As described in Indexes, TileDB supports uniqueness for both sparse arrays when duplicates are disabled and dense arrays. In these arrays, any insert will automatically behave as an upsert. TileDB avoids costly write time checks for duplicates, and instead, upon write, all new values are inserted into new fragments. Upon read queries, TileDB will automatically return the latest records for any cell for a query. This behavior allows for a high-performance, lock-free multi-reader and multi-writer, which also support time traveling and versioning. When you open and query a dataset, you can specify a timestamp range, allowing you to retrieve past values.
Sparse arrays with duplicates enabled don’t support upserts. When you enable duplicates, any new values inserted don’t logically override older values, and instead, TileDB returns all matching records to a query.
Deletes
Deletes allow for the logical and physical purging of records for the dataset. TileDB deletes are applied logically until materialized with consolidation.