Serverless SQL
TileDB Cloud has built-in support for serverless SQL. Serverless in this context means that TileDB pushes the SQL query and all compute to the TileDB Cloud service without having to spin up any specific machine or cluster explicitly, and TileDB returns only the results to the client. This minimizes data movement, and ensures data stays in the assigned region for compliance, removing database deployment hassles. This can improve performance and reduce idle compute.
TileDB Cloud supports serverless SQL in Python, R, Java, and C#. You can use TileDB Cloud to run individual queries, or devise more sophisticated distributed computing algorithms via task graphs.
Serverless SQL works as follows. TileDB Cloud receives your SQL query and executes it on a stateless worker that runs a warm MariaDB instance using the MyTile storage engine. When using TileDB-Cloud-Py v0.4.0 or later, TileDB can either return results directly to you or write the results to a new or existing cloud storage array. Any array access happens on the same worker running the SQL query to optimize performance.
TileDB Cloud sends the results to the client in either JSON, or the Apache Arrow format when returned directly. In Python, TileDB converts the results into a pandas dataframe. This is most suitable for small results, such as aggregations or limit
queries.
Writing results to an array on cloud storage is necessary to allow processing of SQL queries with large results, without overwhelming the user (who may be on their laptop). Users can always open the created TileDB array and fetch the data they need afterwards.
Each TileDB Cloud worker running a SQL query uses 16 CPUs and has a limit of 2 GB of RAM. Thus, you must consider sharding a SQL query, so that each result fits in 2 GB of memory (refer to Task Graphs for more information).