What is TileDB?
TileDB is foundational software designed by scientists for scientific discovery. TileDB structures all data types, including data that does not easily fit into relational databases. Built on a powerful shape-shifting array database, TileDB handles the complexities of non-traditional “unstructured” multimodal data, such as genomic variants, bulk and single-cell transcriptomics, proteomics, biomedical imaging, as well as the frontier data of the future.
Used by big pharma and biotechs to power their multiomic FAIR data platforms, TileDB is the destination for scientific breakthroughs where frontier multimodal data is driving drug discovery.
Discovery in life sciences relies on frontier data
Drug discovery is very expensive and inefficient: The cost of bringing a new drug to market costs $2.9 B and takes 10-15 years, and 90% of drug candidates don’t get approved. The bar is getting higher and higher. New drugs are required to outperform existing ones, and the number of drugs brought to market for every billion dollars spent is going down by half every nine years. Lastly, we’re barely scratching the surface for discovering new drugs: FDA-approved drugs target only about 670 of the known 20,000-25,000 genes, leaving a huge opportunity for other druggable targets.
Valuable and expensive, emerging frontier data, like population genomics, bioimaging, proteomics, single-cell, and spatial transcriptomics, are powering modern research to improve discovery. By combining various omics data sets with existing literature and assays, scientists can uncover novel relationships, identify biomarkers, unravel complex disease mechanisms, and ultimately drive precision medicine approaches in drug discovery and in the clinic.
However, this multimodal, predominantly “unstructured” data (i.e., data that cannot be natively modeled as structured relational tables) is large scale, multi-dimensional, and notoriously difficult to use, and stresses the limits of general-purpose data and compute management. As a result, this data needs to be managed and computed in ways that abstract it away from scientists, who now need to lean on research informatics teams in order to glean biological insights. Collaboration between these teams often suffers and is backlogged due to data and workflow complexity, as well as due to the lack of robust tools that scale and perform for both those groups. Backlogged analyses can mean that life-changing drug and target discoveries are delayed or altogether missed. In addition, this valuable data is often siloed, and cataloged in ways that make it hard to reuse.
The status quo cannot handle frontier data
Current approaches that could potentially handle frontier data fall into the following categories:
- Databases: Examples include Snowflake, AWS Redshift, Databricks, and Google Big Query. They have the power to model and analyze data securely and extremely efficiently. However, the vast majority of databases can handle only structured (mainly tabular) data. Despite its importance, tabular data accounts for only a small fraction of the scientific data in an organization. Force-fitting non-tabular data into a tabular database results in poor performance experience and unnecessary usage complexity.
- File managers: Examples include Dropbox, Box, Google Drive, and all vanilla cloud object stores (e.g., Amazon S3). They have the ability to store any type of data in the form of binary files. However, they do not provide any context, semantics, or specialized metadata about the underlying data modalities, which makes it very difficult to search and locate data relevant to a specific scientific workflow. In addition, processing large flat binary files is usually extremely inefficient, especially as compared to rigorous database approaches.
- Data catalogs: Examples include Collibra, Alation, and Atlan. They add more meaningful information about the data they are cataloging, exposing important relationships across the different data types and making search much more effective. However, data catalogs do not have the computational power of databases. Only cataloging complex data without the ability to rapidly analyze it introduces a big architectural gap that impedes discovery.
- Scientific platforms: Examples include DNA Nexus, Velsera, Illumina Connected Analytics, and Lifebit. They are specialized for the scientific domain, offering deep understanding and functionality around the data modalities used in the Life Sciences industry (which may include some database, file management, and data catalog functionality). However, these solutions are not architected to be powerful database systems, often treating compute, performance, and scale as afterthoughts. In addition, they are typically tailored for a narrow set of data modalities, treating everything else as mere binary files. This forces organizations to adopt multiple solutions for a larger coverage of their modalities.
With an unlimited number of point solutions available, teams need to select the necessary infrastructure to build DIY support for bench scientists, data scientists, and data engineers who need to collaborate to gather, catalog, govern, and analyze all the complex data to reach the desired scientific breakthroughs. In addition to the extra licensing costs, such an approach mandates building and maintaining a very large team enforced with harmonizing the deployment and use of all these systems, and ensuring secure and compliant collaboration.
A new data platform designed for discovery
TileDB is a novel data solution that combines all the strong attributes of databases, scientific solutions, file managers, and data catalogs, overcoming their limitations. It is ideal for governing all data and code in an organization, offering a strong infrastructure for secure collaboration and analysis, while enabling users to interoperate with their favorite tools without compromising security and compliance.
Centralize and catalog: Store and search across all your data modalities from a single source of truth: from PDFs, XLS and CSV files to frontier -omics data like genomic variants, single-cell, vectors, and imaging and proteomic files, plus any modality that emerges in the future.
Collaborate securely: Eliminate silos without compromising data governance. Data sharing is easy and safe with logging and data sharing across all members in the organization, ensuring FAIR data practices and compliance with HIPAA and SOC 2 regulations.
- Connect scientists, data scientists, and data engineers in one place.
- Collaborate on federated data without ever moving it.
- Take advantage of advanced access control, logging, holistic catalogs, and governance.
- Build confidently on a platform that adheres to FAIR principles and is both SOC 2 Type 2- and HIPAA-compliant.
Analyze with ease: Interact with all your data with flexible ways to discover insights. Using TileDB gives you:
- Programmatic access for data wrangling.
- Notebooks for data science.
- Flexible dashboards for visualization.
- Interoperability with popular analysis tools.
- LLM integrations and copilots.
- Extensibility to add more tools.
Scale economically: Control costs through a highly performant, cloud-optimized array storage format and compute engine, which utilize efficient compression and indexing techniques.
- Super efficient array engine - TileDB achieves unprecedented performance for any modality via its cloud-optimized array storage format and compute engine, which utilize efficient compression and indexing techniques.
- Powerful distributed compute - TileDB harnesses the distributed power of the cloud to achieve biobank- and atlas-scale, using parallel algorithms and “task graphs”, implementing complex workflows in a serverless fashion at the lowest cost.
Structure the “unstructured”: Remove “unstructured” from your vocabulary. Underlying arrays shape-shift to structure all data, regardless of modality.
- TileDB is a single platform that can support all data and metadata through the entire data life cycle – one that stores raw data in a unified common data structure, as well as intermediate data products.
- TileDB uses multi-dimensional arrays, which is a fundamental data structure that shape-shifts to perfectly capture any diverse data type and workload and is optimized for cloud object stores.
With TileDB, teams comprising scientists, data scientists, data engineers, and ML engineers can:
- Discover insights faster via the TileDB catalog discoverability and collaboration across all data.
- Simplify the data infrastructure, consolidating functionalities under a secure data system.
- Analyze multimodal -omics data with unprecedented speed and scale.
- Deploy generative AI and ML tooling like vector search to accelerate insights.
- Collaborate securely without governance risks, or versioning and traceability challenges.
History, vision, and why “TileDB”
The original TileDB core engine was developed at Intel Labs and MIT between 2014 and 2017. TileDB, Inc. was formed in 2017 to further develop and commercialize the product. TileDB is backed by high-profile VC investors, and operated by a diverse and brilliant team (see our About page for more information). It is headquartered in Cambridge, MA, USA, maintains physical offices in Cambridge and NYC, has a subsidiary in Greece, and numerous people work remotely at TileDB from all over the world.
Our vision is to help organizations discover breakthroughs through the effective use of all their scientific data, no matter how diverse and complex. TileDB’s technology is built around a flexible, shape-shifting data model (multi-dimensional arrays), which allows it to efficiently capture all data modalities. As such, it is applicable to challenging data problems in a broad spectrum of application domains. However, currently, the company’s focus is in Life Sciences, where the data modalities are truly complex and the scale of the frontier data (such as multi-omics) massive. TileDB is an ideal solution to unlock all the discovery potential in this important for the world application domain!
The name “TileDB” consists of two parts:
- Tile: This is the atomic unit of storage for TileDB’s array format. It is a group of multi-dimensional values that can be perceived as a hyper-rectangle, looking like a tile in the 2D plane. Tiles and the internal index structures built around them are what give TileDB superpowers in complex data management and analysis.
- DB: This stands for “database”, which in our context is the system that stores, manages, and processes data. We chose to call TileDB a database because of the rigor and power that databases bring to the data world (e.g., as opposed to file managers or data catalogs). TileDB indeed combines the long-standing, battle-tested benefits of databases, but evolves them in that it targets at handling multiple (if not all) data and code modalities, way beyond the traditional tables.
What’s next?
If TileDB is a great product for you, we recommend continuing your journey as follows:
- The Get Started section will bring you up to speed with the TileDB product and prepare you to become a power user.
- The Array Model will provide all the theoretical background behind multi-dimensional arrays and why they are first-class citizens in TileDB.
- Contact us in case you have further questions.
Welcome to the TileDB family, enjoy!