Azure Blob Storage

Learn how to integrate Azure Blob Storage with TileDB.

After you configure TileDB to work with Azure, your TileDB programs work without any API changes. The only difference is that, instead of local file system paths, the URIs that reference your files (for example, arrays, groups, and VFS files) must start with azure://. For instance, to create (and subsequently write to and read from) an array on Azure, use a URI of the form azure://<storage-container>/<your-array-name> as the array name.
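As a minimal sketch of the URI format described above, the helper below assembles an azure:// URI from a container name and path segments. The function name azure_uri and the container and array names are hypothetical and chosen only for illustration; the commented lines assume TileDB-Py is installed and Azure is already configured.

```python
def azure_uri(container: str, *parts: str) -> str:
    """Build a URI of the form azure://<container>/<path>."""
    if not container or "/" in container:
        raise ValueError("container must be a single non-empty name")
    # Drop empty segments and stray slashes so the path has no '//' runs.
    segments = [p.strip("/") for p in parts if p.strip("/")]
    return "azure://" + "/".join([container, *segments])


array_uri = azure_uri("my-container", "my-array")
print(array_uri)  # azure://my-container/my-array

# With TileDB-Py installed and Azure configured, the URI is used exactly
# where a local path would be, for example:
#   import tiledb
#   with tiledb.open(array_uri) as A:
#       data = A[:]
```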

Warning

TileDB does not support storage accounts with hierarchical namespaces.

Setup

  1. Sign into the Azure portal, creating a new account if necessary.

  2. On the Azure portal, select the Storage accounts service.

  3. Select the +Add button to navigate to the Create a storage account form.

  4. Complete the form and create the storage account. You may use a Standard or Premium Block Blob account type.

    [Figure: The Create a storage account form]

  5. In your application, set the vfs.azure.storage_account_name config option or the AZURE_STORAGE_ACCOUNT environment variable to the name of your storage account.

Alternatively, you can directly set the endpoint you use to connect to Azure.
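The last setup step could look like the following sketch. The helper azure_account_config and the account name mystorageaccount are hypothetical; only the vfs.azure.storage_account_name option and the AZURE_STORAGE_ACCOUNT variable come from the text above. The TileDB-Py part runs only if the tiledb package is installed.

```python
import os


def azure_account_config(account_name=None, env=None):
    """Return a config dict that names the storage account.

    Falls back to AZURE_STORAGE_ACCOUNT when no name is passed in.
    """
    env = os.environ if env is None else env
    name = account_name or env.get("AZURE_STORAGE_ACCOUNT", "")
    if not name:
        raise ValueError("no storage account name provided")
    return {"vfs.azure.storage_account_name": name}


cfg = azure_account_config("mystorageaccount")  # placeholder account name
print(cfg)

try:
    import tiledb

    # Pass the options to a context; use this ctx when opening azure:// URIs.
    ctx = tiledb.Ctx(tiledb.Config(cfg))
except ImportError:
    ctx = None  # TileDB-Py is not installed; the dict above is still valid.
```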

Authenticating to Azure

TileDB supports authenticating to Azure through Microsoft Entra ID, access keys, and shared access signature tokens.

Microsoft Entra ID

Microsoft Entra ID is the recommended way to authenticate to Azure and provides superior security and fine-grained access compared to shared keys. It is enabled by default, and you do not need to specifically configure TileDB to use it. Credentials are obtained automatically from the following sources in order:

  • Environment variables.
  • The Azure CLI.
  • Managed identities for Azure compute resources. Only system-assigned managed identities are currently supported.
  • Workload identities for Kubernetes.

When the Azure backend is initialized, it attempts to obtain credentials from the above sources. If no credentials can be obtained, TileDB falls back to anonymous authentication.

Manually selecting which authentication method to use is not currently supported.

Microsoft Entra ID will not be used if any of the following conditions apply:

  • The vfs.azure.storage_account_key or vfs.azure.storage_sas_token configuration options are specified.
  • The AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN environment variables are specified.
  • A custom endpoint is specified that is not using HTTPS.

Warning

TileDB does not currently support the following features when connecting to Azure with Microsoft Entra ID:

  • Selecting a specific credentials source without trying to authenticate with the others.
  • Authenticating with a service principal specified in config options instead of environment variables.
  • Authenticating with a user-assigned managed identity.

Note

Make sure to assign the right roles to the identity to use with TileDB. The general Reader and Contributor roles do not provide access to data inside the storage accounts. You need to assign the Storage Blob Data Reader or the Storage Blob Data Contributor roles in order to read or write data, respectively.
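The conditions under which Microsoft Entra ID is skipped, listed earlier in this section, can be expressed as a small predicate. This is only a sketch that mirrors the documented rules; it is not TileDB's internal logic. The function name entra_id_enabled is hypothetical, and the check assumes that a custom endpoint is supplied through the vfs.azure.blob_endpoint option.

```python
import os
from urllib.parse import urlsplit


def entra_id_enabled(config, env=None):
    """Return True if none of the documented opt-out conditions apply."""
    env = os.environ if env is None else env

    # Condition 1: a shared key or SAS token is set in the configuration.
    if config.get("vfs.azure.storage_account_key") or config.get(
        "vfs.azure.storage_sas_token"
    ):
        return False

    # Condition 2: a shared key or SAS token is set in the environment.
    if env.get("AZURE_STORAGE_KEY") or env.get("AZURE_STORAGE_SAS_TOKEN"):
        return False

    # Condition 3: a custom endpoint is set and it does not use HTTPS.
    endpoint = config.get("vfs.azure.blob_endpoint", "")
    if endpoint and urlsplit(endpoint).scheme.lower() != "https":
        return False

    return True


print(entra_id_enabled({"vfs.azure.storage_account_name": "foo"}, env={}))
print(entra_id_enabled({"vfs.azure.storage_sas_token": "sv=...&sig=..."}, env={}))
```

The first call prints True and the second prints False, because a SAS token in the configuration takes precedence over Microsoft Entra ID.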

Shared key

Warning

Authentication with shared keys is considered insecure. We recommend using Microsoft Entra ID instead.

  1. Once your storage account has been created, navigate to its landing page. From the left menu, select the Access keys option. Copy the Storage account name and one of the auto-generated Keys.

    [Figure: The Access keys page]

  2. Set the following parameters in a configuration object (see the Configuration section) or as environment variables. Use the storage account name and key from the previous step.

Parameter                          Environment variable     Default value
"vfs.azure.storage_account_name"   AZURE_STORAGE_ACCOUNT    ""
"vfs.azure.storage_account_key"    AZURE_STORAGE_KEY        ""
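A minimal sketch of shared-key configuration follows. The helpers shared_key_config and redacted are hypothetical, and the account name and key are placeholders; only the two parameter names come from the table above. Masking the key before logging is a general precaution, not a TileDB requirement.

```python
def shared_key_config(account_name: str, account_key: str) -> dict:
    """Return the two options needed for shared-key authentication."""
    if not account_name or not account_key:
        raise ValueError("both the account name and the account key are required")
    return {
        "vfs.azure.storage_account_name": account_name,
        "vfs.azure.storage_account_key": account_key,
    }


def redacted(config: dict) -> dict:
    """Return a copy that is safe to log: non-empty key values are masked."""
    return {
        k: ("***" if k.endswith("_key") and v else v) for k, v in config.items()
    }


cfg = shared_key_config("mystorageaccount", "PLACEHOLDER-KEY")
print(redacted(cfg))

# With TileDB-Py installed, the dict can be passed to a context:
#   import tiledb
#   ctx = tiledb.Ctx(tiledb.Config(cfg))
```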

Shared access signature

  1. Navigate to the new storage account landing page. From the left menu, select the Shared Access Signature option.

  2. Use all checked defaults, and select Allowed resource types → Container.

  3. Set an appropriate expiration date (note: SAS tokens cannot be revoked).

  4. Select Generate SAS and connection string.

  5. Copy the SAS Token (second entry) and use it in the TileDB config or environment variable.

    [Figure: The Shared access signature page]

You can configure the following parameters.

Parameter                       Environment variable      Default value
"vfs.azure.storage_sas_token"   AZURE_STORAGE_SAS_TOKEN   ""
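Because a SAS token carries its own expiration date, it can help to check that date before handing the token to TileDB. The sketch below assumes the token is a standard Azure SAS query string whose se field holds the expiry time in UTC. The helpers sas_expiry and sas_config are hypothetical, the sample token is fake, and the token is passed through to the config unchanged.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(token: str):
    """Return the token's expiry (the 'se' field) as an aware datetime, or None."""
    values = parse_qs(token.lstrip("?")).get("se")
    if not values:
        return None
    # fromisoformat() in older Python versions does not accept a trailing 'Z'.
    expiry = datetime.fromisoformat(values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry


def sas_config(token: str, now=None) -> dict:
    """Return the SAS option for TileDB, refusing tokens that already expired."""
    now = now or datetime.now(timezone.utc)
    expiry = sas_expiry(token)
    if expiry is not None and expiry <= now:
        raise ValueError(f"SAS token expired at {expiry.isoformat()}")
    return {"vfs.azure.storage_sas_token": token}


# A fake token, for illustration only.
token = "sv=2022-11-02&ss=b&srt=c&sp=rwl&se=2030-01-01T00:00:00Z&sig=FAKE"
print(sas_expiry(token))
```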

Physical organization

So far, you have learned that TileDB stores arrays and groups as directories. Like other object stores, Azure Blob Storage has no concept of a directory. However, Azure allows the / character in object URIs, which permits the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores every file it would create locally as an object on Azure. For instance, for the array azure://container/path/to/array, TileDB creates the array schema object azure://container/path/to/array/schema/__<timestamp>_<timestamp>_<uuid>, along with other objects. Because Blob Storage has no concept of a directory, nothing distinctive persists on Azure for directories (for example, azure://container/path/to/array/meta/ doesn’t exist as an object).
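The idea that directories exist only as shared name prefixes can be illustrated with a flat list of object names. This sketch is a pure in-memory emulation and does not call Azure or TileDB; the function list_children and the object names are made up for illustration.

```python
def list_children(object_names, prefix):
    """Emulate a directory listing over a flat object namespace.

    Names that share the prefix are grouped by their next path segment;
    a trailing '/' marks a segment that behaves like a subdirectory.
    """
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    children = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue
        head, sep, _ = rest.partition("/")
        children.add(head + sep)
    return sorted(children)


# Only objects are stored; no entry exists for any "directory".
objects = [
    "path/to/array/schema/__1_1_aaaa",
    "path/to/array/meta/__2_2_bbbb",
    "path/to/other/file.bin",
]

print(list_children(objects, "path/to/array"))  # ['meta/', 'schema/']
print("path/to/array/meta/" in objects)         # False
```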

Performance

TileDB writes the various fragment files as append-only objects using the block-list upload API of the Azure SDK for C++. In addition to enabling appends, this API renders the TileDB writes to Azure particularly amenable to optimizations via parallelization. Since TileDB updates arrays only by writing (appending to) new files (i.e., it never updates a file in-place), TileDB does not need to download entire objects, update them, and re-upload them to Azure. This leads to excellent write performance.
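The block-list pattern can be sketched with an in-memory stand-in: blocks are staged independently (and therefore in parallel), and a final commit fixes their order. The class FakeBlockBlob and the function upload_in_blocks are hypothetical; they imitate the stage-then-commit flow only and are not the Azure SDK or TileDB's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeBlockBlob:
    """In-memory stand-in for a blob that supports block-list uploads."""

    def __init__(self):
        self._staged = {}
        self.committed = b""

    def stage_block(self, block_id, data):
        # Staged blocks are not visible until they are committed.
        self._staged[block_id] = data

    def commit_block_list(self, block_ids):
        # The commit defines the final order, whatever the staging order was.
        self.committed = b"".join(self._staged[b] for b in block_ids)
        self._staged.clear()


def upload_in_blocks(blob, payload, block_size, max_workers=4):
    """Split payload into blocks, stage them concurrently, then commit."""
    chunks = [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
    ids = [f"{i:08d}" for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(blob.stage_block, ids, chunks))
    blob.commit_block_list(ids)
    return ids


blob = FakeBlockBlob()
payload = bytes(range(256)) * 10
ids = upload_in_blocks(blob, payload, block_size=100)
print(len(ids), blob.committed == payload)
```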

TileDB reads use the range GET blob request API of the Azure SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in extremely fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of I/O. The range GET API enables reading each tile from Azure in a single request. Finally, TileDB performs all reads in parallel using multiple threads; the number of threads is a tunable configuration parameter.
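The read path can be sketched in the same spirit: each tile is a contiguous byte range, so one ranged request per tile is enough, and the requests can run on several threads. Here a bytes object stands in for the remote blob. The functions range_get and read_tiles and the 4-byte tile size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def range_get(blob: bytes, offset: int, length: int) -> bytes:
    """Stand-in for a ranged GET: return only bytes [offset, offset + length)."""
    return blob[offset:offset + length]


def read_tiles(blob, tile_ranges, threads=4):
    """Fetch each (offset, length) tile with its own request, in parallel."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda r: range_get(blob, *r), tile_ranges))


# Six 4-byte tiles stored back to back; tile t is filled with the byte value t.
blob = b"".join(bytes([t]) * 4 for t in range(6))

# A subarray that touches only tiles 1 and 4 needs just two ranged reads.
tiles = read_tiles(blob, [(4, 4), (16, 4)])
print(tiles)  # [b'\x01\x01\x01\x01', b'\x04\x04\x04\x04']
```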

Advanced

By default, the blob endpoint will be set to https://foo.blob.core.windows.net, where foo is the storage account name, as set by the vfs.azure.storage_account_name config option, or the AZURE_STORAGE_ACCOUNT environment variable. You can use the vfs.azure.blob_endpoint config parameter to override the default blob endpoint.

Parameter                   Default value
"vfs.azure.blob_endpoint"   ""

Note

If the custom endpoint contains a SAS token, the vfs.azure.storage_sas_token option must not be specified.
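The endpoint rules above can be summarized in a short sketch. The function resolve_blob_endpoint is hypothetical and only mirrors the documented behavior. Two details are assumptions: the config option is checked before the environment variable, and a SAS token embedded in the endpoint is detected by the presence of a sig query field.

```python
import os
from urllib.parse import parse_qs, urlsplit


def resolve_blob_endpoint(config, env=None):
    """Return the blob endpoint implied by the given options."""
    env = os.environ if env is None else env

    endpoint = config.get("vfs.azure.blob_endpoint", "")
    if endpoint:
        # Assumption: a 'sig' query field means the endpoint embeds a SAS token.
        has_sas = "sig" in parse_qs(urlsplit(endpoint).query)
        if has_sas and config.get("vfs.azure.storage_sas_token"):
            raise ValueError(
                "endpoint already contains a SAS token; "
                "do not also set vfs.azure.storage_sas_token"
            )
        return endpoint

    # Default: derive the endpoint from the storage account name.
    account = config.get("vfs.azure.storage_account_name") or env.get(
        "AZURE_STORAGE_ACCOUNT", ""
    )
    if not account:
        raise ValueError("no endpoint and no storage account name configured")
    return f"https://{account}.blob.core.windows.net"


print(resolve_blob_endpoint({"vfs.azure.storage_account_name": "foo"}, env={}))
# https://foo.blob.core.windows.net
```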
