Amazon S3

Learn how to integrate Amazon S3 with TileDB.

This section covers all you need to know about how to configure and use TileDB on Amazon S3.

Quickstart

Assuming you have installed TileDB and you are already set up on AWS, the quickest way to get started with TileDB on Amazon S3 is by going through the Tutorials: Basic S3 Example with Arrays section.

After configuring TileDB to work with Amazon S3 (more details on that below), your TileDB programs will work without any API change. Instead of using local file system paths to reference files (e.g., arrays, groups, VFS files), you must format your URIs to start with s3://. For instance, if you wish to create (and subsequently write/read) an array on Amazon S3, use the URI s3://<your-bucket>/<your-array-name> as the array name.
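
As a minimal sketch of the point above, the only thing that changes between local storage and S3 is the URI string. The helper names `array_uri`, `is_s3_uri`, and `read_all`, as well as the bucket and array names, are hypothetical; `read_all` assumes the TileDB-Py package is installed and that S3 credentials are already configured.

```python
from typing import Optional


def array_uri(name: str, bucket: Optional[str] = None) -> str:
    """Return a local path, or an s3:// URI when a bucket is given."""
    if bucket is None:
        return name
    return f"s3://{bucket}/{name}"


def is_s3_uri(uri: str) -> bool:
    """True when the URI points at S3 rather than the local file system."""
    return uri.startswith("s3://")


def read_all(uri: str):
    """Read a whole array; the call is identical for local and S3 URIs."""
    import tiledb  # imported lazily so the helpers above work without TileDB

    with tiledb.open(uri, mode="r") as array:
        return array[:]


local_uri = array_uri("my_array")
s3_uri = array_uri("my_array", bucket="my-bucket")
print(local_uri, is_s3_uri(local_uri))
print(s3_uri, is_s3_uri(s3_uri))
```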

AWS setup

First, you need to set up an AWS account and generate access keys.

  1. Create a new AWS account.

  2. Visit the AWS console and sign in.

  3. On the AWS console, select the Services drop-down menu and select Storage -> S3. You can create S3 buckets there.

  4. On the AWS console, select the Services drop-down menu and select Security, Identity & Compliance -> IAM.

  5. Select Users from the left-hand side menu, and then select the Add User button. Provide the email or username of the user you wish to add, select the Programmatic Access checkbox and select Next: Permissions.

  6. Select the Attach existing policies directly button, search for the S3-related policies and add the policy of your choice (e.g., full-access, read-only, etc.). Select Next and then Create User. Using TileDB with an existing bucket requires at least the following S3 permissions:

    s3:ListBucket
    s3:GetObject
    s3:PutObject
    s3:ListBucketMultipartUploads
    s3:AbortMultipartUpload
    s3:ListMultipartUploadParts
    s3:DeleteObject
  7. Upon successful creation, the next page will show the user along with two keys: Access key ID and Secret access key. Write down both these keys.
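
The minimum permissions listed in step 6 can be expressed as an IAM policy document. The sketch below builds one in Python; the function name `minimal_tiledb_policy` and the bucket name are hypothetical, and the split into bucket-level and object-level statements is an assumption about how you may want to scope the resources.

```python
import json

# Actions that apply to the bucket itself.
BUCKET_ACTIONS = ["s3:ListBucket", "s3:ListBucketMultipartUploads"]

# Actions that apply to objects inside the bucket.
OBJECT_ACTIONS = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:AbortMultipartUpload",
    "s3:ListMultipartUploadParts",
]


def minimal_tiledb_policy(bucket: str) -> dict:
    """Build a policy granting the permissions listed above on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": BUCKET_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": OBJECT_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = minimal_tiledb_policy("my-bucket")
print(json.dumps(policy, indent=2))
```

You could paste the printed JSON into the IAM console as a custom policy instead of attaching a broader managed policy.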

AWS security credentials

AWS supports many ways to access its resources. You can either request access with long-term access credentials (for example, Access Keys) or temporary ones by using the AWS Security Token Service.

Access keys

Access keys are long-term credentials for an IAM user or the AWS account root user. To access AWS resources this way, provide the keys to TileDB in one of the following two ways.

Export these keys to your environment from a console:

  • Linux/macOS

    export AWS_ACCESS_KEY_ID=<your-access-key-id>
    export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

  • Windows (PowerShell)

    $env:AWS_ACCESS_KEY_ID = "<your-access-key-id>"
    $env:AWS_SECRET_ACCESS_KEY = "<your-secret-access-key>"

  • Windows (cmd.exe)

    set AWS_ACCESS_KEY_ID=<your-access-key-id>
    set AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

Or, set the following keys in a configuration object (see Configuration):

| Parameter | Default value |
| --- | --- |
| "vfs.s3.aws_access_key_id" | "" |
| "vfs.s3.aws_secret_access_key" | "" |
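
The following sketch shows one way to set these two parameters in a configuration object with TileDB-Py. The helper names `access_key_params` and `make_context` and the placeholder key values are hypothetical; `make_context` assumes the `tiledb` package is installed.

```python
import os
from typing import Dict, Mapping


def access_key_params(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map the AWS environment variables to TileDB configuration parameters."""
    return {
        "vfs.s3.aws_access_key_id": env.get("AWS_ACCESS_KEY_ID", ""),
        "vfs.s3.aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY", ""),
    }


def make_context(params: Mapping[str, str]):
    """Create a TileDB context from configuration parameters."""
    import tiledb  # imported lazily so the helper above works without TileDB

    return tiledb.Ctx(tiledb.Config(dict(params)))


# Placeholder values for illustration only; never hard-code real keys.
params = access_key_params(
    {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
print(sorted(params))
```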

Session tokens

TileDB (version 1.8+) supports authentication with temporary credentials that include an AWS session token. Prefer this method of acquiring temporary credentials if you want to maintain permissions solely within your organization.

| Parameter | Values |
| --- | --- |
| "vfs.s3.aws_session_token" | Session token corresponding to the configured key/secret pair |
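
As a sketch, temporary credentials can be mapped to the three TileDB parameters as follows. It assumes the credentials arrive as a mapping shaped like the `Credentials` element of an AWS STS response (`AccessKeyId`, `SecretAccessKey`, `SessionToken`); the function name `temporary_credential_params` and the example values are hypothetical.

```python
from typing import Dict, Mapping


def temporary_credential_params(credentials: Mapping[str, str]) -> Dict[str, str]:
    """Translate STS-style temporary credentials into TileDB parameters."""
    return {
        "vfs.s3.aws_access_key_id": credentials["AccessKeyId"],
        "vfs.s3.aws_secret_access_key": credentials["SecretAccessKey"],
        "vfs.s3.aws_session_token": credentials["SessionToken"],
    }


# Placeholder values; real ones come from your STS call.
example_credentials = {
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
}
token_params = temporary_credential_params(example_credentials)
print(sorted(token_params))
```

Because the session token is only valid together with the key/secret pair it was issued with, the sketch sets all three parameters at once.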

Assume role

TileDB (version 2.1+) supports authentication with temporary credentials from the AWS AssumeRole API. If you prefer to maintain permissions within AWS, this method’s base permissions for the temporary credentials will be derived from the policy on a role. You can use them to access AWS resources to which you might not normally have access. In this case, you will need to configure the following parameters:

| Parameter | Values |
| --- | --- |
| "vfs.s3.aws_role_arn" | Required. The Amazon Resource Name (ARN) of the role to assume |
| "vfs.s3.aws_session_name" | Optional. An identifier for the assumed role session |
| "vfs.s3.aws_external_id" | Optional. A unique identifier that might be required when you assume a role in another account |
| "vfs.s3.aws_load_freq" | Optional. The duration, in minutes, of the role session |

Note

With this method, the IAM user credentials used by your proxy server require only the ability to call sts:AssumeRole. You must also create a new role and attach a trust policy to it, which isn’t the case with the Session Tokens approach.
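
A minimal sketch of assembling the assume-role parameters, enforcing that the role ARN is present. The function name `assume_role_params`, the example ARN, and the session name are hypothetical.

```python
from typing import Dict, Optional


def assume_role_params(
    role_arn: str,
    session_name: Optional[str] = None,
    external_id: Optional[str] = None,
) -> Dict[str, str]:
    """Collect the assume-role parameters, skipping unset optional ones."""
    if not role_arn:
        raise ValueError('"vfs.s3.aws_role_arn" is required')
    params = {"vfs.s3.aws_role_arn": role_arn}
    if session_name:
        params["vfs.s3.aws_session_name"] = session_name
    if external_id:
        params["vfs.s3.aws_external_id"] = external_id
    return params


role_params = assume_role_params(
    "arn:aws:iam::123456789012:role/example-tiledb-role",  # hypothetical ARN
    session_name="example-session",
)
print(sorted(role_params))
```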

Now you are ready to start writing TileDB programs! When creating a TileDB context or a VFS object, you need to set up a configuration object with the following parameters for Amazon S3.

| Parameter | Default value |
| --- | --- |
| "vfs.s3.scheme" | "https" |
| "vfs.s3.region" | "us-east-1" |
| "vfs.s3.endpoint_override" | "" |
| "vfs.s3.use_virtual_addressing" | "true" |

Note

TileDB currently uses the default values shown above. However, you should always check whether these defaults are the desired ones for your application.
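
The sketch below starts from the defaults in the table and overrides only what differs for your application, then passes the result to a TileDB context. The names `S3_DEFAULTS`, `s3_params`, and `make_context` and the region value are hypothetical; `make_context` assumes the `tiledb` package is installed.

```python
from typing import Dict

# Defaults copied from the table above.
S3_DEFAULTS = {
    "vfs.s3.scheme": "https",
    "vfs.s3.region": "us-east-1",
    "vfs.s3.endpoint_override": "",
    "vfs.s3.use_virtual_addressing": "true",
}


def s3_params(**overrides) -> Dict[str, str]:
    """Return the defaults with overrides applied, e.g. region="eu-west-1"."""
    params = dict(S3_DEFAULTS)
    for name, value in overrides.items():
        key = f"vfs.s3.{name}"
        if key not in S3_DEFAULTS:
            raise KeyError(f"unknown parameter: {key}")
        # Configuration values are strings; booleans become "true"/"false".
        params[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return params


def make_context(params: Dict[str, str]):
    """Create a TileDB context from the parameters."""
    import tiledb  # imported lazily so s3_params works without TileDB

    return tiledb.Ctx(tiledb.Config(params))


custom = s3_params(region="eu-west-1", use_virtual_addressing=False)
print(custom)
```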

Physical organization

So far, you learned that TileDB stores arrays and groups as directories. Like other object stores, S3 has no concept of a directory. However, S3 object URIs can contain the character /, which allows the same conceptual organization as a directory hierarchy in local storage. At a physical level, TileDB stores on S3, as objects, all the files it would create locally. For instance, for array s3://bucket/path/to/array, TileDB creates the array schema object s3://bucket/path/to/array/schema/__<timestamp>_<timestamp>_<uuid>, along with other objects. Because directories do not exist on S3, nothing persists for them (for example, s3://bucket/path/to/array/meta/ doesn’t exist as an object).
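
To illustrate, the sketch below derives the "directories" that a set of object keys implies. The function name `implied_directories` and the object names are hypothetical placeholders, not real TileDB file names.

```python
from typing import Iterable, Set


def implied_directories(keys: Iterable[str]) -> Set[str]:
    """Return every prefix ending in / that the object keys imply."""
    directories = set()
    for key in keys:
        parts = key.split("/")[:-1]  # drop the object name itself
        for depth in range(1, len(parts) + 1):
            directories.add("/".join(parts[:depth]) + "/")
    return directories


# Hypothetical object keys under s3://bucket/
keys = [
    "path/to/array/schema/example_schema_object",
    "path/to/array/meta/example_metadata_object",
]
directories = implied_directories(keys)
print(sorted(directories))
```

None of the derived prefixes appears in `keys`: the hierarchy exists only in the object names, which is why tools can rebuild it locally from the / separators.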

With the AWS CLI, you can sync (that is, download) the S3 objects having a common URI prefix to local storage, organizing them into a directory hierarchy based on the use of / in the object URIs. You can clone TileDB arrays or entire groups locally from S3 by using the aws s3 sync command. For instance, given an array my_array you created and wrote on an S3 bucket my_bucket, you can clone it locally to an array my_local_array with the following command from your console:

aws s3 sync s3://my_bucket/my_array my_local_array

After downloading an array locally, your TileDB program will function properly by changing the array name from s3://my_bucket/my_array to my_local_array, without any other modification.

Performance

TileDB writes the various fragment files as append-only objects using the multi-part upload API of the AWS C++ SDK. In addition to enabling appends, this API renders the TileDB writes to S3 particularly amenable to optimizations via parallelization. Since TileDB updates arrays only by writing (appending to) new files (i.e., it never updates a file in-place), TileDB does not need to download entire objects, update them, and re-upload them to S3. This leads to excellent write performance.
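
The sketch below illustrates why multi-part uploads lend themselves to parallelization: an object is split into independent byte ranges that can be uploaded separately. This is an illustration under stated assumptions, not TileDB's actual implementation. The function name `plan_parts` is hypothetical; the 5 MiB minimum part size (for every part except the last) and the 10,000-part maximum are the limits S3 imposes on multi-part uploads.

```python
from typing import List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000  # S3 maximum number of parts per upload


def plan_parts(total_size: int, part_size: int = MIN_PART_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover total_size bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    parts = [
        (offset, min(part_size, total_size - offset))
        for offset in range(0, total_size, part_size)
    ]
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; use a larger part_size")
    return parts


# A 12 MiB object becomes two full parts and one shorter final part.
parts = plan_parts(12 * 1024 * 1024)
for offset, length in parts:
    print(offset, length)
```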

TileDB reads utilize the range GET request API of the AWS SDK, which retrieves only the requested (contiguous) bytes from a file/object, rather than downloading the entire file from the cloud. This results in extremely fast subarray reads, especially because of the array tiling. Recall that a tile (which groups cell values that are stored contiguously in the file) is the atomic unit of I/O. The range GET API enables reading each tile from S3 in a single request. Finally, TileDB performs all reads in parallel using multiple threads, which is a tunable configuration parameter.
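
As a rough sketch of the read path described above, the code below builds the HTTP Range header for each tile and fetches the tiles in parallel threads. An in-memory byte string stands in for the S3 object, so no network access is involved; the names `range_header` and `fetch_range` and the tile offsets are hypothetical, and this is not TileDB's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def range_header(offset: int, size: int) -> str:
    """Build an HTTP Range header value; the end byte is inclusive."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    return f"bytes={offset}-{offset + size - 1}"


def fetch_range(blob: bytes, offset: int, size: int) -> bytes:
    """Stand-in for a range GET: return only the requested bytes."""
    return blob[offset:offset + size]


blob = bytes(range(256))  # pretend this is one object on S3
tiles = [(0, 64), (64, 64), (192, 64)]  # (offset, size) of each tile

headers = [range_header(offset, size) for offset, size in tiles]

# One request per tile, issued from a pool of worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda tile: fetch_range(blob, *tile), tiles))

print(headers)
print([len(chunk) for chunk in results])
```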

Advanced

This section covers more advanced settings for proxy servers, logging, and certificate paths.

Proxy server settings

The AWS backend supports several settings for proxy servers:

| Parameter | Default value | Description |
| --- | --- | --- |
| "vfs.s3.proxy_host" | "" | The S3 proxy host. |
| "vfs.s3.proxy_port" | "0" | The S3 proxy port. |
| "vfs.s3.proxy_scheme" | "https" | The S3 proxy scheme. |
| "vfs.s3.proxy_username" | "" | The S3 proxy username. |
| "vfs.s3.proxy_password" | "" | The S3 proxy password. |

Note

Most proxy setups require overriding "vfs.s3.proxy_scheme" to "http". TileDB 2.0.8 and later use the default setting shown above, which will be updated in the next release.
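
A minimal sketch that derives the five proxy parameters from a single proxy URL using the Python standard library. The function name `proxy_params` and the example URL are hypothetical.

```python
from typing import Dict
from urllib.parse import urlsplit


def proxy_params(proxy_url: str) -> Dict[str, str]:
    """Split a URL such as http://user:pass@host:port into proxy parameters."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError("expected a full URL such as http://host:port")
    return {
        "vfs.s3.proxy_scheme": parts.scheme,
        "vfs.s3.proxy_host": parts.hostname,
        "vfs.s3.proxy_port": str(parts.port or 0),
        "vfs.s3.proxy_username": parts.username or "",
        "vfs.s3.proxy_password": parts.password or "",
    }


proxy = proxy_params("http://alice:example-password@proxy.example.com:3128")
print(proxy)
```

Parsing the scheme from the URL makes the "http" override explicit rather than relying on the default.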

Logging

TileDB uses the AWS C++ SDK and cURL for access to S3. The AWS SDK logging level is a process-global setting. The configuration of the most recently constructed context will set the process state. Log files are written to the process working directory.

| Parameter | Values |
| --- | --- |
| "vfs.s3.logging_level" | "[OFF], TRACE, DEBUG" |

Certificate paths

Linux has no universal location for SSL/TLS certificates. While TileDB searches for the default CA store on several major distributions, other systems or custom certificates may require you to specify the path to the SSL/TLS certificate file or directory that cURL uses for S3 HTTPS encryption. These parameters follow cURL conventions: https://curl.haxx.se/docs/manpage.html

| Parameter | Values |
| --- | --- |
| "vfs.s3.ca_file" | File path |
| "vfs.s3.ca_path" | Directory path |
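
The sketch below picks the right parameter depending on whether a given location is a certificate file or a directory of certificates. The function name `ca_params` is hypothetical, and the temporary file is a placeholder rather than a real certificate.

```python
import os
import tempfile
from typing import Dict


def ca_params(location: str) -> Dict[str, str]:
    """Use ca_path for a directory and ca_file for a single file."""
    if os.path.isdir(location):
        return {"vfs.s3.ca_path": location}
    if os.path.isfile(location):
        return {"vfs.s3.ca_file": location}
    raise FileNotFoundError(location)


# Demonstrate both cases with a throwaway directory and file.
with tempfile.TemporaryDirectory() as ca_dir:
    ca_file = os.path.join(ca_dir, "custom-ca.pem")
    with open(ca_file, "w") as handle:
        handle.write("placeholder, not a real certificate\n")
    dir_params = ca_params(ca_dir)
    file_params = ca_params(ca_file)

print(sorted(dir_params), sorted(file_params))
```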

For debugging purposes only, it is possible to disable SSL/TLS certificate verification:

| Parameter | Values |
| --- | --- |
| "vfs.s3.verify_ssl" | [false], true |