TileDB’s virtual filesystem (VFS) abstracts all I/O operations to storage backends behind a unified interface, supporting powerful file and directory management.
TileDB is designed such that all I/O to and from the storage backends is abstracted behind a virtual filesystem (VFS) module. The VFS module supports basic operations, such as creating a file or directory, reading from and writing to a file, and so on. With this abstraction, the TileDB team can add more storage backends in the future, effectively making the storage backend opaque to the user.
A welcome by-product of this architecture is that TileDB can expose the basic VFS functionality through its APIs. This offers a simplified interface for file I/O and directory management (unrelated to TileDB assets such as arrays) on all the storage backends that TileDB supports.
This page covers most of the TileDB VFS functionality.
Setup
First, import the necessary libraries, set the URIs (that is, the paths, which in this tutorial will be on local storage), and delete any previously created directories with the same name.
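A minimal Python sketch of this setup; the variable names (base_path, path, dir_a, file_a, and vfs) are chosen to match the later cells, and the exact local paths are assumptions:

import os
import struct

import tiledb

# Local paths used throughout this tutorial (assumed values)
base_path = os.path.expanduser("~/tiledb_vfs_py")
path = os.path.join(base_path, "tiledb_vfs.bin")
dir_a = os.path.join(base_path, "dir_a")
file_a = os.path.join(dir_a, "file_a")

# The VFS object through which all I/O goes
vfs = tiledb.VFS()

# Delete any previously created directory with the same name, then recreate it
if vfs.is_dir(base_path):
    vfs.remove_dir(base_path)
vfs.create_dir(base_path)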
When writing to and reading from files, the Python VFS API treats any file you open with the .open() method as a regular file object from the io module, so all methods and attributes of the io module work with TileDB VFS file handles.
The VFS API supports bytes only and does not convert the data for you automatically. Thus, you must open the file in binary mode and handle encoding manually. In the Python API, this only requires a byte string (a b"..." literal); for floats, use struct.pack(). In the R API, you need to serialize() all data first and then cast it to an integer type with as.integer() before passing it to tiledb_vfs_write().
# Create and open writable buffer object
with vfs.open(path, "wb") as fh:
    fh.write(struct.pack("<f", 153.0))
    fh.write(b"abcd")
fh <- tiledb_vfs_open(path, "WRITE")
# create a binary payload from a serialized R object
payload <- as.integer(serialize(list(dbl = 153, string = "abcde"), NULL))
# write it and close file
tiledb_vfs_write(fh, payload)
tiledb_vfs_close(fh)
# Write data again - this will overwrite the previous file
with vfs.open(path, "wb") as fh:
    fh.write(struct.pack("<f", 153.1))
    fh.write(b"abcd")
# Write data again - this will overwrite the previous file
# This is alternative syntax to the previous cell
tiledb_vfs_remove_file(uri = path)
tiledb_vfs_serialize(obj = list(dbl = 153, string = "abcde"), uri = path)
# Append data to existing file (this will NOT work on cloud object stores)
with vfs.open(path, "ab") as fh:
    fh.write(b"ghijkl")
# Append data to existing object (this just overwrites the file again)
obj <- tiledb_vfs_unserialize(uri = path)
obj["string"] <- paste0(obj["string"], "ghijkl")
tiledb_vfs_serialize(obj = obj, uri = path)
Open the file in read mode and decode the binary data:
# Create and open readable handle
fh = vfs.open(path, "rb")
float_struct = struct.Struct("<f")
float_data = fh.read(float_struct.size)
# Offset the starting byte
fh.seek(float_struct.size)
# Read the string data
string_data = fh.read(12)
print(float_struct.unpack(float_data)[0])
print(string_data.decode("UTF-8"))
# Don't forget to close the handle
fh = vfs.close(fh)
153.10000610351562
abcdghijkl
# Quickly print the unserialized path
# print(tiledb_vfs_unserialize(path))

# Create and open readable handle
fh <- tiledb_vfs_open(path, "READ")
# Get the file size
file_size <- tiledb_vfs_file_size(path)
# # Read the data into a vector of integers
# vec_double <- tiledb_vfs_read(fh, 0, 228)
# vec_str <- tiledb_vfs_read(fh, 208, file_size)
vec <- tiledb_vfs_read(fh, 0, file_size)
# Close the file handle
tiledb_vfs_close(fh)
print(unserialize(as.raw(vec)))
$dbl
[1] 153
$string
[1] "abcde"
Common file operations
Create an empty file, similar to the Unix touch command:
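A minimal Python sketch of this step, assuming the vfs, dir_a, and file_a names from the setup above:

# Create a subdirectory and an empty file inside it
vfs.create_dir(dir_a)
vfs.touch(file_a)

You can then retrieve file and directory sizes, with the .size() method in Python or the tiledb_vfs_file_size() and tiledb_vfs_dir_size() functions in R: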
print(f"Size of file {path}: {vfs.size(path)} bytes")
print(f"Size of file {file_a}: {vfs.size(file_a)} bytes")
# The .size() method also accepts directories
print(f"Size of dir {base_path}: {vfs.size(base_path)} bytes")
Size of file /Users/nickv/tiledb_vfs_py/tiledb_vfs.bin: 14 bytes
Size of file /Users/nickv/tiledb_vfs_py/dir_a/file_a: 0 bytes
Size of dir /Users/nickv/tiledb_vfs_py: 128 bytes
# Read file sizes with tiledb_vfs_file_size()
cat(paste0("Size of file ", path, ": ", tiledb_vfs_file_size(path), " bytes\n"))
cat(paste0("Size of file ", file_a, ": ", tiledb_vfs_file_size(file_a), " bytes\n"))
# Read directory sizes with tiledb_vfs_dir_size()
cat(paste0("Size of dir ", base_path, ": ", tiledb_vfs_dir_size(base_path), " bytes"))
Size of file /Users/nickv/tiledb_vfs_r/tiledb_vfs.bin: 448 bytes
Size of file /Users/nickv/tiledb_vfs_r/dir_a/file_a: 0 bytes
Size of dir /Users/nickv/tiledb_vfs_r: 448 bytes
List the contents of a directory, similar to the Unix ls command. This method returns a list of the files and directories inside the given directory.
# Run an ls-like command on a directory:
print("vfs.ls(base_path):\n")
for file in vfs.ls(base_path):
    print(f"- {file}")

# You can run this recursively:
print("\nvfs.ls(base_path, recursive=True):\n")
for file in vfs.ls(base_path, recursive=True):
    print(f"- {file}")

# Shorthand for the recursive ls:
print("\nvfs.ls_recursive(base_path):\n")
for file in vfs.ls_recursive(base_path):
    print(f"- {file}")
# Run an ls-like command on a directory
cat(paste0("Non-recursive ls:\n\n"))
for (path in tiledb_vfs_ls(base_path)) {
  cat(paste0("- ", path, "\n"))
}

# You can make it recursive
cat(paste0("\nRecursive ls:\n\n"))
print(tiledb_vfs_ls_recursive(base_path))
# Clean up: remove the directory and everything in it
if vfs.is_dir(base_path):
    vfs.remove_dir(base_path)
# Clean up: remove the directory and everything in it
if (tiledb_vfs_is_dir(base_path)) {
  tiledb_vfs_remove_dir(base_path)
}
Context and configuration
You can set a context, a configuration, or both on a VFS object. Any configuration object you pass through the config parameter overrides the corresponding VFS settings of the ctx object with the values in config.
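A minimal Python sketch of this; the specific configuration parameter shown (vfs.s3.region) and its value are only illustrative assumptions:

import tiledb

# A context that carries its own (default) VFS settings
ctx = tiledb.Ctx()

# Configuration values that should take precedence over those in ctx
cfg = tiledb.Config({"vfs.s3.region": "us-east-1"})

# The VFS object uses ctx, with cfg overriding the corresponding VFS settings
vfs = tiledb.VFS(config=cfg, ctx=ctx)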
You can also perform operations on cloud storage buckets by passing a valid bucket URI. Except for appending data to an existing file, all the previously mentioned methods work the same way on cloud object stores as they do on files in your local filesystem.
You can check to see if your cloud storage provider is supported:
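A Python sketch of this check and of the bucket creation that follows; bucket_name is assumed to hold a valid bucket URI for your provider, and S3 is used here only as an example backend:

# Check whether this build of TileDB supports the S3 backend
print(vfs.supports("s3"))

# Create the bucket only if it does not already exist
if not vfs.is_bucket(bucket_name):
    vfs.create_bucket(bucket_name)

The R version of the bucket creation: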
if (!tiledb_vfs_is_bucket(bucket_name)) {
  tiledb_vfs_create_bucket(bucket_name)
}
Warning
Take extreme care when creating or deleting buckets with the VFS APIs. After creation, a bucket may take some time to “appear” in the system, which will cause problems if you create the bucket and immediately try to write a file to it.
Wait some time before trying to write files to the bucket. You can add a polling mechanism with the .is_bucket() method in Python or the tiledb_vfs_is_bucket() function in R to verify TileDB created the bucket successfully.
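A minimal sketch of such a polling loop in Python; the retry count and sleep interval are arbitrary choices:

import time

# Poll until the newly created bucket becomes visible (give up after ~30 seconds)
for _ in range(30):
    if vfs.is_bucket(bucket_name):
        break
    time.sleep(1)
else:
    raise RuntimeError("Bucket is still not visible")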
After creating a new bucket and verifying the bucket exists, you can verify that the bucket is empty:
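In Python, a sketch of that check, assuming the bucket_name from above:

# True if the bucket exists but contains no objects
print(vfs.is_empty_bucket(bucket_name))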
You can delete a bucket from cloud storage, with the appropriate permissions.
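A minimal Python sketch of the deletion, assuming you have those permissions:

# Delete the bucket and everything stored in it
vfs.remove_bucket(bucket_name)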
Caution
Deleting a bucket is irreversible.
Warning
Deleting a bucket may not take effect immediately. Thus, it may continue to “exist” for some time. You can apply a polling mechanism to check if you deleted the bucket successfully with the .is_bucket() method in Python or the tiledb_vfs_is_bucket() function in R.