Python API

Classes:

  • Dataset provides read/write access to a TileDB-VCF dataset.
  • ReadConfig provides config settings for a TileDB-VCF dataset.
  • config_logging is used to configure TileDB-VCF logging.

Dataset

Dataset(self, uri, mode='r', cfg=None, stats=False, verbose=False, tiledb_config=None)

A class that provides read/write access to a TileDB-VCF dataset.

Parameters

Name Type Description Default
uri str URI of the dataset. required
mode str Mode of operation ('r'|'w'). 'r'
cfg ReadConfig TileDB-VCF configuration. None
stats bool Enable internal TileDB statistics. False
verbose bool Enable verbose output. False
tiledb_config dict TileDB configuration, alternative to cfg.tiledb_config. None
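
For illustration, a minimal sketch of opening a dataset for reading; the URI and configuration values are hypothetical (local paths, S3 URIs, and TileDB Cloud URIs all work):

```python
import tiledbvcf

# Hypothetical URI for an existing TileDB-VCF dataset.
uri = "s3://my-bucket/my-vcf-dataset"

# Optional read configuration (see ReadConfig below).
cfg = tiledbvcf.ReadConfig(memory_budget_mb=4096)

ds = tiledbvcf.Dataset(uri, mode="r", cfg=cfg)
```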

Methods

Name Description
attributes Return a list of queryable attributes available in the VCF dataset.
continue_read Continue an incomplete read, returning a pandas DataFrame.
continue_read_arrow Continue an incomplete read, returning a PyArrow Table.
count Count records in the dataset.
create_dataset Create a new dataset.
export Export data to multiple VCF files or a combined VCF file.
ingest_samples Ingest VCF files into the dataset.
read Read data from the dataset into a pandas DataFrame.
read_allele_count Read allele count from the dataset into a pandas DataFrame.
read_arrow Read data from the dataset into a PyArrow Table.
read_completed Returns True if the previous read operation was complete.
read_iter Iterator version of read().
read_variant_stats Read variant stats from the dataset into a pandas DataFrame.
sample_count Get the number of samples in the dataset.
samples Get the list of samples in the dataset.
schema_version Get the VCF schema version of the dataset.
tiledb_stats Get TileDB stats as a string.
version Return the TileDB-VCF version used to create the dataset.

attributes

Dataset.attributes(attr_type='all')

Return a list of queryable attributes available in the VCF dataset.

Parameters
Name Type Description Default
attr_type str The subset of attributes to retrieve: 'info' or 'fmt' retrieve only attributes ingested from the VCF INFO and FORMAT fields, respectively; 'builtin' retrieves the static attributes defined in TileDB-VCF's schema; 'all' (the default) returns all queryable attributes. 'all'
Returns
Type Description
list A list of attribute names.
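
For example, assuming ds is a Dataset opened in read mode as above:

```python
# All queryable attributes.
print(ds.attributes())

# Only attributes ingested from the VCF INFO fields.
print(ds.attributes(attr_type="info"))

# Only the static attributes defined by the TileDB-VCF schema.
print(ds.attributes(attr_type="builtin"))
```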

continue_read

Dataset.continue_read(release_buffers=True)

Continue an incomplete read.

Parameters
Name Type Description Default
release_buffers bool Release the buffers after reading. True
Returns
Type Description
pd.DataFrame The next batch of data as a pandas DataFrame.

continue_read_arrow

Dataset.continue_read_arrow(release_buffers=True)

Continue an incomplete read.

Parameters
Name Type Description Default
release_buffers bool Release the buffers after reading. True
Returns
Type Description
pa.Table The next batch of data as a PyArrow Table.

count

Dataset.count(samples=None, regions=None)

Count records in the dataset.

Parameters
Name Type Description Default
samples (str, List[str]) Sample names to include in the count. None
regions (str, List[str]) Genomic regions to include in the count. None
Returns
Type Description
int Number of intersecting records in the dataset.
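
A small example, with hypothetical sample names and region, assuming ds is open for reading:

```python
# Count records for two samples intersecting a 1 Mb region.
n = ds.count(
    samples=["HG00096", "HG00097"],
    regions=["chr1:1-1000000"],
)
print(f"{n} intersecting records")
```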

create_dataset

Dataset.create_dataset(extra_attrs=None, vcf_attrs=None, tile_capacity=10000, anchor_gap=1000, checksum_type='sha256', allow_duplicates=True, enable_allele_count=True, enable_variant_stats=True, compress_sample_dim=True, compression_level=4)

Create a new dataset.

Parameters
Name Type Description Default
extra_attrs str CSV list of extra attributes to materialize from fmt and info fields. None
vcf_attrs str URI of VCF file with all fmt and info fields to materialize in the dataset. None
tile_capacity int Tile capacity to use for the array schema. 10000
anchor_gap int Length of gaps between inserted anchor records in bases. 1000
checksum_type str Optional checksum type for the dataset, “sha256” or “md5”. 'sha256'
allow_duplicates bool Allow records with duplicate start positions to be written to the array. True
enable_allele_count bool Enable the allele count ingestion task. True
enable_variant_stats bool Enable the variant stats ingestion task. True
compress_sample_dim bool Enable compression on the sample dimension. True
compression_level int Compression level for zstd compression. 4
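
A minimal sketch of creating a dataset; the URI and the extra FORMAT attributes are hypothetical:

```python
import tiledbvcf

# Creating a dataset requires write mode.
ds = tiledbvcf.Dataset("s3://my-bucket/new-vcf-dataset", mode="w")

ds.create_dataset(
    extra_attrs="fmt_DP,fmt_GQ",  # hypothetical FORMAT fields to materialize
    tile_capacity=10000,
    checksum_type="sha256",
)
```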

export

Dataset.export(samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False, merge=False, output_format='z', output_path='', output_dir='.')

Exports data to multiple VCF files or a combined VCF file.

Parameters
Name Type Description Default
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
bed_array str URI of a BED array of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
merge bool Merge samples to create a combined VCF file. False
output_format str Export file format: 'b': compressed BCF (.bcf), 'u': uncompressed BCF (.bcf), 'z': compressed VCF (.vcf.gz), 'v': uncompressed VCF (.vcf). 'z'
output_path str Combined VCF output file. ''
output_dir str Directory used for local output of exported samples. '.'
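
For example, to export two hypothetical samples over one region to a single merged, compressed VCF file (assuming ds is open for reading):

```python
# Merge the selected samples into one combined output file.
ds.export(
    samples=["HG00096", "HG00097"],
    regions=["chr1:1-1000000"],
    merge=True,
    output_format="z",           # compressed VCF (.vcf.gz)
    output_path="cohort.vcf.gz",
)
```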

ingest_samples

Dataset.ingest_samples(sample_uris=None, threads=None, total_memory_budget_mb=None, total_memory_percentage=None, ratio_tiledb_memory=None, max_tiledb_memory_mb=None, input_record_buffer_mb=None, avg_vcf_record_size=None, ratio_task_size=None, ratio_output_flush=None, scratch_space_path=None, scratch_space_size=None, sample_batch_size=None, resume=False, contig_fragment_merging=True, contigs_to_keep_separate=None, contigs_to_allow_merging=None, contig_mode='all', thread_task_size=None, memory_budget_mb=None, record_limit=None)

Ingest VCF files into the dataset.

Parameters
Name Type Description Default
sample_uris List[str] List of sample URIs to ingest. None
threads int Set the number of threads used for ingestion. None
total_memory_budget_mb int Total memory budget for ingestion (MiB). None
total_memory_percentage float Percentage of total system memory used for ingestion (overrides ‘total_memory_budget_mb’). None
ratio_tiledb_memory float Ratio of memory budget allocated to TileDB::sm.mem.total_budget. None
max_tiledb_memory_mb int Maximum memory allocated to TileDB::sm.mem.total_budget (MiB). None
input_record_buffer_mb int Size of input record buffer for each sample file (MiB). None
avg_vcf_record_size int Average VCF record size (bytes). None
ratio_task_size float Ratio of worker task size to computed task size. None
ratio_output_flush float Ratio of output buffer capacity that triggers a flush to TileDB. None
scratch_space_path str Directory used for local storage of downloaded remote samples. None
scratch_space_size int Amount of local storage that can be used for downloading remote samples (MB). None
sample_batch_size int Number of samples per batch for ingestion (default 10). None
resume bool Whether to check for and attempt to resume a partially completed ingestion. False
contig_fragment_merging bool Whether to enable merging of contigs into fragments. This overrides the contigs_to_keep_separate and contigs_to_allow_merging options. Contig fragment merging is generally beneficial: it is a performance optimization that reduces the number of prefixes in an S3, Azure, or GCS bucket when the dataset contains a large number of small pseudo-contigs. True
contigs_to_keep_separate List[str] List of contigs that should not be merged into combined fragments. The default list includes all standard human chromosomes in both UCSC (e.g., chr1) and Ensembl (e.g., 1) formats. None
contigs_to_allow_merging List[str] List of contigs that should be allowed to be merged into combined fragments. None
contig_mode str Select which contigs are ingested: ‘all’, ‘separate’, or ‘merged’. 'all'
thread_task_size int DEPRECATED - This parameter will be removed in a future release. None
memory_budget_mb int DEPRECATED - This parameter will be removed in a future release. None
record_limit int DEPRECATED - This parameter will be removed in a future release. None
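
A minimal ingestion sketch; the dataset and sample URIs are hypothetical:

```python
import tiledbvcf

# Ingestion requires write mode.
ds = tiledbvcf.Dataset("s3://my-bucket/my-vcf-dataset", mode="w")

ds.ingest_samples(
    sample_uris=[
        "s3://my-bucket/vcfs/sample1.vcf.gz",
        "s3://my-bucket/vcfs/sample2.vcf.gz",
    ],
    sample_batch_size=10,
    resume=True,  # safe to re-run after a partially completed ingestion
)
```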

read

Dataset.read(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a pandas DataFrame.

For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned DataFrame will contain as many results as possible; to retrieve the remaining results, use the continue_read() function.

You can also use the Python generator version, read_iter().

Parameters
Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
bed_array str URI of a BED array of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
Returns
Type Description
pd.DataFrame Query results as a pandas DataFrame.
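
A sketch of the full pattern, including the incomplete-read loop; the region and attribute names are illustrative, and ds is assumed to be open for reading:

```python
import pandas as pd

# The first call returns as many results as fit in the memory budget.
batches = [
    ds.read(
        attrs=["sample_name", "contig", "pos_start", "alleles"],
        regions=["chr1:1-1000000"],
    )
]

# Keep reading until the query has returned all results.
while not ds.read_completed():
    batches.append(ds.continue_read())

df = pd.concat(batches, ignore_index=True)
```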

read_allele_count

Dataset.read_allele_count(region=None)

Read allele count from the dataset into a pandas DataFrame.

Parameters
Name Type Description Default
region str Genomic region to be queried. None

read_arrow

Dataset.read_arrow(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a PyArrow Table.

For large queries, a call to read_arrow() may not be able to fit all results in memory. In that case, the returned table will contain as many results as possible; to retrieve the remaining results, use the continue_read_arrow() function.

Parameters
Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
bed_array str URI of a BED array of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
Returns
Type Description
pa.Table Query results as a PyArrow Table.
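
The Arrow variant of the same incomplete-read loop, assuming ds is open for reading (the region is illustrative):

```python
import pyarrow as pa

# Same pattern as read(), but each batch is a PyArrow Table.
tables = [ds.read_arrow(regions=["chr1:1-1000000"])]
while not ds.read_completed():
    tables.append(ds.continue_read_arrow())

table = pa.concat_tables(tables)
```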

read_completed

Dataset.read_completed()

Returns True if the previous read operation was complete. A read is considered complete if the resulting DataFrame contains all results.

Returns
Type Description
bool True if the previous read operation was complete.

read_iter

Dataset.read_iter(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None)

Iterator version of read().

Parameters
Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
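
For example, assuming ds is open for reading; the region and the per-batch handler are hypothetical:

```python
# Stream result batches without managing the incomplete-read loop.
for batch in ds.read_iter(regions=["chr1:1-1000000"]):
    handle_batch(batch)  # hypothetical handler; each batch is a DataFrame
```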

read_variant_stats

Dataset.read_variant_stats(region=None)

Read variant stats from the dataset into a pandas DataFrame.

Parameters
Name Type Description Default
region str Genomic region to be queried. None
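
Both read_allele_count() and read_variant_stats() follow the same pattern; for example, assuming ds is open for reading (the region is illustrative):

```python
# Allele counts and variant statistics for one region,
# each returned as a pandas DataFrame.
allele_counts = ds.read_allele_count(region="chr1:1-1000000")
variant_stats = ds.read_variant_stats(region="chr1:1-1000000")
```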

sample_count

Dataset.sample_count()

Get the number of samples in the dataset.

Returns
Type Description
int Number of samples in the dataset.

samples

Dataset.samples()

Get the list of samples in the dataset.

Returns
Type Description
list List of samples in the dataset.

schema_version

Dataset.schema_version()

Get the VCF schema version of the dataset.

Returns
Type Description
int VCF schema version of the dataset.

tiledb_stats

Dataset.tiledb_stats()

Get TileDB stats as a string.

Returns
Type Description
str TileDB stats as a string.

version

Dataset.version()

Return the TileDB-VCF version used to create the dataset.

Returns
Type Description
str The TileDB-VCF version.
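
The informational getters combine into a quick dataset summary, assuming ds is open for reading:

```python
print(ds.samples())         # list of sample names
print(ds.sample_count())    # number of samples
print(ds.schema_version())  # VCF schema version of the dataset
print(ds.version())         # TileDB-VCF version used to create the dataset
print(ds.tiledb_stats())    # stats string (see the stats flag on the constructor)
```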

ReadConfig

ReadConfig

Config settings for a TileDB-VCF dataset.

Attributes

Name Type Description
limit int Max number of records (rows) to read.
region_partition tuple Region partition tuple (idx, num_partitions).
sample_partition tuple Sample partition tuple (idx, num_partitions).
sort_regions bool Whether to sort the regions to be read (default True).
memory_budget_mb int Memory budget (MB) for buffer and internal allocations (default 2048 MB).
tiledb_config List[str] List of strings of the format 'option=value'.
buffer_percentage int Percentage of memory to dedicate to TileDB query buffers (default 25).
tiledb_tile_cache_percentage int Percentage of memory to dedicate to the TileDB tile cache (default 10).
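
For example, a sketch of a partitioned read; the partition indices and the TileDB config option are illustrative:

```python
import tiledbvcf

# Worker 0 of 4: read only the first of four region partitions.
cfg = tiledbvcf.ReadConfig(
    region_partition=(0, 4),
    memory_budget_mb=4096,
    tiledb_config=["vfs.s3.region=us-east-1"],
)
ds = tiledbvcf.Dataset("s3://my-bucket/my-vcf-dataset", mode="r", cfg=cfg)
```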

config_logging

config_logging(level='fatal', log_file='')

Configure tiledbvcf logging.

Parameters

Name Type Description Default
level str Log level: one of fatal, error, warn, info, debug, or trace. 'fatal'
log_file str Log file path. ''
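
For example, to enable debug logging to a local file before opening a dataset:

```python
import tiledbvcf

# Route debug-level logs to a file in the working directory.
tiledbvcf.config_logging(level="debug", log_file="tiledbvcf.log")
```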