Python API
Dataset provides read/write access to a TileDB-VCF dataset.
ReadConfig provides config settings for a TileDB-VCF dataset.
config_logging is used to configure TileDB-VCF logging.
Dataset(self, uri, mode='r', cfg=None, stats=False, verbose=False, tiledb_config=None)
A class that provides read/write access to a TileDB-VCF dataset.
Name | Type | Description | Default |
uri | str | URI of the dataset. | required |
mode | str | Mode of operation ('r' or 'w'). | 'r' |
cfg | ReadConfig | TileDB-VCF configuration. | None |
stats | bool | Enable internal TileDB statistics. | False |
verbose | bool | Enable verbose output. | False |
tiledb_config | dict | TileDB configuration, alternative to cfg.tiledb_config. | None |
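As a minimal sketch of the constructor parameters above, the helper below opens a dataset in read mode. It assumes the package is importable as `tiledbvcf`; the helper name `open_for_read` and its `dataset_cls` argument are hypothetical and exist only so the sketch can be exercised without a real dataset.

```python
def open_for_read(uri, tiledb_config=None, dataset_cls=None):
    """Open a TileDB-VCF dataset in read mode.

    dataset_cls is a hypothetical injection point for testing; when it is
    omitted, tiledbvcf.Dataset is imported lazily and used.
    """
    if dataset_cls is None:
        import tiledbvcf  # assumption: the package is installed under this name

        dataset_cls = tiledbvcf.Dataset
    # mode="r" is the documented default; it is passed explicitly for clarity.
    return dataset_cls(uri, mode="r", tiledb_config=tiledb_config)
```

The `tiledb_config` dict is passed through unchanged, so any TileDB option keys it contains are the caller's responsibility.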
Name | Description |
attributes | Return a list of queryable attributes available in the VCF dataset. |
continue_read | Continue an incomplete read. |
continue_read_arrow | Continue an incomplete read. |
count | Count records in the dataset. |
create_dataset | Create a new dataset. |
export | Exports data to multiple VCF files or a combined VCF file. |
ingest_samples | Ingest VCF files into the dataset. |
read | Read data from the dataset into a pandas DataFrame. |
read_allele_count | Read allele count from the dataset into a pandas DataFrame. |
read_arrow | Read data from the dataset into a PyArrow Table. |
read_completed | Returns true if the previous read operation was complete. |
read_iter | Iterator version of read(). |
read_variant_stats | Read variant stats from the dataset into a pandas DataFrame. |
sample_count | Get the number of samples in the dataset. |
samples | Get the list of samples in the dataset. |
schema_version | Get the VCF schema version of the dataset. |
tiledb_stats | Get TileDB stats as a string. |
version | Return the TileDB-VCF version used to create the dataset. |
Dataset.attributes(attr_type='all')
Return a list of queryable attributes available in the VCF dataset.
Name | Type | Description | Default |
attr_type | str | The subset of attributes to retrieve: 'info' or 'fmt' retrieves only attributes ingested from the VCF INFO or FORMAT fields, respectively; 'builtin' retrieves the static attributes defined in TileDB-VCF's schema; 'all' (the default) returns all queryable attributes. | 'all' |
Type | Description |
list | A list of attribute names. |
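One use for the attribute list is checking a set of requested names before issuing a read. The sketch below assumes only that `ds` exposes `attributes(attr_type=...)` as documented above; the helper name `missing_attributes` is hypothetical.

```python
def missing_attributes(ds, wanted, attr_type="all"):
    """Return the names in `wanted` that the dataset does not report as queryable.

    `ds` is any object with an attributes(attr_type=...) method, such as an
    open Dataset; the order of `wanted` is preserved in the result.
    """
    available = set(ds.attributes(attr_type=attr_type))
    return [name for name in wanted if name not in available]
```

An empty result means every requested name can be passed in `attrs` to a read call.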
Dataset.continue_read(release_buffers=True)
Continue an incomplete read.
Name | Type | Description | Default |
release_buffers | bool | Release the buffers after reading. | True |
Type | Description |
pd.DataFrame | The next batch of data as a pandas DataFrame. |
Dataset.continue_read_arrow(release_buffers=True)
Continue an incomplete read.
Name | Type | Description | Default |
release_buffers | bool | Release the buffers after reading. | True |
Type | Description |
pa.Table | The next batch of data as a PyArrow Table. |
Dataset.count(samples=None, regions=None)
Count records in the dataset.
Name | Type | Description | Default |
samples | (str, List[str]) | Sample names to include in the count. | None |
regions | (str, List[str]) | Genomic regions to include in the count. | None |
Type | Description |
int | Number of intersecting records in the dataset. |
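Because `count()` accepts a list of regions and returns a single total, a per-region breakdown needs one call per region. The following is a sketch under that reading; the helper name `count_by_region` is hypothetical, and the region string format is whatever the dataset accepts (this section does not specify it).

```python
def count_by_region(ds, regions, samples=None):
    """Return a dict mapping each region string to its record count.

    Calls ds.count() once per region, restricting to `samples` when given.
    """
    return {
        region: ds.count(samples=samples, regions=[region])
        for region in regions
    }
```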
Dataset.create_dataset(extra_attrs=None, vcf_attrs=None, tile_capacity=10000, anchor_gap=1000, checksum_type='sha256', allow_duplicates=True, enable_allele_count=True, enable_variant_stats=True, compress_sample_dim=True, compression_level=4)
Create a new dataset.
Name | Type | Description | Default |
extra_attrs | str | CSV list of extra attributes to materialize from fmt and info fields. | None |
vcf_attrs | str | URI of VCF file with all fmt and info fields to materialize in the dataset. | None |
tile_capacity | int | Tile capacity to use for the array schema. | 10000 |
anchor_gap | int | Length of gaps between inserted anchor records in bases. | 1000 |
checksum_type | str | Optional checksum type for the dataset, 'sha256' or 'md5'. | 'sha256' |
allow_duplicates | bool | Allow records with duplicate start positions to be written to the array. | True |
enable_allele_count | bool | Enable the allele count ingestion task. | True |
enable_variant_stats | bool | Enable the variant stats ingestion task. | True |
compress_sample_dim | bool | Enable compression on the sample dimension. | True |
compression_level | int | Compression level for zstd compression. | 4 |
Dataset.export(samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, enable_progress_estimation=False, merge=False, output_format='z', output_path='', output_dir='.')
Exports data to multiple VCF files or a combined VCF file.
Name | Type | Description | Default |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
bed_array | str | URI of a BED array of genomic regions to be read. | None |
skip_check_samples | bool | Skip checking if the samples in samples_file exist in the dataset. | False |
set_af_filter | | Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. | required |
scan_all_samples | | Scan all samples when computing internal allele frequency. | required |
enable_progress_estimation | bool | DEPRECATED - This parameter will be removed in a future release. | False |
merge | bool | Merge samples to create a combined VCF file. | False |
output_format | str | Export file format: 'b': .bcf (compressed), 'u': .bcf, 'z': .vcf.gz, 'v': .vcf. | 'z' |
output_path | str | Combined VCF output file. | '' |
output_dir | str | Directory used for local output of exported samples. | '.' |
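The `output_format` codes are easy to mistype, so it can help to validate them before calling `export()`. The sketch below encodes the code-to-extension mapping from the table above; the names `EXPORT_EXTENSIONS` and `export_kwargs` are hypothetical helpers, not part of the API.

```python
# Mapping of output_format codes to file extensions, taken from the table above.
EXPORT_EXTENSIONS = {"b": ".bcf", "u": ".bcf", "z": ".vcf.gz", "v": ".vcf"}


def export_kwargs(samples, output_dir=".", output_format="z"):
    """Build keyword arguments for Dataset.export(), rejecting unknown formats."""
    if output_format not in EXPORT_EXTENSIONS:
        raise ValueError(
            f"output_format must be one of {sorted(EXPORT_EXTENSIONS)}, "
            f"got {output_format!r}"
        )
    return {
        "samples": list(samples),
        "output_dir": output_dir,
        "output_format": output_format,
    }
```

With an open dataset `ds`, the result would be used as `ds.export(**export_kwargs(["sampleA"], output_dir="out"))`, where the sample name and directory are placeholders.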
Dataset.ingest_samples(sample_uris=None, threads=None, total_memory_budget_mb=None, total_memory_percentage=None, ratio_tiledb_memory=None, max_tiledb_memory_mb=None, input_record_buffer_mb=None, avg_vcf_record_size=None, ratio_task_size=None, ratio_output_flush=None, scratch_space_path=None, scratch_space_size=None, sample_batch_size=None, resume=False, contig_fragment_merging=True, contigs_to_keep_separate=None, contigs_to_allow_merging=None, contig_mode='all', thread_task_size=None, memory_budget_mb=None, record_limit=None)
Ingest VCF files into the dataset.
Name | Type | Description | Default |
sample_uris | List[str] | List of sample URIs to ingest. | None |
threads | int | Set the number of threads used for ingestion. | None |
total_memory_budget_mb | int | Total memory budget for ingestion (MiB). | None |
total_memory_percentage | float | Percentage of total system memory used for ingestion (overrides 'total_memory_budget_mb'). | None |
ratio_tiledb_memory | float | Ratio of memory budget allocated to TileDB::sm.mem.total_budget. | None |
max_tiledb_memory_mb | int | Maximum memory allocated to TileDB::sm.mem.total_budget (MiB). | None |
input_record_buffer_mb | int | Size of input record buffer for each sample file (MiB). | None |
avg_vcf_record_size | int | Average VCF record size (bytes). | None |
ratio_task_size | float | Ratio of worker task size to computed task size. | None |
ratio_output_flush | float | Ratio of output buffer capacity that triggers a flush to TileDB. | None |
scratch_space_path | str | Directory used for local storage of downloaded remote samples. | None |
scratch_space_size | int | Amount of local storage that can be used for downloading remote samples (MB). | None |
sample_batch_size | int | Number of samples per batch for ingestion (default 10). | None |
resume | bool | Whether to check for and attempt to resume a partially completed ingestion. | False |
contig_fragment_merging | bool | Whether to enable merging of contigs into fragments. This overrides the contigs_to_keep_separate and contigs_to_allow_merging options. Contig fragment merging is generally beneficial: it is a performance optimization that reduces the number of prefixes on an S3/Azure/GCS bucket when there are many small pseudo-contigs. | True |
contigs_to_keep_separate | List[str] | List of contigs that should not be merged into combined fragments. The default list includes all standard human chromosomes in both UCSC (e.g., chr1) and Ensembl (e.g., 1) formats. | None |
contigs_to_allow_merging | List[str] | List of contigs that should be allowed to be merged into combined fragments. | None |
contig_mode | str | Select which contigs are ingested: 'all', 'separate', or 'merged'. | 'all' |
thread_task_size | int | DEPRECATED - This parameter will be removed in a future release. | None |
memory_budget_mb | int | DEPRECATED - This parameter will be removed in a future release. | None |
record_limit | int | DEPRECATED - This parameter will be removed in a future release. | None |
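Creating a dataset and ingesting samples are typically done together on a dataset opened in write mode. The following is a sketch of that sequence, assuming the package is importable as `tiledbvcf`; the helper name `create_and_ingest` and its `dataset_cls` argument are hypothetical, the latter present only so the flow can be exercised without real VCF files.

```python
def create_and_ingest(uri, sample_uris, dataset_cls=None, **ingest_options):
    """Create a new dataset at `uri` and ingest the given VCF sample URIs.

    Extra keyword arguments (for example threads=4) are forwarded to
    ingest_samples(). dataset_cls is a hypothetical hook for testing.
    """
    if dataset_cls is None:
        import tiledbvcf  # assumption: the package is installed under this name

        dataset_cls = tiledbvcf.Dataset
    ds = dataset_cls(uri, mode="w")  # write mode is required for creation and ingestion
    ds.create_dataset()  # all schema options left at their documented defaults
    ds.ingest_samples(sample_uris=list(sample_uris), **ingest_options)
    return ds
```

Schema choices such as `extra_attrs` or `checksum_type` are fixed at creation time in this flow, so a real script would pass them to `create_dataset()` rather than relying on the defaults.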
Dataset.read(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)
Read data from the dataset into a pandas DataFrame.
For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned DataFrame contains as many results as possible; to retrieve the remaining results, call continue_read() until read_completed() returns True. You can also use the Python generator version, read_iter().
Name | Type | Description | Default |
attrs | List[str] | List of attribute names to be read. | DEFAULT_ATTRS |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
bed_array | str | URI of a BED array of genomic regions to be read. | None |
skip_check_samples | bool | Skip checking if the samples in samples_file exist in the dataset. | False |
set_af_filter | str | Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. | '' |
enable_progress_estimation | bool | DEPRECATED - This parameter will be removed in a future release. | False |
Type | Description |
pd.DataFrame | Query results as a pandas DataFrame. |
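The incomplete-read pattern can be sketched as a loop over `read()`, `read_completed()`, and `continue_read()`. The helper name `read_all_batches` is hypothetical; `ds` is assumed to be an open Dataset (or any object with those three methods).

```python
def read_all_batches(ds, **read_kwargs):
    """Run a read to completion and return every batch in order.

    The first batch comes from read(); further batches are fetched with
    continue_read() until read_completed() reports that nothing is left.
    """
    batches = [ds.read(**read_kwargs)]
    while not ds.read_completed():
        batches.append(ds.continue_read())
    return batches
```

The batches can then be combined with `pandas.concat(batches, ignore_index=True)`, provided the full result fits in memory; otherwise each batch should be processed and discarded in turn.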
Dataset.read_allele_count(region=None)
Read allele count from the dataset into a pandas DataFrame.
Name | Type | Description | Default |
region | str | Genomic region to be queried. | None |
Dataset.read_arrow(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)
Read data from the dataset into a PyArrow Table.
For large queries, a call to read_arrow() may not be able to fit all results in memory. In that case, the returned table contains as many results as possible; to retrieve the remaining results as PyArrow Tables, call continue_read_arrow() until read_completed() returns True.
Name | Type | Description | Default |
attrs | List[str] | List of attribute names to be read. | DEFAULT_ATTRS |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
bed_array | str | URI of a BED array of genomic regions to be read. | None |
skip_check_samples | bool | Skip checking if the samples in samples_file exist in the dataset. | False |
set_af_filter | str | Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. | '' |
scan_all_samples | bool | Scan all samples when computing internal allele frequency. | False |
enable_progress_estimation | bool | DEPRECATED - This parameter will be removed in a future release. | False |
Type | Description |
pa.Table | Query results as a PyArrow Table. |
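When a query is too large to hold in memory at once, the Arrow batches can be consumed lazily instead of collected. The generator below is a sketch of that approach using `read_arrow()`, `read_completed()`, and `continue_read_arrow()`; the name `iter_arrow_batches` is hypothetical.

```python
def iter_arrow_batches(ds, **read_kwargs):
    """Yield PyArrow Tables one at a time until the read is complete.

    Only one batch is handed to the caller at a time, so each table can be
    processed and dropped before the next one is requested.
    """
    yield ds.read_arrow(**read_kwargs)
    while not ds.read_completed():
        yield ds.continue_read_arrow()
```

Because this is a generator, the next batch is fetched only when the caller asks for it; if the full result does fit in memory, `pyarrow.concat_tables(list(iter_arrow_batches(ds)))` combines the batches into one table.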
Dataset.read_completed()
Returns True if the previous read operation was complete. A read is considered complete if the resulting DataFrame contained all results.
Type | Description |
bool | True if the previous read operation was complete. |
Dataset.read_iter(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None)
Iterator version of read().
Name | Type | Description | Default |
attrs | List[str] | List of attribute names to be read. | DEFAULT_ATTRS |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
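With `read_iter()`, the batching loop is handled by the iterator, which suits streaming aggregations. The sketch below totals the rows across all batches without keeping any of them; the helper name `count_rows` is hypothetical, and each batch is assumed to support `len()` (as a pandas DataFrame does).

```python
def count_rows(ds, **read_kwargs):
    """Return the total number of rows produced by ds.read_iter().

    Each batch is measured and then discarded, so memory use stays bounded
    by the size of a single batch.
    """
    return sum(len(batch) for batch in ds.read_iter(**read_kwargs))
```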
Dataset.read_variant_stats(region=None)
Read variant stats from the dataset into a pandas DataFrame.
Name | Type | Description | Default |
region | str | Genomic region to be queried. | None |
Dataset.sample_count()
Get the number of samples in the dataset.
Type | Description |
int | Number of samples in the dataset. |
Dataset.samples()
Get the list of samples in the dataset.
Type | Description |
list | List of samples in the dataset. |
Dataset.schema_version()
Get the VCF schema version of the dataset.
Type | Description |
int | VCF schema version of the dataset. |
Dataset.tiledb_stats()
Get TileDB stats as a string.
Type | Description |
str | TileDB stats as a string. |
Dataset.version()
Return the TileDB-VCF version used to create the dataset.
Type | Description |
str | The TileDB-VCF version. |
ReadConfig
Config settings for a TileDB-VCF dataset.
Name | Type | Description |
limit | int | Max number of records (rows) to read |
region_partition | tuple | Region partition tuple (idx, num_partitions) |
sample_partition | tuple | Samples partition tuple (idx, num_partitions) |
sort_regions | bool | Whether or not to sort the regions to be read, default True |
memory_budget_mb | int | Memory budget (MB) for buffer and internal allocations, default 2048MB |
tiledb_config | typing.List[str] | List of strings of format ‘option=value’ |
buffer_percentage | int | Percentage of memory to dedicate to TileDB Query Buffers, default 25 |
tiledb_tile_cache_percentage | int | Percentage of memory to dedicate to TileDB Tile Cache, default 10 |
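Two of these settings have shapes that are easy to get wrong: the partition tuples and the 'option=value' strings. The helpers below are a sketch of building both; their names are hypothetical, and the zero-based partition index is an assumption that this section does not confirm.

```python
def partition_tuples(num_partitions):
    """Return one (idx, num_partitions) tuple per partition.

    Assumption: idx is zero-based, so 4 partitions give idx 0 through 3.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return [(idx, num_partitions) for idx in range(num_partitions)]


def tiledb_config_strings(options):
    """Convert a dict of TileDB options into 'option=value' strings."""
    return [f"{key}={value}" for key, value in options.items()]
```

Each tuple could then be given to one worker, for example as `ReadConfig(region_partition=partition_tuples(4)[0])`, assuming ReadConfig accepts its settings as keyword arguments.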
config_logging(level='fatal', log_file='')
Configure tiledbvcf logging.
Name | Type | Description | Default |
level | str | Log level from (fatal|error|warn|info|debug|trace) | 'fatal' |
log_file | str | Log file path. | '' |
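A small wrapper can catch a misspelled level before it reaches the library. The sketch below validates against the levels listed above and then calls `config_logging`; the names `LOG_LEVELS` and `enable_logging` are hypothetical, as is the `config_logging` argument, which exists only so the wrapper can be exercised without the package installed.

```python
# Levels accepted by config_logging, as listed in the table above.
LOG_LEVELS = ("fatal", "error", "warn", "info", "debug", "trace")


def enable_logging(level="info", log_file="", config_logging=None):
    """Validate `level`, then configure TileDB-VCF logging.

    config_logging is a hypothetical hook for testing; when omitted,
    tiledbvcf.config_logging is imported lazily and used.
    """
    if level not in LOG_LEVELS:
        raise ValueError(f"level must be one of {LOG_LEVELS}, got {level!r}")
    if config_logging is None:
        import tiledbvcf  # assumption: the package is installed under this name

        config_logging = tiledbvcf.config_logging
    config_logging(level, log_file)
```

Note that the wrapper defaults to 'info' rather than the library default of 'fatal', on the assumption that anyone calling it wants more output than the default provides.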