Python API

Classes:

  • Dataset provides read/write access to a TileDB-VCF dataset.
  • ReadConfig provides config settings for a TileDB-VCF dataset.
  • config_logging is used to configure TileDB-VCF logging.

Dataset

Dataset(self, uri, mode='r', cfg=None, stats=False, verbose=False, tiledb_config=None)

A class that provides read/write access to a TileDB-VCF dataset.

Parameters

Name Type Description Default
uri str URI of the dataset. required
mode str Mode of operation ('r'|'w'). 'r'
cfg ReadConfig TileDB-VCF configuration. None
stats bool Enable internal TileDB statistics. False
verbose bool Enable verbose output. False
tiledb_config dict TileDB configuration, alternative to cfg.tiledb_config. None
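
For illustration, a minimal sketch of opening a dataset for reading; the URI and configuration values are hypothetical (local paths, S3 URIs, and TileDB Cloud URIs all work):

```python
import tiledbvcf

# Hypothetical URI for an existing TileDB-VCF dataset.
uri = "s3://my-bucket/my-vcf-dataset"

# Optional read configuration (see ReadConfig below).
cfg = tiledbvcf.ReadConfig(memory_budget_mb=4096)

ds = tiledbvcf.Dataset(uri, mode="r", cfg=cfg)
```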

Methods

Name Description
attributes Return a list of queryable attributes available in the VCF dataset.
continue_read Continue an incomplete read, returning a pandas DataFrame.
continue_read_arrow Continue an incomplete read, returning a PyArrow Table.
count Count records in the dataset.
create_dataset Create a new dataset.
export Export data to multiple VCF files or a combined VCF file.
ingest_samples Ingest VCF files into the dataset.
read Read data from the dataset into a pandas DataFrame.
read_allele_count Read allele count from the dataset into a pandas DataFrame.
read_arrow Read data from the dataset into a PyArrow Table.
read_completed Returns True if the previous read operation was complete.
read_iter Iterator version of read().
read_variant_stats Read variant stats from the dataset into a pandas DataFrame.
sample_count Get the number of samples in the dataset.
samples Get the list of samples in the dataset.
schema_version Get the VCF schema version of the dataset.
tiledb_stats Get TileDB stats as a string.
version Return the TileDB-VCF version used to create the dataset.

attributes

Dataset.attributes(attr_type='all')

Return a list of queryable attributes available in the VCF dataset.

Parameters
Name Type Description Default
attr_type str The subset of attributes to retrieve: 'info' or 'fmt' retrieve only attributes ingested from the VCF INFO and FORMAT fields, respectively; 'builtin' retrieves the static attributes defined in TileDB-VCF's schema; 'all' (the default) returns all queryable attributes. 'all'
Returns
Type Description
list A list of attribute names.
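
For example, assuming ds is a Dataset opened in read mode as above:

```python
# All queryable attributes.
print(ds.attributes())

# Only attributes ingested from the VCF INFO fields.
print(ds.attributes(attr_type="info"))

# Only the static attributes defined by the TileDB-VCF schema.
print(ds.attributes(attr_type="builtin"))
```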

continue_read

Dataset.continue_read(release_buffers=True)

Continue an incomplete read.

Parameters
Name Type Description Default
release_buffers bool Release the buffers after reading. True
Returns
Type Description
pd.DataFrame The next batch of data as a pandas DataFrame.

continue_read_arrow

Dataset.continue_read_arrow(release_buffers=True)

Continue an incomplete read.

Parameters
Name Type Description Default
release_buffers bool Release the buffers after reading. True
Returns
Type Description
pa.Table The next batch of data as a PyArrow Table.

count

Dataset.count(samples=None, regions=None)

Count records in the dataset.

Parameters
Name Type Description Default
samples (str, List[str]) Sample names to include in the count. None
regions (str, List[str]) Genomic regions to include in the count. None
Returns
Type Description
int Number of intersecting records in the dataset.
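
A small example, with hypothetical sample names and region, assuming ds is open for reading:

```python
# Count records for two samples intersecting a 1 Mb region.
n = ds.count(
    samples=["HG00096", "HG00097"],
    regions=["chr1:1-1000000"],
)
print(f"{n} intersecting records")
```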

create_dataset

Dataset.create_dataset(extra_attrs=None, vcf_attrs=None, tile_capacity=10000, anchor_gap=1000, checksum_type='sha256', allow_duplicates=True, enable_allele_count=True, enable_variant_stats=True, compress_sample_dim=True, compression_level=4)

Create a new dataset.

Parameters
Name Type Description Default
extra_attrs str CSV list of extra attributes to materialize from fmt and info fields. None
vcf_attrs str URI of VCF file with all fmt and info fields to materialize in the dataset. None
tile_capacity int Tile capacity to use for the array schema. 10000
anchor_gap int Length of gaps between inserted anchor records in bases. 1000
checksum_type str Optional checksum type for the dataset, “sha256” or “md5”. 'sha256'
allow_duplicates bool Allow records with duplicate start positions to be written to the array. True
enable_allele_count bool Enable the allele count ingestion task. True
enable_variant_stats bool Enable the variant stats ingestion task. True
compress_sample_dim bool Enable compression on the sample dimension. True
compression_level int Compression level for zstd compression. 4
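
A minimal sketch of creating a dataset; the URI and the extra FORMAT attributes are hypothetical:

```python
import tiledbvcf

# Creating a dataset requires write mode.
ds = tiledbvcf.Dataset("s3://my-bucket/new-vcf-dataset", mode="w")

ds.create_dataset(
    extra_attrs="fmt_DP,fmt_GQ",  # hypothetical FORMAT fields to materialize
    tile_capacity=10000,
    checksum_type="sha256",
)
```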

export

Dataset.export(samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False, merge=False, output_format='z', output_path='', output_dir='.')

Exports data to multiple VCF files or a combined VCF file.

Parameters
Name Type Description Default
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
bed_array str URI of a BED array of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
merge bool Merge samples to create a combined VCF file. False
output_format str Export file format: 'b': compressed BCF (.bcf), 'u': uncompressed BCF (.bcf), 'z': compressed VCF (.vcf.gz), 'v': uncompressed VCF (.vcf). 'z'
output_path str Combined VCF output file. ''
output_dir str Directory used for local output of exported samples. '.'
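
For example, to export two hypothetical samples over one region to a single merged, compressed VCF file (assuming ds is open for reading):

```python
# Merge the selected samples into one combined output file.
ds.export(
    samples=["HG00096", "HG00097"],
    regions=["chr1:1-1000000"],
    merge=True,
    output_format="z",           # compressed VCF (.vcf.gz)
    output_path="cohort.vcf.gz",
)
```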

ingest_samples

Dataset.ingest_samples(sample_uris=None, threads=None, total_memory_budget_mb=None, total_memory_percentage=None, ratio_tiledb_memory=None, max_tiledb_memory_mb=None, input_record_buffer_mb=None, avg_vcf_record_size=None, ratio_task_size=None, ratio_output_flush=None, scratch_space_path=None, scratch_space_size=None, sample_batch_size=None, resume=False, contig_fragment_merging=True, contigs_to_keep_separate=None, contigs_to_allow_merging=None, contig_mode='all', thread_task_size=None, memory_budget_mb=None, record_limit=None)

Ingest VCF files into the dataset.

Parameters
Name Type Description Default
sample_uris List[str] List of sample URIs to ingest. None
threads int Set the number of threads used for ingestion. None
total_memory_budget_mb int Total memory budget for ingestion (MiB). None
total_memory_percentage float Percentage of total system memory used for ingestion (overrides ‘total_memory_budget_mb’). None
ratio_tiledb_memory float Ratio of memory budget allocated to TileDB::sm.mem.total_budget. None
max_tiledb_memory_mb int Maximum memory allocated to TileDB::sm.mem.total_budget (MiB). None
input_record_buffer_mb int Size of input record buffer for each sample file (MiB). None
avg_vcf_record_size int Average VCF record size (bytes). None
ratio_task_size float Ratio of worker task size to computed task size. None
ratio_output_flush float Ratio of output buffer capacity that triggers a flush to TileDB. None
scratch_space_path str Directory used for local storage of downloaded remote samples. None
scratch_space_size int Amount of local storage that can be used for downloading remote samples (MB). None
sample_batch_size int Number of samples per batch for ingestion (default 10). None
resume bool Whether to check for and attempt to resume a partially completed ingestion. False
contig_fragment_merging bool Whether to enable merging of contigs into fragments. This overrides the contigs_to_keep_separate and contigs_to_allow_merging options. Contig fragment merging is generally beneficial: it is a performance optimization that reduces the number of prefixes in an S3, Azure, or GCS bucket when the dataset contains a large number of small pseudo-contigs. True
contigs_to_keep_separate List[str] List of contigs that should not be merged into combined fragments. The default list includes all standard human chromosomes in both UCSC (e.g., chr1) and Ensembl (e.g., 1) formats. None
contigs_to_allow_merging List[str] List of contigs that should be allowed to be merged into combined fragments. None
contig_mode str Select which contigs are ingested: ‘all’, ‘separate’, or ‘merged’. 'all'
thread_task_size int DEPRECATED - This parameter will be removed in a future release. None
memory_budget_mb int DEPRECATED - This parameter will be removed in a future release. None
record_limit int DEPRECATED - This parameter will be removed in a future release. None
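
A minimal ingestion sketch; the dataset and sample URIs are hypothetical:

```python
import tiledbvcf

# Ingestion requires write mode.
ds = tiledbvcf.Dataset("s3://my-bucket/my-vcf-dataset", mode="w")

ds.ingest_samples(
    sample_uris=[
        "s3://my-bucket/vcfs/sample1.vcf.gz",
        "s3://my-bucket/vcfs/sample2.vcf.gz",
    ],
    sample_batch_size=10,
    resume=True,  # safe to re-run after a partially completed ingestion
)
```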

read

Dataset.read(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a pandas DataFrame.

For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned DataFrame will contain as many results as possible; to retrieve the remaining results, use the continue_read() function.

You can also use the Python generator version, read_iter().

Parameters
Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
bed_array str URI of a BED array of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
Returns
Type Description
pd.DataFrame Query results as a pandas DataFrame.
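
A sketch of the full pattern, including the incomplete-read loop; the region and attribute names are illustrative, and ds is assumed to be open for reading:

```python
import pandas as pd

# The first call returns as many results as fit in the memory budget.
batches = [
    ds.read(
        attrs=["sample_name", "contig", "pos_start", "alleles"],
        regions=["chr1:1-1000000"],
    )
]

# Keep reading until the query has returned all results.
while not ds.read_completed():
    batches.append(ds.continue_read())

df = pd.concat(batches, ignore_index=True)
```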

read_allele_count

Dataset.read_allele_count(region=None)

Read allele count from the dataset into a pandas DataFrame.

Parameters
Name Type Description Default
region str Genomic region to be queried. None

read_arrow

Dataset.read_arrow(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, bed_array=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a PyArrow Table.

For large queries, a call to read_arrow() may not be able to fit all results in memory. In that case, the returned table will contain as many results as possible; to retrieve the remaining results, use the continue_read_arrow() function.

Parameters
Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
bed_array str URI of a BED array of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
Returns
Type Description
pa.Table Query results as a PyArrow Table.
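
The Arrow variant of the same incomplete-read loop, assuming ds is open for reading (the region is illustrative):

```python
import pyarrow as pa

# Same pattern as read(), but each batch is a PyArrow Table.
tables = [ds.read_arrow(regions=["chr1:1-1000000"])]
while not ds.read_completed():
    tables.append(ds.continue_read_arrow())

table = pa.concat_tables(tables)
```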

read_completed

Dataset.read_completed()

Returns True if the previous read operation was complete. A read is considered complete if the resulting DataFrame contains all results.

Returns
Type Description
bool True if the previous read operation was complete.

read_iter

Dataset.read_iter(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None)

Iterator version of read().

Parameters
Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
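
For example, assuming ds is open for reading; the region and the per-batch handler are hypothetical:

```python
# Stream result batches without managing the incomplete-read loop.
for batch in ds.read_iter(regions=["chr1:1-1000000"]):
    handle_batch(batch)  # hypothetical handler; each batch is a DataFrame
```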

read_variant_stats

Dataset.read_variant_stats(region=None)

Read variant stats from the dataset into a pandas DataFrame.

Parameters
Name Type Description Default
region str Genomic region to be queried. None
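
Both read_allele_count() and read_variant_stats() follow the same pattern; for example, assuming ds is open for reading (the region is illustrative):

```python
# Allele counts and variant statistics for one region,
# each returned as a pandas DataFrame.
allele_counts = ds.read_allele_count(region="chr1:1-1000000")
variant_stats = ds.read_variant_stats(region="chr1:1-1000000")
```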

sample_count

Dataset.sample_count()

Get the number of samples in the dataset.

Returns
Type Description
int Number of samples in the dataset.

samples

Dataset.samples()

Get the list of samples in the dataset.

Returns
Type Description
list List of samples in the dataset.

schema_version

Dataset.schema_version()

Get the VCF schema version of the dataset.

Returns
Type Description
int VCF schema version of the dataset.

tiledb_stats

Dataset.tiledb_stats()

Get TileDB stats as a string.

Returns
Type Description
str TileDB stats as a string.

version

Dataset.version()

Return the TileDB-VCF version used to create the dataset.

Returns
Type Description
str The TileDB-VCF version.
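
The informational getters combine into a quick dataset summary, assuming ds is open for reading:

```python
print(ds.samples())         # list of sample names
print(ds.sample_count())    # number of samples
print(ds.schema_version())  # VCF schema version of the dataset
print(ds.version())         # TileDB-VCF version used to create the dataset
print(ds.tiledb_stats())    # stats string (see the stats flag on the constructor)
```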

ReadConfig

ReadConfig

Config settings for a TileDB-VCF dataset.

Attributes

Name Type Description
limit int Max number of records (rows) to read.
region_partition tuple Region partition tuple (idx, num_partitions).
sample_partition tuple Sample partition tuple (idx, num_partitions).
sort_regions bool Whether to sort the regions to be read (default True).
memory_budget_mb int Memory budget (MB) for buffer and internal allocations (default 2048 MB).
tiledb_config List[str] List of strings of the format 'option=value'.
buffer_percentage int Percentage of memory to dedicate to TileDB query buffers (default 25).
tiledb_tile_cache_percentage int Percentage of memory to dedicate to the TileDB tile cache (default 10).
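
For example, a sketch of a partitioned read; the partition indices and the TileDB config option are illustrative:

```python
import tiledbvcf

# Worker 0 of 4: read only the first of four region partitions.
cfg = tiledbvcf.ReadConfig(
    region_partition=(0, 4),
    memory_budget_mb=4096,
    tiledb_config=["vfs.s3.region=us-east-1"],
)
ds = tiledbvcf.Dataset("s3://my-bucket/my-vcf-dataset", mode="r", cfg=cfg)
```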

config_logging

config_logging(level='fatal', log_file='')

Configure tiledbvcf logging.

Parameters

Name Type Description Default
level str Log level: one of fatal, error, warn, info, debug, or trace. 'fatal'
log_file str Log file path. ''
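
For example, to enable debug logging to a local file before opening a dataset:

```python
import tiledbvcf

# Route debug-level logs to a file in the working directory.
tiledbvcf.config_logging(level="debug", log_file="tiledbvcf.log")
```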