Query Transforms

Learn about using transforms in TileDB-VCF distributed queries.

How to run this tutorial

This tutorial runs only on TileDB Cloud, which offers a free tier. We strongly recommend that you sign up and run everything there, since that requires no installation or deployment.

A VCF distributed query transform modifies the results of a VCF query in a distributed fashion. Each UDF node in the query runs the same transform function on its own results. The transform_result argument lets you supply a function that modifies the query results without first assembling the entire dataframe.

Example use cases for a query transform include:

Filter query results to remove records that are not needed in downstream analysis.
Create new columns that are derived from the original query columns.
Modify column names and column order.
Populate a new array based on the results of the query, for example with summary statistics or ML model inputs.

This tutorial shows an example of filtering VCF query results based on the value of an INFO field, using the public dataset tiledb://TileDB-Inc/vcf-1kg-dragen-v376, which you can locate on the TileDB Cloud Marketplace.

Import the necessary libraries, and set the VCF URI. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for more information on setting your TileDB Cloud credentials in a configuration object (you can omit this step inside a TileDB Cloud notebook).

Python

import tiledb.cloud.vcf as vcf
import tiledbvcf

# Set the VCF URI
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"

Open the dataset for reading, and get the sample names and the sample count.

Python

# Get the samples in the dataset
ds = tiledbvcf.Dataset(vcf_uri)
samples = ds.samples()
print(f"{len(samples):,} samples")

3,202 samples

A query transform function takes a PyArrow table as input and returns a PyArrow table.

The vcf_filter function below shows how to pass an argument to a transform function. It returns a transform_result function with the filter string already bound, so the returned function has the signature required by the vcf.read function.

The transform_result function:

Filters the query results based on the filter string.
Splits the alleles column into a ref and alt column and drops the alleles column.

Python

from functools import partial

import pyarrow as pa


# Create a function that applies a filter to the input table
# and splits alleles into ref and alt
def vcf_filter(filter: str):
    def transform_result(table: pa.Table, filter: str) -> pa.Table:
        # Convert arrow table to pandas and filter
        df = table.to_pandas().query(filter)

        # Split alleles into ref and alt
        df["ref"] = df["alleles"].str[0]
        df["alt"] = df["alleles"].apply(lambda x: ",".join(x[1:]))
        df = df.drop("alleles", axis=1)

        # Return arrow table
        return pa.Table.from_pandas(df)

    # Bind the filter string so the returned function takes only the table
    return partial(transform_result, filter=filter)

Pass the filter string to the vcf_filter transform function and submit the query.

Python

# Set the regions, attributes, and filter for the query
regions = "chr21:8220186-8221000"
attrs = [
    "sample_name",
    "contig",
    "pos_start",
    "alleles",
    "info_DP",
    "fmt_GT",
]
filter = "info_DP > 100"

# Submit the query, setting the vcf_filter function
# as a result transform
df = vcf.read(
    dataset_uri=vcf_uri,
    attrs=attrs,
    regions=regions,
    samples=samples,
    transform_result=vcf_filter(filter),
).to_pandas()
df

	sample_name	contig	pos_start	info_DP	fmt_GT	ref	alt
0	HG00100	chr21	8220178	[287]	[0, 1]	TCTCTCTCTCTCCCTCCCTCC	T
1	HG00102	chr21	8220178	[346]	[0, 1]	TCTCTCTCTCTCCCTCCCTCCCTCCCTCC	T
2	HG00119	chr21	8220178	[293]	[0, 1]	TCTCTCTCTCTCCCTCCCTCCCTCC	T
3	HG00126	chr21	8220178	[249]	[0, 1]	TCTCTCTCTCTCCCTCC	T
4	HG00133	chr21	8220178	[418]	[0, 1]	TCTCTCTCTCTCC	T
...	...	...	...	...	...	...	...
260	NA21104	chr21	8220745	[135]	[0, 1]	TTC	T
261	NA21105	chr21	8220745	[127]	[0, 1]	TTC	T
263	NA21125	chr21	8220745	[139]	[0, 1]	TTC	T
270	NA21102	chr21	8220747	[116]	[0, 1]	CTT	C
273	NA20889	chr21	8220762	[146]	[0, 1]	GCTCTCGCT	G

9626 rows × 7 columns
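The info_DP values in the result are single-element lists, and the row labels are not consecutive. A possible post-processing step, sketched here on four rows copied from the output above rather than on the full result, resets the index, unwraps the lists, and summarizes depth per position. It assumes every info_DP list holds exactly one value.

```python
import pandas as pd

# Four rows copied from the query output shown above
sample_rows = pd.DataFrame(
    {
        "sample_name": ["HG00100", "HG00102", "NA21104", "NA21105"],
        "pos_start": [8220178, 8220178, 8220745, 8220745],
        "info_DP": [[287], [346], [135], [127]],
    },
    index=[0, 1, 260, 261],
)

# Replace the non-consecutive row labels with a fresh 0..n-1 index
sample_rows = sample_rows.reset_index(drop=True)

# Unwrap each single-element list into a plain integer
sample_rows["info_DP"] = (
    sample_rows["info_DP"].apply(lambda values: values[0]).astype("int64")
)

# Count records and average depth at each position
summary = sample_rows.groupby("pos_start")["info_DP"].agg(["count", "mean"])
print(summary)
```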