import tiledbvcf
import tiledb.cloud.vcf as vcf
# Set the VCF URI
vcf_uri = "tiledb://TileDB-Inc/vcf-1kg-dragen-v376"
Query Transforms
You can run this tutorial only on TileDB Cloud, which has a free tier. We recommend that you sign up and run everything there, as that requires no installation or deployment.
A VCF distributed query transform modifies the results of a VCF query in a distributed fashion: each UDF node in the VCF query runs the same transform function on its own results. The transform argument lets you supply a function that modifies the VCF query results without first assembling the entire dataframe.
Example use models for the query transform are as follows:
- Filter query results to remove records that are not needed in downstream analysis.
- Create new columns that are derived from the original query columns.
- Modify column names and column order.
- Populate a new array based on the results of the query. For example, summary statistics or ML model inputs.
This tutorial shows an example of filtering VCF query results based on the value of an INFO field, using the public dataset tiledb://TileDB-Inc/vcf-1kg-dragen-v376, which you can locate on the TileDB Cloud Marketplace.
Import the necessary libraries and set the VCF URI. If you are running this from a local notebook, see Tutorials: Basic TileDB Cloud for how to set your TileDB Cloud credentials in a configuration object (you can omit this step inside a TileDB Cloud notebook).
Open the dataset for reading, and get the sample names and the number of samples.
A query transform filter takes a PyArrow table as input and returns a PyArrow table.
The vcf_filter function below shows how to pass an argument to a transform function. The vcf_filter function returns a transform_result function with the signature required by the vcf.read function.
The transform_result function:
- Filters the query results based on the filter string.
- Splits the alleles column into a ref and an alt column, and drops the alleles column.
Pass the filter string to the vcf_filter transform function and submit the query.