Learn how to extend TileDB-VCF with the help of TileDB arrays to perform complex queries with this tutorial.
How to run this tutorial
You can run this tutorial in two ways:
Locally on your machine.
On TileDB Cloud.
However, since TileDB Cloud has a free tier, we strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
TileDB-VCF ingests variant data into a 3D array for rapid access. One of the three dimensions corresponds to the sample_name of the variant data. This lets consumers query their dataset by one or more samples and extract attributes for those samples.
Sometimes, for more complex experiments, you may need to build sample cohorts before querying a tiledbvcf.Dataset. For example, a researcher may want to pull all variant data from a tiledbvcf.Dataset related to a particular patient ID, where a patient may have a 1-to-1 or 1-to-many relationship with samples. Because that information isn’t stored in VCF files and, as a result, is not included in a tiledbvcf.Dataset, you can use an accompanying array that relates patient ID to sample_name. With this setup, you first query the array for all samples belonging to a patient and then use those samples to query the tiledbvcf.Dataset.
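The two-step lookup described above can be sketched in plain Python before any TileDB code is involved. This is only an illustration of the control flow: the patient IDs, sample names, and variant rows are hypothetical stand-ins, and the in-memory lists take the place of the metadata array and the tiledbvcf.Dataset.

```python
# Hypothetical patient-to-sample records; a patient may own one or many samples.
patient_samples = [
    ("patient_A", "sample_001"),
    ("patient_A", "sample_002"),
    ("patient_B", "sample_003"),
]

# Stand-in for variant records keyed by sample_name (illustrative values only).
variants = [
    {"sample_name": "sample_001", "contig": "chr1", "pos_start": 100},
    {"sample_name": "sample_002", "contig": "chr1", "pos_start": 250},
    {"sample_name": "sample_003", "contig": "chr2", "pos_start": 75},
]


def samples_for_patient(records, patient_id):
    """Step 1: resolve a patient ID to its sample names."""
    return [sample for patient, sample in records if patient == patient_id]


def variants_for_samples(rows, samples):
    """Step 2: keep only the variant rows whose sample_name is in the cohort."""
    wanted = set(samples)
    return [row for row in rows if row["sample_name"] in wanted]


cohort = samples_for_patient(patient_samples, "patient_A")
cohort_variants = variants_for_samples(variants, cohort)
print(cohort)
print(len(cohort_variants))
```

In the real workflow, step 1 becomes a query against the metadata array and step 2 becomes a read from the tiledbvcf.Dataset restricted to the resulting sample names.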
In this tutorial, you’ll learn how to extend TileDB-VCF with the help of TileDB arrays to perform complex queries.
First, pull a public TileDB-VCF dataset and choose some random samples for this tutorial. The TileDB team has made available several public TileDB-VCF datasets.
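Choosing the random samples can be done with the standard library once the sample names are available as a list. The sketch below assumes the names come from the dataset (for example from the samples() method of a tiledbvcf.Dataset opened in read mode); here a hypothetical list stands in for them, and the helper name pick_samples is made up for this example.

```python
import random


def pick_samples(all_samples, k, seed=0):
    """Return k sample names chosen without replacement, reproducibly."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(all_samples), k)


# Stand-in for the sample names stored in the dataset; names are hypothetical.
all_samples = [f"sample_{i:03d}" for i in range(20)]

chosen = pick_samples(all_samples, k=5, seed=42)
print(chosen)
```

Fixing the seed keeps the chosen samples stable between runs, which makes the later steps of the tutorial easier to compare.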
Now that you have a tiledbvcf.Dataset and some example sample names, simulate some corresponding metadata. This metadata will include other observations related to those samples such as parent, weight, and whether the sample is a pure breed. First create a pandas.DataFrame and then convert that to an array.
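A minimal sketch of that step is shown below. The sample names, the Weight and PureBreed column names, the values, and the helper name write_metadata_array are all hypothetical; only the Sample and Parent names and the X400 parent come from this tutorial. It also assumes that Sample should become the array dimension, which requires a sparse array because the dimension is a string.

```python
# Simulated metadata rows; every value here is made up for illustration.
rows = [
    {"Sample": "sample_001", "Parent": "X400", "Weight": 21.5, "PureBreed": True},
    {"Sample": "sample_002", "Parent": "X400", "Weight": 18.2, "PureBreed": False},
    {"Sample": "sample_003", "Parent": "X517", "Weight": 30.1, "PureBreed": True},
    {"Sample": "sample_004", "Parent": "X517", "Weight": 25.7, "PureBreed": False},
]


def write_metadata_array(uri, rows):
    """Build a pandas.DataFrame from the rows and store it as a TileDB array."""
    # Imported here so the rows above can be inspected without these packages.
    import pandas as pd
    import tiledb

    df = pd.DataFrame(rows)
    # No dtypes are passed, so TileDB infers the schema from the DataFrame.
    # Assumption: Sample is the dimension; string dimensions need sparse=True.
    tiledb.from_pandas(uri, df, sparse=True, index_dims=["Sample"])
    return df
```

Calling write_metadata_array with a new local path or TileDB Cloud URI creates the array; opening it afterwards with tiledb.open and printing its schema shows the inferred types.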
Note that during the conversion, TileDB inferred the data types in the metadata array schema because the caller did not define them explicitly. Most data types should look familiar, with the exception of the Parent attribute, which is of type <U0. TileDB uses this NumPy dtype to represent a variable-length Unicode string; see the NumPy documentation on data types for the notation.
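import tiledb

# example parent we'd like to get samples for
query_parent = "X400"

# metadata_uri is the URI of the metadata array created in the previous step
with tiledb.open(metadata_uri, "r") as Ar:
    # query the metadata array for the samples that have this parent
    sample_result = Ar.query(cond=f"Parent == '{query_parent}'")[:]

# sample names come back as bytes, so decode them to str
samples = [s.decode() for s in sample_result["Sample"]]
print(samples)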
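With the sample names in hand, the second step is a read from the tiledbvcf.Dataset restricted to those samples. The sketch below assumes ds is a tiledbvcf.Dataset opened in read mode and uses its read method with the attrs, samples, and regions arguments. The helper name read_cohort and the chosen attribute list are illustrative, not part of the TileDB-VCF API.

```python
def read_cohort(ds, samples, regions):
    """Read variants for a list of sample names.

    `ds` is assumed to be a tiledbvcf.Dataset opened in read mode.
    `regions` are strings in contig:start-end form.
    """
    # Guard against an empty cohort so that a failed metadata lookup
    # does not silently turn into an unintended read.
    if not samples:
        raise ValueError("no samples matched the metadata query")
    return ds.read(
        attrs=["sample_name", "contig", "pos_start", "pos_end"],
        samples=samples,
        regions=regions,
    )
```

For example, read_cohort(ds, samples, ["chr1:1-1000"]) would return the variants in that (hypothetical) region for the samples found in the metadata array.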
As demonstrated here, storing metadata in a separate array is a useful strategy for querying a tiledbvcf.Dataset with supporting data that the TileDB-VCF API does not store itself.