import os
import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import tiledbvcf
# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

# Public URI dataset to be used in this tutorial
vcf_uri = "tiledb://TileDB-Inc/SARS-CoV-2"

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
Embedded Annotations
This tutorial runs only on TileDB Cloud, which has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installation or deployment.
TileDB-VCF faithfully represents annotations that are embedded in the INFO_ or FMT_ fields of VCFs. These attributes are not directly queryable in the TileDB-VCF API, but you can access them or filter them downstream using pandas or other data manipulation tools.
This tutorial describes how to search for frameshifts recorded in the embedded SnpEff annotation (info_EFF) of select SARS-CoV-2 samples located at tiledb://TileDB-Inc/SARS-CoV-2. These annotations are in the VCFs provided by the ACTIV TRACE initiative.
Import the necessary libraries, and set the URI that will be used in this tutorial. If you are running this from a local notebook, visit Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (you can omit this step inside a TileDB Cloud notebook).
Prepare the SARS dataset for reading, and view the available attributes in the dataset.
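As a minimal sketch of this step, the snippet below opens a dataset in read mode and picks out the attributes that carry embedded annotations. The helper names `open_dataset` and `embedded_annotation_attributes` are hypothetical, and passing the token through `tiledbvcf.ReadConfig` is an assumption about how credentials are supplied outside a TileDB Cloud notebook. `tiledbvcf` is imported inside the function so the helpers can be defined even where the package is not installed.

```python
def open_dataset(uri, token=None):
    """Open a TileDB-VCF dataset for reading (hypothetical helper)."""
    # Imported lazily so this module loads without tiledbvcf installed.
    import tiledbvcf

    if token is None:
        # Inside a TileDB Cloud notebook, credentials are assumed to be set already.
        return tiledbvcf.Dataset(uri, mode="r")

    # Assumption: the REST token is passed as a "key=value" TileDB config string.
    cfg = tiledbvcf.ReadConfig(tiledb_config=[f"rest.token={token}"])
    return tiledbvcf.Dataset(uri, mode="r", cfg=cfg)


def embedded_annotation_attributes(ds):
    """Return attribute names that hold INFO or FORMAT annotations.

    `ds` is any object with an `attributes()` method that returns
    attribute names, such as a tiledbvcf.Dataset.
    """
    return [name for name in ds.attributes() if name.startswith(("info_", "fmt_"))]
```

With the variables defined earlier, this could be called as `ds = open_dataset(vcf_uri, tiledb_token)` followed by `embedded_annotation_attributes(ds)`; the returned list should include `info_EFF` if the dataset carries the SnpEff annotation.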
Select the attributes that contain the SnpEFF annotations, and query a select group of samples and a limited genomic range.
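The query itself might look like the sketch below, which requests the `info_EFF` attribute alongside a few positional attributes for a chosen set of samples and one region. The helper names `format_region` and `read_snpeff` are hypothetical, and the exact attribute list is an assumption; adjust it to whatever the dataset's attribute listing reports.

```python
def format_region(contig, start, end):
    """Build a 'contig:start-end' region string (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{contig}:{start}-{end}"


def read_snpeff(ds, samples, region):
    """Read SnpEff annotations for the given samples and region.

    `ds` is expected to be an open tiledbvcf.Dataset; the attribute
    list below is an assumption, not the tutorial's exact selection.
    """
    return ds.read(
        attrs=["sample_name", "contig", "pos_start", "alleles", "info_EFF"],
        samples=list(samples),
        regions=[region],
    )
```

For example, `read_snpeff(ds, ds.samples()[:5], format_region("NC_045512.2", 21563, 25384))` would restrict the read to the first five samples and a single window; the contig name and coordinates here are illustrative assumptions and should be checked against the dataset.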
Use a string wildcard search provided by pandas to select for NON_SYNONYMOUS variants.
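A minimal sketch of that filter, using `Series.str.contains` on a small hand-made DataFrame rather than the real query result. The rows and the abbreviated annotation strings are illustrative only, and the sketch assumes `info_EFF` can be treated as text, which may not match how the column is returned in practice.

```python
import pandas as pd

# Illustrative rows standing in for the result of a TileDB-VCF read;
# the annotation strings are abbreviated examples, not real query output.
df = pd.DataFrame(
    {
        "sample_name": ["s1", "s2", "s3"],
        "pos_start": [23403, 3037, 28881],
        "info_EFF": [
            "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gAt/gGt|D614G|1273|S)",
            "SYNONYMOUS_CODING(LOW|SILENT|ttC/ttT|F924F|7096|ORF1ab)",
            None,
        ],
    }
)

# Cast to text so non-string values do not raise, match a literal
# substring, and treat missing values as non-matches.
mask = (
    df["info_EFF"]
    .astype(str)
    .str.contains("NON_SYNONYMOUS", regex=False, na=False)
)
non_synonymous = df[mask]
print(non_synonymous[["sample_name", "pos_start"]])
```

The same pattern applies to other effect names: replacing the search string with another SnpEff effect label selects those variants instead.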