import os
import tiledb
import tiledb.cloud
import tiledb.cloud.vcf
import pandas as pd
import numpy as np
# Get your credentials
= os.environ["TILEDB_REST_TOKEN"]
tiledb_token # or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]
= "tiledb://TileDB-Inc/vep-chr1-10k-tsv-gz"
tsv_uri = "vep-chr1-10k.tsv.gz"
tsv_name = "my_vep_array"
array_name
# Clean up
if tiledb.object_type(array_name) == "array":
tiledb.remove(array_name)if os.path.exists(tsv_name):
os.remove(tsv_name)
# Log into TileDB Cloud
=tiledb_token)
tiledb.cloud.login(token# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)
Ingesting Annotations
You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installations or deployment.
This tutorial demonstrates how annotation that is tabular in nature, from sample metadata or variant annotation, can be ingested into generic TileDB arrays by using the from_csv
or from_pandas
methods.
This is the method used to ingest VEP and SnpEFF annotations into TileDB arrays. While creating the source tab-delimited files is beyond the scope of this tutorial, the following steps can be used for a number of external annotation approaches:
- Produce a sampleless, variant-only VCF file from the
allele_count
auxiliary array member of a TileDB-VCF dataset. - Annotate the sampleless VCF file with VEP or SnpEFF.
- Convert the annotated VCF file to a comma-separated (CSV) or tab-delimited (TSV) file.
- Import the CSV (or TSV) into pandas, and make the necessary data type conversions.
- Use the
from_pandas
method to convert the pandas DataFrame to a TileDB array.
Import the necessary libraries, and set the URIs that will be used in this tutorial. If you are running this from a local notebook, visit the Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (this step can be omitted inside a TileDB Cloud notebook).
Fetch a pre-made, tab-delimited VEP annotation file from the TileDB Cloud.
Select the columns to ingest.
vep_cols = [
"CHROM",
"POS",
"REF",
"ALT",
"Gene",
"Feature",
"Feature_type",
"Consequence",
"cDNA_position",
"CDS_position",
"Protein_position",
"Amino_acids",
"Codons",
"Existing_variation",
"IMPACT",
"SYMBOL",
"BIOTYPE",
"EXON",
"INTRON",
"HGVSc",
"HGVSp",
"HGVSg",
"HGVS_OFFSET",
"REF_ALLELE",
"DISTANCE",
"STRAND",
"FLAGS",
"VARIANT_CLASS",
"SYMBOL_SOURCE",
"HGNC_ID",
"CCDS",
"gnomADg_AF",
"gnomADg_AFR_AF",
"gnomADg_AMI_AF",
"gnomADg_AMR_AF",
"gnomADg_ASJ_AF",
"gnomADg_EAS_AF",
"gnomADg_FIN_AF",
"gnomADg_MID_AF",
"gnomADg_NFE_AF",
"gnomADg_OTH_AF",
"gnomADg_SAS_AF",
"CLIN_SIG",
"SOMATIC",
"PHENO",
]
Read the tab-delimited file into a pandas DataFrame.
df = pd.read_csv(
tsv_name,
comment="#",
sep="\t",
usecols=vep_cols,
compression="gzip",
dtype=pd.StringDtype(),
)
df
CHROM | POS | REF | ALT | Consequence | IMPACT | SYMBOL | Gene | Feature_type | Feature | ... | gnomADg_ASJ_AF | gnomADg_EAS_AF | gnomADg_FIN_AF | gnomADg_MID_AF | gnomADg_NFE_AF | gnomADg_OTH_AF | gnomADg_SAS_AF | CLIN_SIG | SOMATIC | PHENO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10097 | T | A | upstream_gene_variant | MODIFIER | DDX11L2 | ENSG00000290825 | Transcript | ENST00000456328 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
1 | chr1 | 10109 | A | T | upstream_gene_variant | MODIFIER | DDX11L2 | ENSG00000290825 | Transcript | ENST00000456328 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | <NA> | <NA> | <NA> |
2 | chr1 | 10113 | CT | C | upstream_gene_variant | MODIFIER | DDX11L2 | ENSG00000290825 | Transcript | ENST00000456328 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
3 | chr1 | 10146 | ACCCCT | A | upstream_gene_variant | MODIFIER | DDX11L2 | ENSG00000290825 | Transcript | ENST00000456328 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
4 | chr1 | 10159 | A | AC | upstream_gene_variant | MODIFIER | DDX11L2 | ENSG00000290825 | Transcript | ENST00000456328 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9986 | chr1 | 664707 | T | C | intron_variant&non_coding_transcript_variant | MODIFIER | <NA> | ENSG00000230021 | Transcript | ENST00000635509 | ... | 0 | 0.0002242 | 0 | 0 | 0 | 0 | 0 | <NA> | <NA> | <NA> |
9987 | chr1 | 664724 | G | A | intron_variant&non_coding_transcript_variant | MODIFIER | <NA> | ENSG00000230021 | Transcript | ENST00000635509 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
9988 | chr1 | 664772 | G | T | intron_variant&non_coding_transcript_variant | MODIFIER | <NA> | ENSG00000230021 | Transcript | ENST00000635509 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | <NA> | <NA> | <NA> |
9989 | chr1 | 664793 | G | A | intron_variant&non_coding_transcript_variant | MODIFIER | <NA> | ENSG00000230021 | Transcript | ENST00000635509 | ... | 0 | 0.0004803 | 0 | 0 | 8.331e-05 | 0 | 0 | <NA> | <NA> | <NA> |
9990 | chr1 | 664808 | A | G | intron_variant&non_coding_transcript_variant | MODIFIER | <NA> | ENSG00000230021 | Transcript | ENST00000635509 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
9991 rows × 45 columns
Design an array schema, and create the TileDB array.
ctx = tiledb.cloud.Ctx()
zstd_filter = tiledb.FilterList([tiledb.ZstdFilter()])
double_delta_zstd_filter = tiledb.FilterList(
[tiledb.DoubleDeltaFilter(), tiledb.ZstdFilter()]
)
schema = tiledb.ArraySchema(
domain=tiledb.Domain(
*[
tiledb.Dim(
name="contig",
domain=(None, None),
tile="None",
dtype="|S0",
var=True,
filters=zstd_filter,
),
tiledb.Dim(
name="pos_start",
domain=(0, np.iinfo(np.uint32).max - 1),
tile=np.iinfo(np.uint32).max,
dtype="uint32",
filters=double_delta_zstd_filter,
),
]
),
attrs=[
tiledb.Attr(
name="ref",
dtype="ascii",
var=True,
nullable=False,
filters=zstd_filter,
),
tiledb.Attr(
name="alt",
dtype="ascii",
var=True,
nullable=False,
filters=zstd_filter,
),
tiledb.Attr(
name="Gene",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Feature",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Feature_type",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Consequence",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="cDNA_position",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="CDS_position",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Protein_position",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Amino_acids",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Codons",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="Existing_variation",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="IMPACT",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="SYMBOL",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="BIOTYPE",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="EXON",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="INTRON",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="HGVSc",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="HGVSp",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="HGVSg",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="HGVS_OFFSET",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="REF_ALLELE",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="DISTANCE",
dtype="int32",
var=False,
nullable=True,
filters=double_delta_zstd_filter,
),
tiledb.Attr(
name="STRAND",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="FLAGS",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="VARIANT_CLASS",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="SYMBOL_SOURCE",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="HGNC_ID",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="CCDS",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_AFR_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_AMI_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_AMR_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_ASJ_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_EAS_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_FIN_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_MID_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_NFE_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_OTH_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="gnomADg_SAS_AF",
dtype=np.float32,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="CLIN_SIG",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="SOMATIC",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
tiledb.Attr(
name="PHENO",
dtype="ascii",
var=True,
nullable=True,
filters=zstd_filter,
),
],
cell_order="row-major",
tile_order="row-major",
capacity=10000,
sparse=True,
allows_duplicates=True,
ctx=ctx,
)
# Create array
tiledb.Array.create(array_name, schema)
Clean the DataFrame, casting pandas data types in order to conform to the TileDB schema data types.
df.rename(
columns={"CHROM": "contig", "POS": "pos_start", "REF": "ref", "ALT": "alt"},
inplace=True,
)
df["contig"] = df["contig"].astype(str)
df["pos_start"] = df["pos_start"].astype(np.uint32)
df["ref"] = df["ref"].astype(str)
df["alt"] = df["alt"].astype(str)
df["DISTANCE"] = df["DISTANCE"].astype(object).fillna(0).astype(np.int32)
for gnoval in [
"gnomADg_AF",
"gnomADg_AFR_AF",
"gnomADg_AMI_AF",
"gnomADg_AMR_AF",
"gnomADg_ASJ_AF",
"gnomADg_EAS_AF",
"gnomADg_FIN_AF",
"gnomADg_MID_AF",
"gnomADg_NFE_AF",
"gnomADg_OTH_AF",
"gnomADg_SAS_AF",
]:
df[gnoval] = df[gnoval].astype(str).str.replace("e-", "E-")
df[gnoval] = df[gnoval].replace("<NA>", np.nan)
df[gnoval] = df[gnoval].astype(object).astype(np.float32)
Set the index of the DataFrame to the match the TileDB dimensions.
Ingest the DataFrame into the TileDB array using from_pandas
.
Filter the TileDB array for variants with the gnomAD global allele frequency of less than 0.0001.
contig | pos_start | ref | alt | Gene | Feature | Feature_type | Consequence | cDNA_position | CDS_position | ... | gnomADg_ASJ_AF | gnomADg_EAS_AF | gnomADg_FIN_AF | gnomADg_MID_AF | gnomADg_NFE_AF | gnomADg_OTH_AF | gnomADg_SAS_AF | CLIN_SIG | SOMATIC | PHENO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10109 | A | T | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
1 | chr1 | 10228 | TA | T | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2 | chr1 | 10243 | A | G | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
3 | chr1 | 10254 | TA | CA | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | NaN | 0.000000 | 0.0 | NaN | 0.000000 | NaN | 0.0 | None | None | None |
4 | chr1 | 10292 | A | T | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2389 | chr1 | 664602 | A | G | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2390 | chr1 | 664652 | T | A | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2391 | chr1 | 664707 | T | C | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000224 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2392 | chr1 | 664772 | G | T | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2393 | chr1 | 664793 | G | A | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000480 | 0.0 | 0.0 | 0.000083 | 0.0 | 0.0 | None | None | None |
2394 rows × 45 columns
Clean up the TileDB array and TSV file.
contig | pos_start | ref | alt | Gene | Feature | Feature_type | Consequence | cDNA_position | CDS_position | ... | gnomADg_ASJ_AF | gnomADg_EAS_AF | gnomADg_FIN_AF | gnomADg_MID_AF | gnomADg_NFE_AF | gnomADg_OTH_AF | gnomADg_SAS_AF | CLIN_SIG | SOMATIC | PHENO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 10109 | A | T | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
1 | chr1 | 10228 | TA | T | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2 | chr1 | 10243 | A | G | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
3 | chr1 | 10254 | TA | CA | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | NaN | 0.000000 | 0.0 | NaN | 0.000000 | NaN | 0.0 | None | None | None |
4 | chr1 | 10292 | A | T | ENSG00000290825 | ENST00000456328 | Transcript | upstream_gene_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2389 | chr1 | 664602 | A | G | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2390 | chr1 | 664652 | T | A | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2391 | chr1 | 664707 | T | C | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000224 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2392 | chr1 | 664772 | G | T | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | None | None | None |
2393 | chr1 | 664793 | G | A | ENSG00000230021 | ENST00000635509 | Transcript | intron_variant&non_coding_transcript_variant | None | None | ... | 0.0 | 0.000480 | 0.0 | 0.0 | 0.000083 | 0.0 | 0.0 | None | None | None |
2394 rows × 45 columns