Ingesting Annotations

life sciences

genomics (vcf)

tutorials

annotations

Learn about ingesting tabular annotations into TileDB arrays.

How to run this tutorial

You can run this tutorial only on TileDB Cloud. However, TileDB Cloud has a free tier. We strongly recommend that you sign up and run everything there, as that requires no installations or deployment.

This tutorial demonstrates how annotation that is tabular in nature, from sample metadata or variant annotation, can be ingested into generic TileDB arrays by using the from_csv or from_pandas methods.

This is the method used to ingest VEP and SnpEFF annotations into TileDB arrays. While creating the source tab-delimited files is beyond the scope of this tutorial, the following steps can be used for a number of external annotation approaches:

Produce a sampleless, variant-only VCF file from the allele_count auxiliary array member of a TileDB-VCF dataset.
Annotate the sampleless VCF file with VEP or SnpEFF.
Convert the annotated VCF file to a comma-separated (CSV) or tab-delimited (TSV) file.
Import the CSV (or TSV) into pandas, and make the necessary data type conversions.
Use the from_pandas method to convert the pandas DataFrame to a TileDB array.

Import the necessary libraries, and set the URIs that will be used in this tutorial. If you are running this from a local notebook, visit the Tutorials: Basic TileDB Cloud for more information on how to set your TileDB Cloud credentials in a configuration object (this step can be omitted inside a TileDB Cloud notebook).

Python

import os

import numpy as np
import pandas as pd
import tiledb
import tiledb.cloud
import tiledb.cloud.vcf

# Get your credentials
tiledb_token = os.environ["TILEDB_REST_TOKEN"]
# or use your username and password (not recommended)
# tiledb_username = os.environ["TILEDB_USERNAME"]
# tiledb_password = os.environ["TILEDB_PASSWORD"]

tsv_uri = "tiledb://TileDB-Inc/vep-chr1-10k-tsv-gz"
tsv_name = "vep-chr1-10k.tsv.gz"
array_name = "my_vep_array"

# Clean up
if tiledb.object_type(array_name) == "array":
    tiledb.remove(array_name)
if os.path.exists(tsv_name):
    os.remove(tsv_name)

# Log into TileDB Cloud
tiledb.cloud.login(token=tiledb_token)
# or use your username and password (not recommended)
# tiledb.cloud.login(username=tiledb_username, password=tiledb_password)

Fetch a pre-made, tab-delimited VEP annotation file from the TileDB Cloud.

Python

tiledb.cloud.files.utils.export_file(uri=tsv_uri, output_uri=tsv_name)

Select the columns to ingest.

Python

vep_cols = [
    "CHROM",
    "POS",
    "REF",
    "ALT",
    "Gene",
    "Feature",
    "Feature_type",
    "Consequence",
    "cDNA_position",
    "CDS_position",
    "Protein_position",
    "Amino_acids",
    "Codons",
    "Existing_variation",
    "IMPACT",
    "SYMBOL",
    "BIOTYPE",
    "EXON",
    "INTRON",
    "HGVSc",
    "HGVSp",
    "HGVSg",
    "HGVS_OFFSET",
    "REF_ALLELE",
    "DISTANCE",
    "STRAND",
    "FLAGS",
    "VARIANT_CLASS",
    "SYMBOL_SOURCE",
    "HGNC_ID",
    "CCDS",
    "gnomADg_AF",
    "gnomADg_AFR_AF",
    "gnomADg_AMI_AF",
    "gnomADg_AMR_AF",
    "gnomADg_ASJ_AF",
    "gnomADg_EAS_AF",
    "gnomADg_FIN_AF",
    "gnomADg_MID_AF",
    "gnomADg_NFE_AF",
    "gnomADg_OTH_AF",
    "gnomADg_SAS_AF",
    "CLIN_SIG",
    "SOMATIC",
    "PHENO",
]

Read the tab-delimited file into a pandas DataFrame.

Python

df = pd.read_csv(
    tsv_name,
    comment="#",
    sep="\t",
    usecols=vep_cols,
    compression="gzip",
    dtype=pd.StringDtype(),
)
df

	CHROM	POS	REF	ALT	Consequence	IMPACT	SYMBOL	Gene	Feature_type	Feature	...	gnomADg_ASJ_AF	gnomADg_EAS_AF	gnomADg_FIN_AF	gnomADg_MID_AF	gnomADg_NFE_AF	gnomADg_OTH_AF	gnomADg_SAS_AF	CLIN_SIG	SOMATIC	PHENO
0	chr1	10097	T	A	upstream_gene_variant	MODIFIER	DDX11L2	ENSG00000290825	Transcript	ENST00000456328	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
1	chr1	10109	A	T	upstream_gene_variant	MODIFIER	DDX11L2	ENSG00000290825	Transcript	ENST00000456328	...	0	0	0	0	0	0	0	<NA>	<NA>	<NA>
2	chr1	10113	CT	C	upstream_gene_variant	MODIFIER	DDX11L2	ENSG00000290825	Transcript	ENST00000456328	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
3	chr1	10146	ACCCCT	A	upstream_gene_variant	MODIFIER	DDX11L2	ENSG00000290825	Transcript	ENST00000456328	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
4	chr1	10159	A	AC	upstream_gene_variant	MODIFIER	DDX11L2	ENSG00000290825	Transcript	ENST00000456328	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9986	chr1	664707	T	C	intron_variant&non_coding_transcript_variant	MODIFIER	<NA>	ENSG00000230021	Transcript	ENST00000635509	...	0	0.0002242	0	0	0	0	0	<NA>	<NA>	<NA>
9987	chr1	664724	G	A	intron_variant&non_coding_transcript_variant	MODIFIER	<NA>	ENSG00000230021	Transcript	ENST00000635509	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
9988	chr1	664772	G	T	intron_variant&non_coding_transcript_variant	MODIFIER	<NA>	ENSG00000230021	Transcript	ENST00000635509	...	0	0	0	0	0	0	0	<NA>	<NA>	<NA>
9989	chr1	664793	G	A	intron_variant&non_coding_transcript_variant	MODIFIER	<NA>	ENSG00000230021	Transcript	ENST00000635509	...	0	0.0004803	0	0	8.331e-05	0	0	<NA>	<NA>	<NA>
9990	chr1	664808	A	G	intron_variant&non_coding_transcript_variant	MODIFIER	<NA>	ENSG00000230021	Transcript	ENST00000635509	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>

9991 rows × 45 columns

Design an array schema, and create the TileDB array.

Python

ctx = tiledb.cloud.Ctx()
zstd_filter = tiledb.FilterList([tiledb.ZstdFilter()])
double_delta_zstd_filter = tiledb.FilterList(
    [tiledb.DoubleDeltaFilter(), tiledb.ZstdFilter()]
)

schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        *[
            tiledb.Dim(
                name="contig",
                domain=(None, None),
                tile="None",
                dtype="|S0",
                var=True,
                filters=zstd_filter,
            ),
            tiledb.Dim(
                name="pos_start",
                domain=(0, np.iinfo(np.uint32).max - 1),
                tile=np.iinfo(np.uint32).max,
                dtype="uint32",
                filters=double_delta_zstd_filter,
            ),
        ]
    ),
    attrs=[
        tiledb.Attr(
            name="ref",
            dtype="ascii",
            var=True,
            nullable=False,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="alt",
            dtype="ascii",
            var=True,
            nullable=False,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Gene",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Feature",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Feature_type",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Consequence",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="cDNA_position",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="CDS_position",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Protein_position",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Amino_acids",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Codons",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="Existing_variation",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="IMPACT",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="SYMBOL",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="BIOTYPE",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="EXON",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="INTRON",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="HGVSc",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="HGVSp",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="HGVSg",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="HGVS_OFFSET",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="REF_ALLELE",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="DISTANCE",
            dtype="int32",
            var=False,
            nullable=True,
            filters=double_delta_zstd_filter,
        ),
        tiledb.Attr(
            name="STRAND",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="FLAGS",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="VARIANT_CLASS",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="SYMBOL_SOURCE",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="HGNC_ID",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="CCDS",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_AFR_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_AMI_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_AMR_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_ASJ_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_EAS_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_FIN_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_MID_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_NFE_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_OTH_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="gnomADg_SAS_AF",
            dtype=np.float32,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="CLIN_SIG",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="SOMATIC",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
        tiledb.Attr(
            name="PHENO",
            dtype="ascii",
            var=True,
            nullable=True,
            filters=zstd_filter,
        ),
    ],
    cell_order="row-major",
    tile_order="row-major",
    capacity=10000,
    sparse=True,
    allows_duplicates=True,
    ctx=ctx,
)

# Create array
tiledb.Array.create(array_name, schema)

Clean the DataFrame, casting pandas data types in order to conform to the TileDB schema data types.

Python

df.rename(
    columns={"CHROM": "contig", "POS": "pos_start", "REF": "ref", "ALT": "alt"},
    inplace=True,
)
df["contig"] = df["contig"].astype(str)
df["pos_start"] = df["pos_start"].astype(np.uint32)
df["ref"] = df["ref"].astype(str)
df["alt"] = df["alt"].astype(str)

df["DISTANCE"] = df["DISTANCE"].astype(object).fillna(0).astype(np.int32)

for gnoval in [
    "gnomADg_AF",
    "gnomADg_AFR_AF",
    "gnomADg_AMI_AF",
    "gnomADg_AMR_AF",
    "gnomADg_ASJ_AF",
    "gnomADg_EAS_AF",
    "gnomADg_FIN_AF",
    "gnomADg_MID_AF",
    "gnomADg_NFE_AF",
    "gnomADg_OTH_AF",
    "gnomADg_SAS_AF",
]:
    df[gnoval] = df[gnoval].astype(str).str.replace("e-", "E-")
    df[gnoval] = df[gnoval].replace("<NA>", np.nan)
    df[gnoval] = df[gnoval].astype(object).astype(np.float32)

Set the index of the DataFrame to the match the TileDB dimensions.

Python

vep_dims = ["contig", "pos_start"]
df = df.set_index(["contig", "pos_start"])

Ingest the DataFrame into the TileDB array using from_pandas.

Python

tiledb.from_pandas(array_name, df, mode="append", index_dims=vep_dims, ctx=ctx)

Filter the TileDB array for variants with the gnomAD global allele frequency of less than 0.0001.

Python

my_vep = tiledb.open(array_name)
res = my_vep.query(cond="(gnomADg_AF < 0.0001)")
res.df[:]

	contig	pos_start	ref	alt	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	...	gnomADg_ASJ_AF	gnomADg_EAS_AF	gnomADg_FIN_AF	gnomADg_MID_AF	gnomADg_NFE_AF	gnomADg_OTH_AF	gnomADg_SAS_AF	CLIN_SIG	SOMATIC	PHENO
0	chr1	10109	A	T	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
1	chr1	10228	TA	T	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2	chr1	10243	A	G	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
3	chr1	10254	TA	CA	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	NaN	0.000000	0.0	NaN	0.000000	NaN	0.0	None	None	None
4	chr1	10292	A	T	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2389	chr1	664602	A	G	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2390	chr1	664652	T	A	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2391	chr1	664707	T	C	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000224	0.0	0.0	0.000000	0.0	0.0	None	None	None
2392	chr1	664772	G	T	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2393	chr1	664793	G	A	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000480	0.0	0.0	0.000083	0.0	0.0	None	None	None

2394 rows × 45 columns

Clean up the TileDB array and TSV file.

Python

my_vep = tiledb.open(array_name)
res = my_vep.query(cond="(gnomADg_AF < 0.0001)")
res.df[:]

	contig	pos_start	ref	alt	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	...	gnomADg_ASJ_AF	gnomADg_EAS_AF	gnomADg_FIN_AF	gnomADg_MID_AF	gnomADg_NFE_AF	gnomADg_OTH_AF	gnomADg_SAS_AF	CLIN_SIG	SOMATIC	PHENO
0	chr1	10109	A	T	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
1	chr1	10228	TA	T	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2	chr1	10243	A	G	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
3	chr1	10254	TA	CA	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	NaN	0.000000	0.0	NaN	0.000000	NaN	0.0	None	None	None
4	chr1	10292	A	T	ENSG00000290825	ENST00000456328	Transcript	upstream_gene_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2389	chr1	664602	A	G	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2390	chr1	664652	T	A	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2391	chr1	664707	T	C	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000224	0.0	0.0	0.000000	0.0	0.0	None	None	None
2392	chr1	664772	G	T	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000000	0.0	0.0	0.000000	0.0	0.0	None	None	None
2393	chr1	664793	G	A	ENSG00000230021	ENST00000635509	Transcript	intron_variant&non_coding_transcript_variant	None	None	...	0.0	0.000480	0.0	0.0	0.000083	0.0	0.0	None	None	None

2394 rows × 45 columns