This tutorial demonstrates how to create a vector index for PDF files, and search over the files using an English phrase.
How to run this tutorial
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This allows you to experiment quickly, avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
In this tutorial, you will learn how to load large collections of PDF files into a TileDB-Vector-Search index, and query them using an English phrase.
Setup
To run this tutorial, you will need an OpenAI API key. In addition, if you wish to use your local machine instead of a TileDB Cloud notebook, you will need to install the following:
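The exact package list is not spelled out in this section, so the command below is a sketch based on the libraries imported later in this tutorial (TileDB-Vector-Search, LangChain, PyMuPDF, and the OpenAI integration); package names and versions may vary for your environment.

# Assumed local setup, inferred from the imports used later in this tutorial
pip install tiledb-vector-search langchain langchain-community langchain-openai pymupdf openai numpy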
You will now ingest the invoice PDFs. The ingestion process performs the following steps:
File parsing, text extraction, and text splitting into chunks.
Text embedding generation using open source embedding models or OpenAI API calls.
Vector indexing of embeddings.
Extract text from the PDF documents and split it into text chunks using LangChain utilities.
# Parse documents and split them into text chunks
# (import paths may differ across LangChain versions)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PyMuPDFParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

# input_files_uri is the URI of the folder containing the PDF files
loader = GenericLoader.from_filesystem(
    input_files_uri,
    glob="**/*",
    suffixes=[".pdf"],
    parser=PyMuPDFParser(),
)
documents = loader.load()
print(f"Number of raw documents loaded: {len(documents)}")

# Split each document into overlapping text chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.split_documents(documents)
texts = [d.page_content for d in documents]
print(f"Number of document chunks: {len(texts)}")
Number of raw documents loaded: 971
Number of document chunks: 971
You can now generate text embeddings and index them using an IVF_FLAT index.
# NOTE: You need to set the OPENAI_API_KEY environment variable for this to work.
import numpy as np
import tiledb.vector_search as vs
from langchain_openai import OpenAIEmbeddings

# Generate embeddings for each document chunk
embedding = OpenAIEmbeddings()
text_embeddings = embedding.embed_documents(texts)

# Index the document chunk embeddings using a TileDB IVF_FLAT index
vs.ingest(
    index_type="IVF_FLAT",
    index_uri=index_uri,
    input_vectors=np.array(text_embeddings).astype(np.float32),
)
print(f"Number of vector embeddings stored in TileDB-Vector-Search: {len(text_embeddings)}")
Number of vector embeddings stored in TileDB-Vector-Search: 971
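With the index built, you can search it using an English phrase by embedding the phrase with the same OpenAI model and querying the index with the resulting vector. The following is a minimal sketch, assuming the IVFFlatIndex class exposed by tiledb.vector_search and its query method; the query phrase is a hypothetical example.

# Minimal query sketch: embed an English phrase with the same OpenAI model
# and retrieve the nearest document chunks from the IVF_FLAT index.
query_text = "How much did we spend on office supplies?"  # hypothetical example phrase
query_vector = np.array(embedding.embed_query(query_text)).astype(np.float32)

# Open the index and retrieve the 3 nearest neighbors
index = vs.IVFFlatIndex(uri=index_uri)
distances, ids = index.query(np.array([query_vector]), k=3)

# Assuming the returned ids correspond to the positions of the ingested vectors,
# map them back to the original text chunks
for chunk_id in ids[0]:
    print(texts[int(chunk_id)][:200])

Because vs.ingest was called without explicit external IDs, this sketch assumes the returned ids follow the order of the input vectors, which is what allows mapping results back to the original text chunks.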