RAG LLM
We recommend running this tutorial, as well as the other tutorials in the Tutorials section, inside TileDB Cloud. This will allow you to quickly experiment, avoiding all the installation, deployment, and configuration hassles. Sign up for the free tier, spin up a TileDB Cloud notebook with a Python kernel, and follow the tutorial instructions. If you wish to learn how to run tutorials locally on your machine, read the Tutorials: Running Locally tutorial.
One of the limitations of LLMs is that their knowledge extends only to the data that were used during their training. Public training datasets are missing private and proprietary information required for enterprise applications. They are also missing information about the world and events that occurred after the dataset was created. This problem affects all types of LLMs, including public models, proprietary models, and even those deployed and used locally (e.g., in sensitive enterprise applications).
In this tutorial, you will use TileDB-Vector-Search to allow the gpt-3.5-turbo model to answer questions about LangChain. Most ChatGPT models have limited world knowledge after 2021, their training data cutoff date, whereas LangChain was created and became popular after 2021. Although this tutorial uses ChatGPT 3.5, the example can easily be extended to other LLMs. You will augment gpt-3.5-turbo with TileDB-Vector-Search via LangChain, one of the most popular large language model (LLM) application development frameworks, which integrates with our TileDB-Vector-Search library. This approach is called retrieval-augmented generation (RAG).
If you wish to learn more about RAG LLMs, visit the Introduction section.
Setup
To be able to run this tutorial, you will need an OpenAI API key. In addition, if you wish to use your local machine instead of a TileDB Cloud notebook, you will need to install the following (using either conda or pip):
conda install -c conda-forge langchain==0.0.331 openai==0.28.1 tiktoken
pip install langchain==0.0.331 openai==0.28.1 tiktoken
Start by importing the necessary libraries, setting the URIs you will use throughout the tutorial, and cleaning up any previously generated data:
import os
import shutil

from langchain.chains import ConversationalRetrievalChain, ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.txt import TextParser
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.tiledb import TileDB

# URIs to be used throughout the tutorial
langchain_repo_uri = "langchain"
index_uri = "langchain_doc_index"

# Clean up any data generated by previous runs
if os.path.exists(langchain_repo_uri):
    shutil.rmtree(langchain_repo_uri)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)
Vanilla ChatGPT 3.5 Turbo
Initialize ChatGPT 3.5 Turbo.
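A minimal initialization might look as follows. This sketch assumes your OpenAI API key is available in the OPENAI_API_KEY environment variable, which the langchain OpenAI integrations read by default:
# Assumes the OPENAI_API_KEY environment variable is set
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)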
Ask ChatGPT a question about LangChain. Note that ChatGPT incorrectly describes LangChain as a language learning platform based on the blockchain, and not as a framework for developing applications powered by LLMs.
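For example (the question string below is illustrative, and the model's exact wording will vary between runs):
question = "What is LangChain? Describe it in a couple of sentences."
print(chat.predict(question))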
RAG ChatGPT 3.5 Turbo
Now, you will use LangChain’s documentation to augment ChatGPT so that it can correctly answer the question about the project.
First, clone the LangChain repo from GitHub:
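To keep the example in Python, the following sketch shells out to git via subprocess; it clones LangChain's public GitHub repository into the langchain_repo_uri directory set earlier:
import subprocess

# Clone the public LangChain repository into langchain_repo_uri
subprocess.run(
    ["git", "clone", "https://github.com/langchain-ai/langchain.git", langchain_repo_uri],
    check=True,
)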
Next, parse the documents in the repo.
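Here is a sketch using the GenericLoader and TextParser imported earlier; the docs subdirectory and the Markdown suffixes are assumptions about the repository layout:
# Load the repo's documentation files as plain-text documents
loader = GenericLoader.from_filesystem(
    os.path.join(langchain_repo_uri, "docs"),
    glob="**/*",
    suffixes=[".md", ".mdx"],
    parser=TextParser(),
)
documents = loader.load()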
Generate the appropriate vector embeddings from the documents, which will be used to create a vector index afterwards.
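For instance, split the Markdown files into chunks and instantiate the OpenAI embedding model. The chunk sizes below are illustrative, and the vectors themselves are computed when the index is built in the next step:
# Split the Markdown documents into chunks suitable for embedding
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=1000, chunk_overlap=100
)
texts = splitter.split_documents(documents)

# OpenAI embedding model used to vectorize the chunks
embeddings = OpenAIEmbeddings()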
Create a vector index on the generated embeddings, using TileDB-Vector-Search:
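A sketch using the TileDB vector store imported earlier; index_type="FLAT" is an assumption here (TileDB-Vector-Search also supports IVF_FLAT indexes):
# Embed the chunks and persist a FLAT vector index at index_uri
db = TileDB.from_documents(
    texts, embeddings, index_uri=index_uri, index_type="FLAT"
)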
Now, ask the same question, augmenting ChatGPT with the vector index you created.
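For example, wire the index into the ConversationalRetrievalChain imported earlier and ask the same question:
# Use the vector index as a retriever for the chain
retriever = db.as_retriever()
qa = ConversationalRetrievalChain.from_llm(chat, retriever=retriever)

# No prior conversation, so chat_history is empty
result = qa({"question": question, "chat_history": []})
print(result["answer"])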
You can see that ChatGPT successfully responds with meaningful information about the LangChain project.
Clean up
Clean up the generated data.
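This removes the cloned repository and the vector index created above:
if os.path.exists(langchain_repo_uri):
    shutil.rmtree(langchain_repo_uri)
if os.path.exists(index_uri):
    shutil.rmtree(index_uri)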
What’s next?
Now that you know how to augment an LLM with vector search, you should learn how to augment it with conversation history as well by reading the Tutorials: LLM memory tutorial.