Instead of generating queries based on the original question, HyDE (Hypothetical Document Embeddings) generates hypothetical documents for a given query. The intuition is that the embedding vector of such a hypothetical document identifies a neighborhood in the corpus embedding space where similar real documents can be retrieved via vector similarity. RAG can then use the hypothetical document to retrieve more relevant documents and answer the user query more accurately.
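To make this intuition concrete, here is a minimal, self-contained sketch with made-up toy vectors (a real pipeline would obtain them from an embedding model such as OpenAIEmbeddings), showing how a hypothetical document's embedding can rank real corpus documents by cosine similarity:

import numpy as np

# Toy embeddings; in practice these come from an embedding model.
corpus_embeddings = np.array([
    [0.9, 0.1, 0.0],   # real doc 0: lies near the hypothetical document
    [0.1, 0.9, 0.0],   # real doc 1: unrelated
    [0.8, 0.2, 0.1],   # real doc 2: also nearby
])
hypothetical_embedding = np.array([0.85, 0.15, 0.05])

# Cosine similarity between the hypothetical document and each corpus document
norms = np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(hypothetical_embedding)
similarities = corpus_embeddings @ hypothetical_embedding / norms

# The top-k most similar real documents form the retrieved "neighborhood"
top_k = np.argsort(similarities)[::-1][:2]
print(top_k)  # [0 2]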
Let’s try to use HyDE to answer questions through RAG!
%load_ext dotenv
%dotenv secrets/secrets.env

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

First, similar to the previous notebooks, we create our vector store and initialize the retriever using OpenAIEmbeddings and Chroma.
loader = DirectoryLoader('data/',glob="*.pdf",loader_cls=PyPDFLoader)
documents = loader.load()
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=text_chunks,
                                    embedding=OpenAIEmbeddings(),
                                    persist_directory="data/vectorstore")

# Since Chroma 0.4.x, documents are persisted automatically,
# so a manual vectorstore.persist() call is no longer needed.
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
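As a quick sanity check (a hypothetical snippet, not part of the original pipeline), we can query the retriever directly:

docs = retriever.invoke("What is low-rank adaptation?")
len(docs)  # 5, because of search_kwargs={'k': 5}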
Then we ask the LLM to write a “hypothetical” passage answering the question, using a chain.
from langchain_core.prompts import ChatPromptTemplate
hyde_prompt = ChatPromptTemplate.from_template(
"""
Please write a scientific passage of a paper to answer the following question:\n
Question: {question}\n
Passage:
"""
)
generate_doc_chain = (
{'question': RunnablePassthrough()}
| hyde_prompt
| ChatOpenAI(model='gpt-4',temperature=0)
| StrOutputParser()
)

question = "How Low Rank Adapters work in LLMs?"
generate_doc_chain.invoke(question)

"Low Rank Adapters (LRAs) are a recent development in the field of Large Language Models (LLMs) that aim to improve the efficiency and performance of these models. LLMs are typically trained on large amounts of data and have a high number of parameters, which can make them computationally expensive and difficult to fine-tune. LRAs address these issues by introducing a low-rank structure into the adapter modules of the LLMs.\n\nThe working of LRAs in LLMs can be understood in two steps: the introduction of adapter layers and the application of low-rank constraints. Adapter layers are additional layers inserted into the pre-trained LLMs. These layers are designed to adapt the pre-existing model to new tasks without altering the original parameters. This allows for efficient fine-tuning as only the parameters of the adapter layers need to be updated.\n\nThe second step involves the application of low-rank constraints to these adapter layers. The term 'low-rank' refers to the rank of a matrix, which is the maximum number of linearly independent rows or columns in the matrix. By constraining the adapter layers to have a low rank, the number of parameters that need to be learned is significantly reduced. This not only makes the model more computationally efficient but also helps to prevent overfitting.\n\nIn summary, Low Rank Adapters work in LLMs by introducing additional, low-rank adapter layers into the model. These layers adapt the model to new tasks without changing the original parameters, and their low-rank structure reduces the number of parameters that need to be learned. This makes the model more efficient and less prone to overfitting, thereby improving its performance."

Using the generated passage, we then retrieve similar documents using our retriever.
retrieval_chain = generate_doc_chain | retriever
retrieved_docs = retrieval_chain.invoke(question)
retrieved_docs

[Document(metadata={'source': 'data/LoRA.pdf', 'page': 1}, page_content='over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the\nchange in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed\nLow-RankAdaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural\nnetwork indirectly by optimizing rank decomposition matrices of the dense layers’ change during\nadaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3'),
 ...]

The retriever returns five identical copies of this chunk (only the first is shown above), most likely because the ingestion cell was run more than once against the same persisted vector store.
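If needed, a minimal sketch of post-retrieval deduplication by page content can clean this up (deduplicate below is a hypothetical helper, not a LangChain API):

def deduplicate(docs):
    # Keep only the first occurrence of each unique page_content
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

unique_docs = deduplicate(retrieved_docs)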
Finally, the documents retrieved based on the “hypothetical” passage are used as the context to answer our original question through the final_rag_chain.
template = """Answer the following question based on the provided context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
final_rag_chain = (
prompt
| ChatOpenAI(model='gpt-4',temperature=0)
| StrOutputParser()
)
final_rag_chain.invoke({"context":retireved_docs,"question":question})'The Low-Rank Adaptation (LoRA) approach in Large Language Models (LLMs) works by training some dense layers in a neural network indirectly. This is done by optimizing rank decomposition matrices of the dense layers\' change during adaptation, while keeping the pre-trained weights frozen. The hypothesis behind this approach is that the change in weights during model adaptation has a low "intrinsic rank".'Even though this technique might help answer questions, there is a chance to the answer be wrong due to retrieving documents based on the incorrect/hallucinated hypothetical passage.
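One way to reduce the impact of a single bad generation, proposed in the original HyDE paper, is to embed several sampled hypothetical documents and average the resulting vectors. As a sketch of that pattern, LangChain ships a HypotheticalDocumentEmbedder (assuming your version exposes it under langchain.chains; import paths vary across releases):

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Queries are expanded into hypothetical document(s) by the LLM before
# embedding; multiple generations are combined by averaging their vectors.
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=ChatOpenAI(model='gpt-4', temperature=0),
    base_embeddings=OpenAIEmbeddings(),
    prompt_key="web_search",  # one of the built-in HyDE prompt templates
)

# Drop-in replacement wherever an embedding model is expected,
# e.g. embedding=hyde_embeddings when building the Chroma store above.
query_vector = hyde_embeddings.embed_query(question)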
In the next section, we talk about “Routing” in RAG.