Generative AI ChatBot: The OpenAI LangChain Document QA Bot with Gradio UI
Here is an explanation of the OpenAI LangChain Document QA code:
The main components:
UnstructuredFileLoader
- Loads documents from various file types (TXT, PDF, DOCX, etc.) into a list of text documents that LangChain can process.
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(file_obj.name, strategy="fast")
docs = loader.load()
OpenAIEmbeddings
- Creates vector embeddings for each chunk of text using OpenAI's embedding model. This encodes the semantic meaning of each chunk.
from langchain.embeddings.openai import OpenAIEmbeddings

chunks = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
FAISS
- Indexes the vector embeddings to create the actual searchable knowledge base. This allows fast similarity search over the embeddings/documents.
from langchain.vectorstores import FAISS

knowledge_base = FAISS.from_documents(chunks, embeddings)
CharacterTextSplitter
- Splits the documents into smaller chunks before embedding. This allows embedding documents of arbitrary size.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0, length_function=len)
load_qa_chain
- Builds a question-answering chain around the OpenAI LLM. With chain_type="stuff", the chain stuffs the retrieved chunks and the question into a single prompt for the model.
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(llm, chain_type="stuff")
The main functions:
create_knowledge_base
- Takes documents, splits them, embeds them, and indexes them into a knowledge base.
upload_file
- Handler for uploading a local file. Loads it and creates the knowledge base.
upload_via_url
- Handles a URL: downloads the content and loads it.
answer_question
- Takes a question and searches the knowledge base embeddings for the most relevant passages. Passes these to the QA chain to generate an answer. A sketch of how these functions fit together follows.
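The article lists these functions without their bodies, so here is a minimal sketch of how they might fit together, assembled from the snippets in this post. upload_via_url is omitted because its download logic isn't shown; get_empty_state and the state-dict shape are assumptions for illustration, not the author's exact code.

from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

def get_empty_state():
    # Assumed state shape: the gr.State holds the FAISS index between calls.
    return {"knowledge_base": None}

def create_knowledge_base(docs):
    # Split -> embed -> index.
    text_splitter = CharacterTextSplitter(
        separator="\n", chunk_size=500, chunk_overlap=0, length_function=len
    )
    chunks = text_splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings()
    return FAISS.from_documents(chunks, embeddings)

def upload_file(file_obj):
    # Gradio's UploadButton hands us a temp file; .name is its path on disk.
    loader = UnstructuredFileLoader(file_obj.name, strategy="fast")
    docs = loader.load()
    return file_obj.name, {"knowledge_base": create_knowledge_base(docs)}

def answer_question(question, state):
    knowledge_base = state["knowledge_base"]
    docs = knowledge_base.similarity_search(question)
    llm = OpenAI()
    chain = load_qa_chain(llm, chain_type="stuff")
    return chain.run(input_documents=docs, question=question)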
The Gradio app:
- Allows uploading local files or entering URLs to populate the knowledge base.
- Provides a textbox to enter a question.
- Displays the generated answer returned by the QA model.
with gr.Blocks(css="style.css", theme=gr.themes.Soft()) as demo:
    state = gr.State(get_empty_state())
    gr.HTML("""LINKAGETHINK""")
    with gr.Column(elem_id="col-container"):
        gr.HTML(
            """"""  # (HTML snippet omitted in this excerpt)
        )
        gr.HTML(
            """
            OPENAI Document QA
            """
        )
        gr.HTML(
            """"""  # (HTML snippet omitted in this excerpt)
        )
        gr.Markdown("**Upload your file**")
        with gr.Row(elem_id="row-flex"):
            with gr.Column(scale=0.85):
                file_url = gr.Textbox(
                    value="",
                    label="Upload your file",
                    placeholder="Enter a url",
                    show_label=False,
                    visible=False
                )
            with gr.Column(scale=0.90, min_width=160):
                file_output = gr.File(elem_classes="filenameshow")
            with gr.Column(scale=0.10, min_width=160):
                upload_button = gr.UploadButton(
                    "Browse File", file_types=[".txt", ".pdf", ".doc", ".docx", ".json", ".csv"],
                    elem_classes="filenameshow")
        with gr.Row():
            with gr.Column(scale=1, min_width=0):
                user_question = gr.Textbox(value="", label='Question Box :', show_label=True, placeholder="Ask a question about your file:", elem_classes="spaceH")
        with gr.Row():
            with gr.Column(scale=1, min_width=0):
                answer = gr.Textbox(value="", label='Answer Box :', show_label=True, placeholder="", lines=5)

    # Wire UI events to the handler functions described above.
    file_url.submit(upload_via_url, file_url, [file_output, state])
    upload_button.upload(upload_file, upload_button, [file_output, state])
    user_question.submit(answer_question, [user_question, state], [answer])
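The excerpt stops before the app is started; app.py presumably ends with something like:

demo.launch()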
Loading documents
The UnstructuredFileLoader handles ingesting files of different formats, such as PDF, DOC, and TXT.
loader = UnstructuredFileLoader(file_path, strategy="fast")
docs = loader.load()
It uses heuristic strategies (here, "fast") to extract text content from these file types. The resulting docs is a list of text snippets representing the contents of the uploaded file.
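To see what the loader produced, it helps to inspect a couple of entries. A quick sketch: each item is a LangChain Document, whose standard fields are page_content and metadata.

for doc in docs[:3]:
    print(doc.metadata, doc.page_content[:80])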
Text Splitting
The CharacterTextSplitter splits the loaded documents into smaller chunks:
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)
This is done because large documents can't be embedded in one go; the embedding model has an input size limit. The chunks will be embedded separately.
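A quick way to sanity-check the splitter is to run it on raw text with split_text, the string-level counterpart of split_documents. The sample text below is purely illustrative:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0, length_function=len)
text = "\n".join(f"Paragraph {i}: " + "lorem ipsum " * 20 for i in range(10))
for chunk in splitter.split_text(text):
    print(len(chunk))  # each chunk stays under the 500-character budget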
Embedding
The OpenAIEmbeddings module is used to generate vector embeddings for each chunk:
embeddings = OpenAIEmbeddings()
This uses OpenAI’s text-embedding-ada-002 model under the hood.
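Embeddings are just vectors of floats, and you can inspect one directly. embed_query is the single-string counterpart of the document-embedding call; it requires OPENAI_API_KEY to be set in the environment:

from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("What is this document about?")
print(len(vector))  # 1536 dimensions for text-embedding-ada-002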
Indexing
The vector embeddings for each chunk are indexed using FAISS for fast similarity search:
knowledge_base = FAISS.from_documents(chunks, embeddings)
This indexing allows finding the most relevant chunks for a query.
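Because building the index costs one embedding API call per chunk, it can be worth persisting it between runs. This is optional and not part of the original app; FAISS vector stores support it directly, and the path below is illustrative:

knowledge_base.save_local("faiss_index")
knowledge_base = FAISS.load_local("faiss_index", embeddings)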
Question Answering
Given a user question, relevant chunks are retrieved by searching the knowledge base index:
docs = knowledge_base.similarity_search(question)
These chunks are passed to a QA model to generate the answer:
from langchain.llms import OpenAI

llm = OpenAI()
chain = load_qa_chain(llm, chain_type="stuff")
response = chain.run(input_documents=docs, question=question)
The load_qa_chain helper sets up the QA chain using the OpenAI API.
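The "stuff" chain type simply concatenates every retrieved chunk into a single prompt, which works well for a handful of 500-character chunks. If the retrieved text ever outgrows the model's context window, load_qa_chain also accepts other chain types, for example:

chain = load_qa_chain(llm, chain_type="map_reduce")  # answer per chunk, then combine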
So in summary, the app provides an end-to-end pipeline for uploading documents, indexing them for search, and leveraging a QA chain to answer questions based on the uploaded documents' contents.
Run the app.py script:
pip install -r requirements.txt
python app.py
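If you need to recreate requirements.txt, the imports used above imply roughly the following packages. This list is an inference from the code, not the author's file; faiss-cpu and tiktoken are the usual companions of FAISS and OpenAIEmbeddings:

langchain
openai
faiss-cpu
unstructured
gradio
tiktoken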
Complete Notebook: available on GitHub. Make a Star and Fork this notebook if you find it useful.
Feel free to reach out if you have any questions or need further assistance.