Document Question Answering Chatbot with the help of OpenAI, LangChain, VectorDB, and Gradio UI.

Nov 29, 2023

AI Document Chat Bot Karthikeyan Rathinam

Get in touch with Karthikeyan Rathinam: LinkedIn, GitHub, YouTube

Building an OpenAI Chatbot with Gradio UI Using LangChain

In this tutorial, we’ll walk through the process of creating a chatbot powered by OpenAI, integrated into a Gradio UI, and enhanced with LangChain for document handling. Together, these let a user upload a PDF, index its contents in a Pinecone vector store, and ask questions that are answered from the document's text.

Prerequisites

Before we dive into the code, make sure you have the following dependencies installed:

langchain
unstructured
pandas
chromadb
tiktoken
openai
gradio
adaptive
pdf2image
pytesseract

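Create a requirements.txt file with these dependencies and install them with pip. The Pinecone client (pinecone-client) is not in this list; it is installed separately in a later step.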

requirements.txt
langchain
unstructured
pandas
chromadb
tiktoken
openai
gradio
adaptive
pdf2image
pytesseract

pip install -r requirements.txt

The Code

`app.py`

Import libraries and packages

import gradio as gr
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import pinecone
import os

Install system packages:

# The leading "!" is notebook-only syntax and fails inside os.system, so it is dropped here.
os.system("sudo apt-get install tesseract-ocr")
os.system("sudo apt-get install poppler-utils")
os.system("pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html")
os.system("pip install -qU pinecone-client")

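Set keys and initialize Pinecone:

OPENAI_API_KEY = '---'
PINECONE_API_KEY = '---'
PINECONE_API_ENV = '---'
index_name = "---"

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)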

Set LLM and Chain :

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

Load PDF Document :

def load_pdf_document(file_path):
    loader = UnstructuredPDFLoader(file_path)
    return loader.load()

PDF Document to Chunks:

def split_document_to_chunks(documents):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(documents)
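
To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not LangChain's algorithm (RecursiveCharacterTextSplitter also tries to split on separators such as paragraph breaks); the chunk_text helper is a hypothetical name used only for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    stand-in for illustration, not LangChain's splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With chunk_overlap=0, as in the tutorial, a sentence that straddles a chunk boundary is cut in two. A small overlap may help retrieval by keeping such sentences intact in at least one chunk.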

Search Document :

def documentsearch(texts, embeddings, index_name):
    return Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
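
The idea behind the vector store can be sketched without any external service. The example below ranks a few texts by cosine similarity to a query vector. The three-dimensional vectors are made up by hand for illustration (real OpenAI embeddings are far longer), and the metric Pinecone actually uses depends on how the index was created, so treat this as an assumption-laden toy rather than what the service does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made (text, vector) pairs standing in for embedded chunks.
toy_index = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("refunds and returns", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)  # ['refund policy', 'refunds and returns']
```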

LLM Response Block:

def responces(query, docsearch):
    docs = docsearch.similarity_search(query, include_metadata=True)
    return chain.run(input_documents=docs, question=query)
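
The "stuff" chain type places all retrieved chunks into a single prompt and sends it to the model in one call. The sketch below shows that idea only. The prompt wording and the build_stuff_prompt helper are hypothetical, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one prompt (the "stuff" strategy).

    The wording of this template is illustrative, not LangChain's own.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk one.", "Chunk two."],
    "What is in chunk two?",
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit within the model's context window, which is one reason to keep chunk sizes modest.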

Clear Chat:

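def clear_chat():
    # Reset the stored conversation. gr.Interface has no update_chat method,
    # so return the empty history for the caller to display instead.
    global history
    history = []
    return history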

ChatBot Block :

docsearch = None
history = []

def chatbot(file, question):
    global history
    global docsearch
    if file is not None:
        data = load_pdf_document(file.name)
        texts = split_document_to_chunks(data)
        embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
        docsearch = documentsearch(texts, embeddings, index_name)
    if docsearch is not None and question is not None:
        history.append(("User", question))
        response = responces(question, docsearch)
        history.append(("Bot", response))
    return history

iface = gr.Interface(fn=chatbot, inputs=["file", "text"], outputs="list")
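
The history above is a flat list of ("User", ...) and ("Bot", ...) entries. If you later switch to a chat-style output component, you will probably need question/answer pairs instead; Gradio's Chatbot component has accepted that shape, but check the version you install. The to_chat_pairs helper below is a hypothetical sketch of that conversion.

```python
def to_chat_pairs(history):
    """Convert [("User", q), ("Bot", a), ...] into [(q, a), ...].

    A question without a following bot reply is dropped.
    """
    pairs = []
    pending_question = None
    for speaker, message in history:
        if speaker == "User":
            pending_question = message
        elif speaker == "Bot" and pending_question is not None:
            pairs.append((pending_question, message))
            pending_question = None
    return pairs


flat = [("User", "Q1"), ("Bot", "A1"), ("User", "Q2"), ("Bot", "A2")]
pairs = to_chat_pairs(flat)
print(pairs)  # [('Q1', 'A1'), ('Q2', 'A2')]
```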

Launch Chatbot :

iface.launch()

Complete Code Block :

import gradio as gr
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import pinecone
import os

# The leading "!" is notebook-only syntax and fails inside os.system, so it is dropped here.
os.system("sudo apt-get install tesseract-ocr")
os.system("sudo apt-get install poppler-utils")
os.system("pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html")
os.system("pip install -qU pinecone-client")

OPENAI_API_KEY = '---'
PINECONE_API_KEY = '---'
PINECONE_API_ENV = '---'
index_name = "---"

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

def load_pdf_document(file_path):
    loader = UnstructuredPDFLoader(file_path)
    return loader.load()

def split_document_to_chunks(documents):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(documents)

def documentsearch(texts, embeddings, index_name):
    return Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

def responces(query, docsearch):
    docs = docsearch.similarity_search(query, include_metadata=True)
    return chain.run(input_documents=docs, question=query)

docsearch = None
history = []

def chatbot(file, question):
    global history
    global docsearch
    if file is not None:
        data = load_pdf_document(file.name)
        texts = split_document_to_chunks(data)
        embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
        docsearch = documentsearch(texts, embeddings, index_name)
    if docsearch is not None and question is not None:
        history.append(("User", question))
        response = responces(question, docsearch)
        history.append(("Bot", response))
    return history

iface = gr.Interface(fn=chatbot, inputs=["file", "text"], outputs="list")

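def clear_chat():
    # Reset the stored conversation. gr.Interface has no update_chat method,
    # so return the empty history for the caller to display instead.
    global history
    history = []
    return history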

iface.launch()

Explanation of key components:

Loading Dependencies: We start by importing the necessary libraries and installing the required packages.
Setting Up API Keys: Replace '---' with your actual OpenAI API key, Pinecone API key, Pinecone environment, and index name.
Initializing Pinecone and OpenAI: Setting up Pinecone for document search and OpenAI for language understanding.
Defining the Chatbot Function: The chatbot function orchestrates document loading, text splitting, and question-answering using LangChain.
Gradio Interface Setup: Creating a Gradio interface to interact with the chatbot, allowing users to upload a file and input text queries.
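
Hardcoding keys in app.py makes them easy to leak if the file is shared. One common alternative is to read them from environment variables. The sketch below assumes you export the variables yourself before starting the app; the require_env helper and the DEMO_API_KEY name are hypothetical and used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before starting the app"
        )
    return value


# Demo only: give the hypothetical variable a placeholder so the sketch runs.
os.environ.setdefault("DEMO_API_KEY", "placeholder-value")
demo_key = require_env("DEMO_API_KEY")
print(len(demo_key) > 0)
```

In app.py the same helper could supply OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_API_ENV in place of the '---' placeholders.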

Document Handling with LangChain

LangChain handles the document pipeline. UnstructuredPDFLoader reads the PDF, RecursiveCharacterTextSplitter breaks it into chunks of up to 1,000 characters, and OpenAIEmbeddings turns each chunk into a vector that Pinecone can search when a question comes in.

Running the Application

To run the chatbot, execute the following commands:

sudo apt-get install tesseract-ocr poppler-utils
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html
pip install -qU pinecone-client
python app.py

Visit the provided Gradio UI link in your browser to interact with the chatbot.

Conclusion

Congratulations! You’ve built a sophisticated OpenAI-powered chatbot with a user-friendly Gradio interface, enhanced by LangChain for seamless document handling. This versatile system can be further customized and expanded to meet your specific requirements.

Feel free to experiment, add more features, or integrate additional functionalities to make your chatbot even more intelligent and useful.

Happy coding! 🚀

GitHub Repository : https://github.com/karthikeyanrathinam/Langchain-chatbot-with-openai
