
Workshop 1: PDF Q&A ❓ App using OpenAI, Langchain, and Chainlit

This repository contains an introductory workshop for learning LLM application development using Langchain, OpenAI, and Chainlit. The workshop walks through a simplified process of developing an LLM application that provides a question-answering interface to PDF documents. The prerequisite for the workshop is a basic working knowledge of Git, Linux, and Python. The workshop is organized as follows.

| Lab | Learning Objective | Problem | Solution |
|-----|--------------------|---------|----------|
| 1 | Basic chat with data LLM App | PDF Q&A Application | ✅ Solution |
| 2 | Basic prompt engineering | Improving Q&A Factuality | ✅ Solution |

To run the fully functional application, please check out the main branch and follow the instructions to run the application.

The app provides a chat interface that asks the user to upload a PDF document and then allows them to ask questions against it. It uses OpenAI's API for the chat and embedding models, Langchain as the framework, and Chainlit as the full-stack interface.

For the purpose of the workshop, we are using Gap Q1 2023 Earnings Release as the example PDF.

The completed application looks as follows: PDF Q&A App

🧰 Stack

👉 Getting Started

We use Python Poetry for managing virtual environments and we recommend using pyenv to manage Python versions. Alternatively, you can use Mamba for Python version management.

Install the dependencies and start the Poetry shell as follows.

```shell
poetry install
poetry shell
```

Once the application is installed, please create a .env file from .env.sample. Edit the .env file with your OpenAI organization ID and API key.

```shell
cp .env.sample .env
```

Run the Application

Run the application with:

```shell
chainlit run app/app.py -w
```

Lab 1: Basic chat with data LLM App

The quintessential LLM application is a chat-with-your-data application. This type of application uses a retrieval augmented generation (RAG) design pattern: the application first retrieves the relevant texts from memory and then generates an answer based on the retrieved text.
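The retrieve-then-generate loop can be illustrated with a toy retriever. Here a bag-of-words vector stands in for a real embedding model, and cosine similarity picks the most relevant chunk; this is purely illustrative, not the app's actual retrieval code.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words term-count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank chunks by similarity to the question and return the top k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Net sales were $3.28 billion for the quarter.",
    "Operating margin was 0.3 percent on a reported basis.",
    "The company operates about 3,500 stores worldwide.",
]
top = retrieve("What was the operating margin?", chunks)
```

In a real RAG application, the top chunks are then stuffed into the prompt as context for the generation step.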

For our application, we will go through the following steps in the order of execution:

  1. User uploads a PDF file.
  2. App loads and decodes the PDF into plain text.
  3. App chunks the text into smaller documents, because embedding models have a limited input size.
  4. App embeds the chunks and stores the embeddings in memory.
  5. User asks a question.
  6. App retrieves the relevant documents from memory and generates an answer based on the retrieved text.
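Step 3 above can be sketched as a fixed-size splitter with overlap. This is a simplified toy, not how Langchain's RecursiveCharacterTextSplitter actually works (it tries paragraph, sentence, and word separators recursively before falling back to fixed sizes):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-character pieces that share `overlap` characters."""
    step = chunk_size - overlap
    # Stop before the trailing overlap so we don't emit a redundant tail chunk.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 2500, chunk_size=1000, overlap=100)
# Neighboring chunks share their boundary characters, so text cut at a
# boundary still appears whole in at least one chunk.
```

The overlap matters because a sentence that straddles a chunk boundary would otherwise be split across two embeddings and match neither well at retrieval time.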

The overall architecture is as follows: init

Please implement the missing pieces in the application.

Lab 1: Solution

  1. We choose langchain.document_loaders.PDFPlumberLoader to load PDF files. It retains PDF file metadata, which will help in the future. (And we like the Super Mario Brothers, who are plumbers.)
  2. We choose langchain.text_splitter.RecursiveCharacterTextSplitter to chunk the text into smaller documents.
  3. Any in-memory vector store should be suitable for this application, since we are only expecting a single PDF; anything more is over-engineering. We choose Chroma.
  4. We use langchain.chains.RetrievalQAWithSourcesChain since it returns the sources, which helps end users access the source documents.
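Putting these four choices together, the pipeline can be wired up roughly as follows. This is a sketch against the mid-2023 Langchain API (module paths and signatures have since changed), not the workshop's actual code; the imports are deferred into the function so the sketch stands alone, and running it requires Langchain, Chroma, and an OpenAI API key.

```python
def build_qa_chain(pdf_path: str):
    # Deferred imports: this targets the mid-2023 Langchain package layout.
    from langchain.chains import RetrievalQAWithSourcesChain
    from langchain.chat_models import ChatOpenAI
    from langchain.document_loaders import PDFPlumberLoader
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Chroma

    # 1-2. Load and decode the PDF into text documents.
    docs = PDFPlumberLoader(pdf_path).load()
    # 3. Chunk into pieces small enough for the embedding model.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    # 4. Embed the chunks into an in-memory Chroma vector store.
    store = Chroma.from_documents(chunks, OpenAIEmbeddings())
    # 5-6. A retrieval QA chain that also returns source references.
    return RetrievalQAWithSourcesChain.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
        chain_type="stuff",
        retriever=store.as_retriever(),
    )
```

Usage would look something like `chain = build_qa_chain("gap_q1_2023.pdf")` followed by `chain({"question": "..."})`, with the answer and sources in the returned dict.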

The completed application has the following architecture: final

Now we can run the application.

Lab 2: Basic prompt engineering

Playing around with our newly created application using the provided sample PDF, the Gap Q1 2023 Earnings Release, we run into a hallucination problem with the following long question asking the model to summarize the key results:

What's the results for the reporter quarter? Please describe in the following order using bullet points - revenue, gross margin, opex, op margin, net income, and EPS. INclude both gaap and non-gaap numbers. Please also include quarter over quarter changes.

A whopping 36.6% operating margin for a retail business! That is 10x higher than Amazon's consolidated operating margin!

hallucination

We then asked the model a simpler and more bounded question:

What's the op margin for the current quarter

fact

And it finds the answer.

Please resolve this hallucination problem with prompt engineering.

Lab 2: Solution

We utilized Chainlit's Prompt Playground functionality to experiment with the prompts. First, we investigated the prompt that includes the retrieved results and found that the correct operating margin was included. So the model was having a difficult time generating summaries using the right context.

We found that if we remove the few-shot examples implemented by Langchain, gpt-3.5-turbo-0613 is able to generate the right answer. However, for some reason, it decided to turn the sources into bullet points with summaries. We then experimented further and "fixed" the sources prompt.

To implement the updated prompts in our application, we traced Langchain's Python source code. We found that RetrievalQAWithSourcesChain inherits from BaseQAWithSourcesChain, which has a class method from_chain_type() that uses load_qa_with_sources_chain to create the chain. That function maps the keyword stuff to _load_stuff_chain. We then found that _load_stuff_chain takes a prompt variable and a document_prompt variable to create a StuffDocumentsChain that performs QA as a document summarization task.

The composition of the overall prompt is as follows: Alt text

We then extracted the prompts into their own file and implemented them there. Finally, we initialized the RetrievalQAWithSourcesChain with our custom prompts!
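One way to wire in custom prompts, assuming the same mid-2023 Langchain API traced above (from_chain_type forwards chain_type_kwargs down to _load_stuff_chain's prompt and document_prompt parameters), is sketched below. The template texts here are placeholders, not the workshop's actual prompts; imports are deferred so the sketch stands alone.

```python
def build_chain_with_custom_prompts(retriever, llm):
    # Deferred imports: mid-2023 Langchain package layout.
    from langchain.chains import RetrievalQAWithSourcesChain
    from langchain.prompts import PromptTemplate

    # Placeholder QA prompt; the stuff chain fills {summaries} with the
    # formatted retrieved documents and {question} with the user's question.
    qa_prompt = PromptTemplate(
        template=(
            "Answer the question using only the context below. "
            "If the answer is not in the context, say you don't know.\n\n"
            "Context:\n{summaries}\n\nQuestion: {question}\nAnswer:"
        ),
        input_variables=["summaries", "question"],
    )
    # Placeholder per-document prompt controlling how each retrieved chunk
    # and its source reference are rendered into {summaries}.
    document_prompt = PromptTemplate(
        template="Content: {page_content}\nSource: {source}",
        input_variables=["page_content", "source"],
    )
    # chain_type_kwargs is forwarded to _load_stuff_chain.
    return RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={"prompt": qa_prompt, "document_prompt": document_prompt},
    )
```

Keeping the templates in their own module, as the workshop does, makes them easy to iterate on in the Prompt Playground without touching the chain-construction code.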

LICENSE

This repository is open source under GPLv3. Please cite it as:

```bibtex
@misc{PDF_QA_App_Workshop,
  author = {Lee, Hanchung},
  title = {Workshop 1: PDF Q&A App using OpenAI, Langchain, and Chainlit},
  url = {https://github.com/leehanchung/llm-pdf-qa-workshop},
  year = {2023},
  month = {6},
  howpublished = {Github Repo},
}
```


