openai-pdf's Introduction

Use the OpenAI ChatGPT API to extract information from PDF files

Why is it hard to extract information from PDF files?

PDF, or Portable Document Format, is a popular file format that is widely used for documents such as invoices, purchase orders, and other business documents. However, extracting information from PDFs can be a challenging task for developers.

One reason is that the format carries little semantic structure. Unlike HTML, which has dedicated elements for tables and headers that developers can easily identify, a PDF typically describes only where text is drawn on the page. This makes it harder for developers to locate the specific information they need.

Another reason is that there is no standard layout for information. Each system generates invoices and purchase orders differently, so developers must often write custom code to extract information from each individual document. This can be a time-consuming and error-prone process.

Additionally, PDFs can contain both text and images, making it difficult for developers to programmatically extract information from the document. OCR (optical character recognition) can be used to extract text from images, but this adds complexity to the process and may result in errors if the OCR software is not accurate.

Existing solutions

Existing solutions for extracting information from PDFs include:

  • Regular expressions: match patterns in the text after converting the PDF to plain text. Examples include invoice2data and traprange-invoice. However, this method requires knowing the format of each data field in advance.

  • AI-based cloud services: use machine learning to extract structured data from PDFs. Examples include pdftables and docparser, but these are hosted services rather than open-source tools.
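
To make the regex limitation concrete, here is a minimal sketch that is not taken from invoice2data or traprange-invoice. The class name RegexPoExtractor and the pattern are assumptions for illustration; the pattern only covers PO numbers written as "PO-" followed by digits, as in the sample document used later in this README.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPoExtractor {
    // Assumed field format: "PO-" followed by one or more digits.
    private static final Pattern PO_NUMBER = Pattern.compile("\\bPO-\\d+\\b");

    /** Returns the first PO number found in the text, if any. */
    static Optional<String> findPoNumber(String text) {
        Matcher matcher = PO_NUMBER.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Matches because the text follows the assumed format.
        System.out.println(findPoNumber("Purchase Order\nPO-003847945").orElse("not found"));
        // A system that labels the field differently is missed by the same pattern.
        System.out.println(findPoNumber("Order No: 3847945").orElse("not found"));
    }
}
```

Each new document layout needs its own pattern, which is the maintenance cost described above.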

Yet another solution for PDF data extraction: using the OpenAI ChatGPT API

Another approach is to use OpenAI's natural language processing capabilities to understand the content of the document. However, the API used here accepts only text, not PDF or image files, so the first step is to convert the PDF to text while retaining the relative positions of the text items.

One way to achieve this is the PDFLayoutTextStripper library, which uses PDFBox to read all text items in the PDF file and arrange them in lines, keeping their relative positions the same as in the original PDF. This matters because, for example, if an amount in an invoice's items table lands in the same column as a quantity, queries for the total amount and total quantity will return incorrect values. Here is an example of the output from the stripper:

                       
                                                                                                *PO-003847945*                                           
                                                                                                                                                         
                                                                                      Page.........................: 1    of    1                        
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
                Address...........:     Aeeee  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 1234                                                                                                     
                                        Dooo,                                      PO-003847945                                                          
                                        ABC                                       TL-00074                                   
                                                                                                                                                         
                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  
                                                                                                                                                         
                                                                                                                                                         
               100225                Aaaaaa  Eeeeee                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               AAAA BBBB CCCCCAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          
                                                                                                                                                         
                                                1.25                                                                                                                        
(truncated...)
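
The idea behind position-preserving output can be illustrated with a simplified sketch. This is not PDFLayoutTextStripper's actual algorithm. The TextItem record, the render method, and the charWidth and lineHeight parameters are hypothetical, and the sketch assumes y grows downward from the top of the page.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LayoutSketch {
    /** Hypothetical text item: a string and the page coordinates where it starts. */
    record TextItem(String text, double x, double y) {}

    /**
     * Places each item on a character grid. charWidth and lineHeight are
     * assumed average glyph dimensions in page units.
     */
    static String render(List<TextItem> items, double charWidth, double lineHeight) {
        // TreeMap keeps rows ordered from top to bottom.
        Map<Integer, StringBuilder> rows = new TreeMap<>();
        // Process items left to right so padding is only ever appended.
        List<TextItem> sorted = items.stream()
                .sorted(Comparator.comparingDouble(TextItem::x))
                .toList();
        for (TextItem item : sorted) {
            int row = (int) Math.round(item.y() / lineHeight);
            int col = (int) Math.round(item.x() / charWidth);
            StringBuilder line = rows.computeIfAbsent(row, r -> new StringBuilder());
            // Pad with spaces until the item's target column is reached.
            while (line.length() < col) {
                line.append(' ');
            }
            // If earlier text already ran past the column, separate with one space.
            if (line.length() > col) {
                line.append(' ');
            }
            line.append(item.text());
        }
        return String.join("\n", rows.values());
    }

    public static void main(String[] args) {
        List<TextItem> items = List.of(
                new TextItem("Quantity", 0, 0), new TextItem("Amount", 120, 0),
                new TextItem("5.00", 0, 12), new TextItem("270.00", 120, 12));
        // With 6-unit characters, "Amount" and "270.00" both start at column 20.
        System.out.println(render(items, 6.0, 12.0));
    }
}
```

Because header and value share a column offset, a reader (or a language model) can associate "270.00" with "Amount" rather than with "Quantity".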

Once the PDF has been converted to text, the next step is to call the OpenAI API and pass the text along with a query such as "Extract fields: 'PO Number', 'Total Amount'". The response is in JSON format, and GSON can be used to parse it and extract the final results. This two-step process, converting the PDF to text and then querying OpenAI's natural language processing capabilities, can be an effective way to extract information from PDF files.

The query is as simple as the following, with %s replaced by the PO text content:

private static final String QUERY = """
    Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
    Return result in JSON format without any explanation. 
    The PO content is as follows:
    %s
    """;

The query consists of two components:

  • specifying the desired fields
  • formatting the field values as JSON data for easy retrieval from the API response.
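
As a rough sketch of how the query might be sent, the following uses only the JDK's java.net.http client. It is not the project's Main.java. The class and method names are hypothetical, the model name is copied from the example response in this README (substitute whichever completion model your account offers), the max_tokens and temperature values are illustrative choices, and the OPENAI_API_KEY environment variable is an assumption. The JSON body is escaped by hand to keep the sketch dependency-free; a JSON library such as GSON would normally do this.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompletionRequestSketch {
    private static final String QUERY = """
        Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
        Return result in JSON format without any explanation.
        The PO content is as follows:
        %s
        """;

    /** Escapes a string so it can be embedded in a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON request body with the PO text substituted into the query. */
    static String buildBody(String poText) {
        String prompt = QUERY.formatted(poText);
        return "{\"model\":\"text-davinci-003\",\"prompt\":\"" + jsonEscape(prompt)
                + "\",\"max_tokens\":256,\"temperature\":0}";
    }

    /** Builds the POST request for the completions endpoint. */
    static HttpRequest buildRequest(String apiKey, String poText) {
        return HttpRequest.newBuilder(URI.create("https://api.openai.com/v1/completions"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(poText)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String apiKey = System.getenv("OPENAI_API_KEY"); // assumed to be set
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(apiKey, "PO-003847945 ..."), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Setting temperature to 0 is a choice made here to keep the extracted values as repeatable as possible.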

Here is an example response from OpenAI:

{
  "object": "text_completion",
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "\n{\n  \"PO Number\": \"PO-003847945\",\n  \"Total Amount\": \"1,485.00\",\n  \"Delivery Address\": \"Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT\"\n}",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  // ... some more fields
}

Decoding the text field's JSON string yields the following desired fields:

{
  "PO Number": "PO-003847945",
  "Total Amount": "1,485.00",
  "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
}
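
Note that the decoded values are strings, so "Total Amount" still contains a thousands separator. If a numeric value is needed downstream, one possible post-processing step is sketched below. The AmountParser class is hypothetical and assumes US-style separators (comma for grouping, dot for decimals), which matches the sample but may not hold for other documents.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;

public class AmountParser {
    /** Parses an amount such as "1,485.00", assuming US-style separators. */
    static BigDecimal parseAmount(String raw) {
        DecimalFormat format =
                new DecimalFormat("#,##0.###", DecimalFormatSymbols.getInstance(Locale.US));
        format.setParseBigDecimal(true); // avoid double rounding for money values
        String trimmed = raw.trim();
        ParsePosition position = new ParsePosition(0);
        Number parsed = format.parse(trimmed, position);
        // Reject input that is not a number or has trailing characters.
        if (!(parsed instanceof BigDecimal) || position.getIndex() != trimmed.length()) {
            throw new IllegalArgumentException("Not a valid amount: " + raw);
        }
        return (BigDecimal) parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseAmount("1,485.00"));
    }
}
```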

Run sample code

Prerequisites:

  • Java 16+
  • Maven

Steps:

  • Create an OpenAI account
  • Log in and generate an API key
  • Replace OPENAI_API_KEY in Main.java with your key
  • (Optional) Update SAMPLE_PDF_FILE if needed
  • Open a terminal, move to the project root directory, and run the following commands:
mvn install
java -jar target/openai-pdf-1.0-SNAPSHOT-jar-with-dependencies.jar

More development-friendly ways include:

  • Using VS Code with the Java plugin.
  • Using IntelliJ.

Solution for large PDF files

pdf-query

openai-pdf's People

Contributors

thoqbk

openai-pdf's Issues

Execute this Java project

Hi,

I'm not used to Java, so could you help me run your library?

I installed Java 16 and ran mvn install successfully. Finally, I ran:

cd target
java16 -jar openai-pdf-1.0-SNAPSHOT.jar

But I'm getting this error:

no main manifest attribute, in C:\Users\user1\apps\openai-pdf\target\openai-pdf-1.0-SNAPSHOT.jar

And I have no idea how to fix this!

Thanks @thoqbk
