PDF Reader

This project serves as middleware between an original PDF file and a Dify knowledge base: it extracts the PDF's content so that it can be used to construct the knowledge base. However, I hope it does more than that.

Demo

Test PDF file link: https://coursepals-pdfs.s3.us-west-1.amazonaws.com/test1.pdf

Effect w/o image and correction

Effect w/ image and correction

Feature and Performance Comparison

|                  | PDF Reader | Jina Reader | Amazon Textract |
| ---------------- | ---------- | ----------- | --------------- |
| Response Speed   | Fast (*) | Very Fast | Very Fast |
| Response Content | Text + Image (JSON format) | Text | Text + Image |
| Pros             | Extra features: 1. Image alternative text 2. Text correction | Text extraction is fast and the accuracy is acceptable. | The response contains both text and image results, and the response speed is fast. |
| Cons             | Responses take longer when image extraction and text correction are enabled. | No image extraction; the text accuracy is not high enough. | Text and image content get mixed up when they are not arranged strictly vertically. |
Local Deployment
Open Source 🤔 (Partial)

Note (*): PDF Reader can reach the same speed as Jina Reader when the query parameters image=False and correct=False are set, which disables the two time-consuming features.

Request Query Parameters

  • image: Whether to generate alternative text for images in the PDF file. Defaults to True. To disable this feature, set image=false or image=False (the case of the boolean value does not matter).
  • correct: Whether to correct the text extracted from the PDF file. Enabling this feature improves text accuracy by 20% by fixing spacing and typographical errors, but processing takes longer. Set correct=false or correct=False to disable it.

Usage Example

Local: http://127.0.0.1:8000/[PDF link]?correct=false&image=false

Cloud: https://nv27s8zxgi.execute-api.us-west-1.amazonaws.com/prod/[PDF link]?correct=false&image=false

Note: For now, the cloud service only responds to requests that carry a specific token in the header, so it is not yet open to the public.
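
As a rough illustration of the URL pattern above, here is a minimal Python sketch that assembles a request URL from a base address, a PDF link, and the two query flags. The helper name build_request_url is hypothetical and not part of the project; the base address and test PDF link are the ones given in this README.

```python
from urllib.parse import urlencode

LOCAL_BASE = "http://127.0.0.1:8000"  # local address from the usage example


def build_request_url(base: str, pdf_link: str, image: bool = True, correct: bool = True) -> str:
    """Append the PDF link to the service base URL and add the two query flags.

    Hypothetical helper for illustration only; it is not part of pdf-reader.
    """
    # The service accepts true/false in any case, so lowercase is fine.
    query = urlencode({"correct": str(correct).lower(), "image": str(image).lower()})
    return f"{base.rstrip('/')}/{pdf_link}?{query}"


url = build_request_url(
    LOCAL_BASE,
    "https://coursepals-pdfs.s3.us-west-1.amazonaws.com/test1.pdf",
    image=False,
    correct=False,
)
print(url)
```

The sketch assumes the PDF link has no query string of its own; a link that already contains "?" would probably need extra handling.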

Response Fields

  1. code: Status code for current request.
    • 200: Request success.
    • 206: Partial success. Part of the response can be returned, but some functionality did not work (such as uploading images to S3, generating image captions, or correcting text).
    • 400: Request error.
  2. data: Content extracted from the PDF file. By default it includes the corrected text and the alternative text for images; the actual content depends on the query parameters and on whether the underlying services ran successfully.
  3. msg: Response message. If all the services ran successfully, it is "success"; otherwise, it explains what went wrong under the hood.
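
A client might branch on these three fields roughly as follows. This is only a sketch: the function name interpret_response and the sample payload are made up for illustration, and the exact type of data is not specified here, so it is passed through untouched.

```python
from typing import Any


def interpret_response(payload: dict[str, Any]) -> tuple[Any, str]:
    """Return (data, warning) for a response payload; raise on a request error.

    Hypothetical client-side helper, not part of pdf-reader.
    """
    code = payload.get("code")
    msg = payload.get("msg", "")
    if code == 200:
        # Full success: no warning to report.
        return payload.get("data"), ""
    if code == 206:
        # Partial success: data is still usable, msg explains what failed.
        return payload.get("data"), msg
    raise ValueError(f"request failed with code {code}: {msg}")


# Made-up payload showing the partial-success case.
sample = {"code": 206, "data": "extracted text", "msg": "image upload failed"}
data, warning = interpret_response(sample)
print(data, "|", warning)
```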

How to Set up

  1. Download the repo:

    git clone https://github.com/AshleyXM/pdf-reader.git
  2. Install dependencies:

    pip install -r requirements.txt
  3. Apply for some API keys

    What you will need:

    1. AWS Access Key and Secret Access Key
    2. AWS S3 Bucket Name
    3. OpenAI API key

    Then run the command below and, in the resulting .env file, replace my values with the keys you obtained:

    cp .env.example .env
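
Before starting the server, it can help to confirm that none of the settings is missing. The sketch below assumes the values from .env have already been loaded into the process environment, and the variable names are guesses; check .env.example for the names the project actually uses.

```python
import os

# Hypothetical variable names; use the names that actually appear in .env.example.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET_NAME",
    "OPENAI_API_KEY",
)


def missing_settings(env: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_settings(dict(os.environ))
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```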

🎉🎉🎉 Congrats! You are ready to go now!

How to Run

Run the command below:

uvicorn app.main:app --reload

Access http://127.0.0.1:8000/ with your browser, and you'll see the home page, which displays some quick guides of the project.

How to Deploy

Basically, what you need to do is the following five steps:

  1. Prepare a Dockerfile
  2. Build Docker Image
  3. Push the Image to AWS ECR
  4. Create Lambda Function with ECR Image
  5. Configure API Gateway

You can check out deploy-branch for more details.

Highlights

  • Developed middleware hosted on AWS Lambda using Python FastAPI to facilitate the knowledge base construction from PDF files for customized GPTs in Stanford courses.
  • Stored images in AWS S3 to obtain public links and generated alternative text for images in the format [image caption](image link).
  • Improved text extraction accuracy by 20% by leveraging OpenAI Vision Model to correct spacing and typographical errors.
  • Improved response speed by 67% by running text correction and image alternative text generation asynchronously.
  • Enhanced project robustness and reliability by implementing exception handling and extensive test cases.

Challenges Under the Hood

Response Speed Optimization

One of the biggest challenges is how to optimize the response speed.

At first, I used Azure Computer Vision to get the caption and OCR result of each image, and the OpenAI LLM gpt-3.5-turbo to correct the spacing and typographical errors in the text content.

However, as the number of pages in a PDF file grows, processing one file takes forever, since even a single API call to OpenAI or Azure CV takes several seconds. So I realized that I needed to run these tasks asynchronously instead of one by one.

Then a new problem came up. Although OpenAI provides asynchronous support, Azure CV does not yet. Therefore, to improve the response speed, I had to choose between abandoning the original plan of adding image processing and finding another way to do it.

Fortunately, I eventually found the marvin toolkit, which is based on OpenAI Vision and provides pretty good asynchronous support for generating image captions, and it resolved this problem.

After these optimizations, the response speed improved by around 67%, which is pretty satisfying for our current task.
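
The idea of running the slow calls concurrently instead of one by one can be sketched with asyncio alone. This is not the project's code: fake_api_call is a stand-in that just sleeps, where the real service would await the OpenAI or marvin calls.

```python
import asyncio
import time


async def fake_api_call(page: int, delay: float = 0.1) -> str:
    """Stand-in for one slow network call (text correction or image captioning)."""
    await asyncio.sleep(delay)
    return f"page {page} processed"


async def process_sequentially(pages: list[int]) -> list[str]:
    # Each call waits for the previous one to finish.
    return [await fake_api_call(p) for p in pages]


async def process_concurrently(pages: list[int]) -> list[str]:
    # gather starts all calls together and returns results in input order.
    return await asyncio.gather(*(fake_api_call(p) for p in pages))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and report how long it took."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


pages = list(range(5))
seq_result, seq_time = timed(process_sequentially(pages))
con_result, con_time = timed(process_concurrently(pages))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With simulated delays, the sequential run takes roughly the sum of the delays while the concurrent run takes roughly the longest single delay. Real gains depend on API rate limits and on how much of the work is actually I/O-bound.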

Deployment

At first, I followed some tutorials on deploying a FastAPI project to AWS Lambda. However, the dependencies of this project are pretty big; even after I tried to split them across several layers, they were still hard to manage with Lambda layers. So I switched to creating the Lambda function from a container image. Pushing the Docker image to an ECR repository and creating the Lambda function from it went pretty smoothly, but then something else went wrong.

Everything worked perfectly except uploading images to the S3 bucket. From the message "An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records." returned by the Lambda function, I could tell that something was wrong with the access key. So I regenerated the access key and secret access key and swapped in the new ones. Still not working...

Then, from a Stack Overflow comment, I learned that a project deployed on AWS Lambda needs a different configuration. So I stopped passing the access key, secret access key, and region when creating the S3 client. After that, the error message changed to "An error occurred (AccessDenied) when calling the PutObject operation: Access Denied." Alright, it seemed like I had gotten a little closer to success, because this error looked like a plain permission issue.

Unfortunately, I spent a day troubleshooting this problem. I tried adding S3 full access permission to the Lambda function, to the IAM user I was working with, to the S3 bucket... I still got code 206 from my project, which means the image uploading function did not work.

At last, I tried to map out how the whole system works: which part needs permission from which other part, and who grants that permission. It suddenly occurred to me that I should have granted S3FullAccess to the role that actually executes the Lambda function, not to the IAM user I was using. After that, the whole system finally worked. I have to say, AWS permissions are truly kind of tricky to play with😟.
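
The fix above grants full S3 access to the Lambda execution role. A narrower alternative, shown here only as a sketch, is a policy that allows uploads to a single bucket. It assumes the upload needs nothing beyond s3:PutObject (making the images publicly readable may require more), and the bucket name is a placeholder.

```python
import json


def s3_put_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that allows uploading objects to one bucket.

    Illustrative only; attach the result to the Lambda execution role,
    not to the IAM user used for deployment.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions apply to the keys inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# "my-pdf-images" is a placeholder bucket name.
policy = s3_put_policy("my-pdf-images")
print(json.dumps(policy, indent=2))
```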
