PDF Reader

This project serves as middleware between an original PDF file and the knowledge base built from it in Dify. However, I hope it can do more than that.

Demo

Test PDF file link: https://coursepals-pdfs.s3.us-west-1.amazonaws.com/test1.pdf

Feature and Performance Comparison

| | PDF Reader | Jina Reader | Amazon Textract |
| --- | --- | --- | --- |
| Response Speed | Fast (*) | Very Fast | Very Fast |
| Response Content | Text + Image (JSON format) | Text | Text + Image |
| Pros | Extra features: 1. Image alternative text 2. Text correction | Text extraction is fast and the accuracy is acceptable. | The response contains both text and image results, and the response is fast. |
| Cons | Responses take longer when image extraction and text correction are enabled. | No image extraction; the text accuracy is not high enough. | Text and image content get mixed up when they are not arranged strictly vertically. |
| Local Deployment | ✅ | ❌ | ❌ |
| Open Source | ✅ | 🤔 (Partial) | ❌ |

Note (*): PDF Reader can be as fast as Jina Reader if the query parameters image=False and correct=False are set, which disables the two time-consuming features.

Request Query Parameters

image: Whether to generate alternative text for images in the PDF file. Defaults to True. To disable this feature, set image=false or image=False (the boolean value is case-insensitive).
correct: Whether to correct the text extracted from the PDF file. Enabling this feature improves text accuracy by 20% by fixing spacing and typographical errors, but processing takes longer. Set correct=false or correct=False to disable it.
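
The case-insensitive handling of these two flags could look roughly like the sketch below. The helper name `parse_bool_flag` is hypothetical and is not taken from the project's code, and falling back to the default for unrecognized values is an assumption.

```python
def parse_bool_flag(raw, default=True):
    """Interpret a query-string value as a boolean, ignoring case.

    Hypothetical helper: the project's real parsing logic may differ.
    Missing or unrecognized values fall back to the default (an assumption).
    """
    if raw is None:
        return default
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    return default


# Both spellings mentioned above disable the feature.
print(parse_bool_flag("false"), parse_bool_flag("False"))
```

Defaulting to True mirrors the documented behavior that both features are enabled unless they are turned off explicitly.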

Usage Example

Local: http://127.0.0.1:8000/[PDF link]?correct=false&image=false

Cloud: https://nv27s8zxgi.execute-api.us-west-1.amazonaws.com/prod/[PDF link]?correct=false&image=false

Note: For now, the cloud service only responds to requests that carry a specific token in the header, which means it is not yet open to the public.
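
As a minimal sketch, the URL pattern shown above can be assembled with the standard library. The function name `build_request_url` is hypothetical, and the sketch assumes the PDF link is appended to the base URL unescaped, exactly as in the examples.

```python
from urllib.parse import urlencode


def build_request_url(base_url, pdf_link, image=True, correct=True):
    """Build a request URL of the form <base>/<PDF link>?correct=...&image=...

    Hypothetical helper; it assumes the PDF link goes into the path as-is.
    """
    query = urlencode({
        "correct": str(correct).lower(),
        "image": str(image).lower(),
    })
    return f"{base_url.rstrip('/')}/{pdf_link}?{query}"


LOCAL_BASE = "http://127.0.0.1:8000"
TEST_PDF = "https://coursepals-pdfs.s3.us-west-1.amazonaws.com/test1.pdf"

print(build_request_url(LOCAL_BASE, TEST_PDF, image=False, correct=False))
```

The sketch only builds the URL. Sending it to the cloud endpoint would also require the access token mentioned in the note, whose header name is not documented here.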

Response Fields

code: Status code for the current request.
- 200: Success.
- 206: Partial success. Part of the response is returned, but some functionality failed (such as uploading images to S3, generating image captions, or correcting text).
- 400: Request error.
data: Content extracted from the PDF file. By default it includes the corrected text and alternative text for images; the actual content depends on the query parameters and on whether the underlying services ran successfully.
msg: Response message. If all the services ran successfully, it is "success"; otherwise, it explains what went wrong under the hood.
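
A client might branch on these fields along the following lines. This is a sketch that assumes the response body is a JSON object with exactly the three fields described above; the example payload and its message text are made up for illustration.

```python
def describe_response(payload):
    """Summarize a decoded response using its code and msg fields.

    Assumes the payload is a dict with the code, data, and msg fields.
    """
    code = payload.get("code")
    msg = payload.get("msg", "")
    if code == 200:
        return "success"
    if code == 206:
        # data is still usable, but some feature failed; msg explains which.
        return f"partial success: {msg}"
    if code == 400:
        return f"request error: {msg}"
    return f"unexpected code {code}: {msg}"


# Hypothetical payload, not captured from the real service.
example = {"code": 206, "data": "...", "msg": "image upload failed"}
print(describe_response(example))
```

Treating 206 as usable but degraded matches the description above: the extracted content is returned even when image upload, captioning, or text correction fails.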

How to Set up

Download the repo:

```
git clone https://github.com/AshleyXM/pdf-reader.git
```

Install dependencies:
```
pip install -r requirements.txt
```
Apply for some API keys

What you will need:
1. AWS Access Key and Secret Access Key
2. AWS S3 Bucket Name
3. OpenAI API key
Then run the command below and replace my values in the .env file with your own keys:
```
cp .env.example .env
```

🎉🎉🎉 Congrats! You are ready to go now!

How to Run

Run the below command:

```
uvicorn app.main:app --reload
```

Open http://127.0.0.1:8000/ in your browser, and you'll see the home page, which shows a quick guide to the project.

How to Deploy

Basically, what you need to do is the following five steps:

1. Prepare a Dockerfile
2. Build the Docker image
3. Push the image to AWS ECR
4. Create a Lambda function from the ECR image
5. Configure API Gateway

You can check out deploy-branch for more details.

Highlights

Developed middleware hosted on AWS Lambda using Python FastAPI to facilitate the knowledge base construction from PDF files for customized GPTs in Stanford courses.
Stored images in AWS S3 to obtain public links and generated alternative text for images in the format [image caption](image link).
Improved text extraction accuracy by 20% by leveraging OpenAI Vision Model to correct spacing and typographical errors.
Optimized response speed by 67% by running text correction and image alternative text generation asynchronously.
Enhanced project robustness and reliability by implementing exception handling and extensive test cases.
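
The alternative-text format mentioned in the highlights, [image caption](image link), can be produced with a one-line helper. The function name and the example caption and URL below are hypothetical.

```python
def format_image_alt_text(caption, image_link):
    """Render an image as [image caption](image link).

    Hypothetical helper; the project may assemble this string differently.
    """
    return f"[{caption}]({image_link})"


# The caption and URL are placeholders, not real project output.
print(format_image_alt_text("A bar chart of response times",
                            "https://example.com/figure1.png"))
```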

Challenges Under the Hood

Response Speed Optimization

One of the biggest challenges was optimizing the response speed.

At first, I used Azure Computer Vision to get the caption and OCR result of each image, and the OpenAI LLM gpt-3.5-turbo to correct spacing and typographical errors in the text content.

However, as the number of pages in a PDF file grew, processing a single file took far too long, since even one API call to OpenAI or Azure CV takes several seconds. So I realized that I needed to run these tasks asynchronously instead of one by one.

Then a new problem came up. Although OpenAI provides asynchronous support, Azure CV did not. Therefore, to improve the response speed, I had to choose between abandoning the original plan of adding image processing and finding another way to do it.

Fortunately, I eventually found the marvin toolkit, which is built on OpenAI Vision and provides good asynchronous support for generating image captions, and it resolved this problem.

After these optimizations, the response speed improved by around 67%, which is satisfying for the current task.
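
The difference between calling a slow API once per page in sequence and running the calls concurrently can be illustrated with a small asyncio sketch. Here `fake_api_call` only sleeps to imitate network latency; it stands in for the OpenAI and marvin calls, which are not reproduced here, and all function names are hypothetical.

```python
import asyncio
import time


async def fake_api_call(page, delay=0.1):
    """Stand-in for a slow remote call (caption or correction); it only sleeps."""
    await asyncio.sleep(delay)
    return f"processed page {page}"


async def process_sequentially(pages):
    # Each call waits for the previous one, so total time grows with page count.
    return [await fake_api_call(page) for page in pages]


async def process_concurrently(pages):
    # gather schedules all calls at once and returns results in input order.
    return await asyncio.gather(*(fake_api_call(page) for page in pages))


def timed(coro_fn, pages):
    """Run one of the strategies above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(pages))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    pages = [1, 2, 3, 4, 5]
    _, sequential_time = timed(process_sequentially, pages)
    _, concurrent_time = timed(process_concurrently, pages)
    print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

The measured gain in this toy example depends only on the simulated delay and page count, so it says nothing about the 67% figure reported for the real service.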

Deployment

At first, I followed some tutorials on deploying a FastAPI project to AWS Lambda. However, the project's dependency packages are large, and even after I split them into several layers, managing the dependencies with Lambda layers was still difficult. So I switched to creating the Lambda function from a container image. Pushing the Docker image to an ECR repository and creating the Lambda function from it went smoothly, but then another problem came up.

All the functionality worked perfectly except uploading images to the S3 bucket. The Lambda function returned the message "An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.", which told me something was wrong with the access key. So I regenerated the access key and secret access key and replaced the old ones with the new keys. It still did not work...

Then, from a Stack Overflow comment, I learned that a project deployed on AWS Lambda needs a different configuration. So I removed the access key, secret access key, and region when creating the S3 client. After that, the error message changed to "An error occurred (AccessDenied) when calling the PutObject operation: Access Denied." It seemed I was a little closer to success, because this error looked like a plain permission issue.

Unfortunately, I then spent a day troubleshooting this problem. I tried adding S3 full access permission to the Lambda function, to the IAM user I was operating with, and to the S3 bucket... I still got code 206 from my project, which meant the image upload did not work.

At last, I mapped out how the whole system works: which part needs permission from which other part, and who grants it. I suddenly realized that I should have granted S3FullAccess to the role that actually executes the Lambda function, not to the IAM user I was using. Finally, the whole system worked. I have to say, working with AWS permissions is truly tricky😟.
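
The fix described above was to attach S3 full access to the Lambda execution role. As a narrower alternative that is not what this project did, the sketch below builds an IAM policy document that allows only PutObject on a single bucket. The bucket name is a placeholder, and whether PutObject alone is enough for this project is an assumption that would need checking.

```python
import json


def build_put_object_policy(bucket_name):
    """Return an IAM policy document allowing uploads to one bucket.

    Sketch of a least-privilege alternative to full S3 access; it would be
    attached to the Lambda execution role, not to the IAM user.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions apply to keys under the bucket ARN.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# "my-example-bucket" is a placeholder, not the project's bucket.
print(json.dumps(build_put_object_policy("my-example-bucket"), indent=2))
```

Whichever policy is used, the point from the troubleshooting above still holds: the permission has to be granted to the role that executes the Lambda function.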
