This project aims to serve as middleware between an original PDF file and the knowledge base built from it in Dify. However, I hope it does more than that.
Test PDF file link: https://coursepals-pdfs.s3.us-west-1.amazonaws.com/test1.pdf
| | PDF Reader | Jina Reader | Amazon Textract |
|---|---|---|---|
| Response Speed | Fast (*) | Very Fast | Very Fast |
| Response Content | Text + Image (JSON format) | Text | Text + Image |
| Pros | Extra features: 1. Image alternative text 2. Text correction | Text extraction is fast and the accuracy is acceptable. | The response contains both text and image results, and the response is fast. |
| Cons | The response takes longer when image extraction and text correction are enabled. | No image extraction; the accuracy of the text result is not high enough. | Text and image content get mixed up when they are not arranged strictly vertically. |
| Local Deployment | ✅ | ❌ | ❌ |
| Open Source | ✅ | 🤔 (Partial) | ❌ |
Note (*): PDF Reader can reach the same speed as Jina Reader if the query parameters `image=False` and `correct=False` are set, which disables the two time-consuming features.
- `image`: Decides whether to generate alternative text for images in the PDF file. It defaults to `True`. If you don't want this feature, set `image=false` or `image=False` (the case of the boolean value does not matter).
- `correct`: Decides whether to enable text correction for the text in the PDF file. Enabling this feature improves text accuracy by about 20% by fixing spacing and typographical errors, but processing takes longer. Set `correct=false` or `correct=False` to disable it.
Local: http://127.0.0.1:8000/[PDF link]?correct=false&image=false
Cloud: https://nv27s8zxgi.execute-api.us-west-1.amazonaws.com/prod/[PDF link]?correct=false&image=false
Note: For now, the cloud service only responds to requests that carry a specific token in the header, which means it is not yet open to the public.
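As a rough illustration of the URL patterns above, the following sketch assembles a request URL for the local service. The helper name `build_reader_url` and its arguments are hypothetical and not part of the project; the sketch also assumes the PDF link is appended to the path unencoded, exactly as the patterns show.

```python
from urllib.parse import urlencode

LOCAL_BASE = "http://127.0.0.1:8000"


def build_reader_url(pdf_link: str, image: bool = True, correct: bool = True,
                     base: str = LOCAL_BASE) -> str:
    """Return a request URL for `pdf_link` (hypothetical helper)."""
    # The service accepts "false"/"False" in any case; lowercase is used here.
    query = urlencode({
        "correct": str(correct).lower(),
        "image": str(image).lower(),
    })
    # Assumption: the PDF link follows the base URL as-is, per the pattern above.
    return f"{base.rstrip('/')}/{pdf_link}?{query}"


url = build_reader_url(
    "https://coursepals-pdfs.s3.us-west-1.amazonaws.com/test1.pdf",
    image=False,
    correct=False,
)
print(url)
```

Passing the cloud base URL as `base` would give the cloud form, although a request there would presumably also need the token header mentioned in the note.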
- `code`: Status code for the current request.
  - `200`: Request succeeded.
  - `206`: Request partially succeeded, which means part of the response can be returned, but some functionality did not work (such as image uploading to S3, image caption generation, or text correction).
  - `400`: Request error.
- `data`: Extracted content from the PDF file, by default including corrected text and alternative text for images. It depends on the query parameters and on the running status of the services under the hood.
- `msg`: Response message. If all the services ran successfully, it is "success"; otherwise, it is a message explaining what went wrong under the hood.
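A client might branch on the three fields roughly as follows. This is a minimal sketch: the function name `interpret_response` and the sample payloads are made up for illustration, and only the field names `code`, `data`, and `msg` and the three status codes come from the description above.

```python
from typing import Any


def interpret_response(payload: dict[str, Any]) -> tuple[Any, str]:
    """Return (data, warning) for a response body, or raise on failure."""
    code = payload.get("code")
    msg = payload.get("msg", "")
    if code == 200:
        # Full success: nothing to warn about.
        return payload.get("data"), ""
    if code == 206:
        # Partial success: the data is usable, msg explains what failed.
        return payload.get("data"), msg
    # 400 or anything unexpected is treated as a failed request.
    raise ValueError(f"request failed with code {code}: {msg}")


# Hypothetical payloads, shaped after the field descriptions above.
full = {"code": 200, "data": "page text", "msg": "success"}
partial = {"code": 206, "data": "page text", "msg": "image upload failed"}

print(interpret_response(full))
print(interpret_response(partial))
```

Treating `206` as usable but flagged keeps the extracted text available even when image upload or text correction fails.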
- Download the repo: `git clone https://github.com/AshleyXM/pdf-reader.git`
- Install dependencies: `pip install -r requirements.txt`
- Apply for some API keys. What you will need:
  - AWS Access Key and Secret Access Key
  - AWS S3 Bucket Name
  - OpenAI API key

  Then run `cp .env.example .env` and replace the values in the `.env` file with the keys you obtained.
🎉🎉🎉 Congrats! You are ready to go now!
Run the command below:
`uvicorn app.main:app --reload`
Open http://127.0.0.1:8000/ in your browser and you'll see the home page, which displays a quick guide to the project.
Basically, what you need to do is the following five steps:
- Prepare a Dockerfile
- Build Docker Image
- Push the Image to AWS ECR
- Create Lambda Function with ECR Image
- Configure API Gateway
You can check out `deploy-branch` for more details.
- Developed middleware hosted on AWS Lambda using Python FastAPI to facilitate the knowledge base construction from PDF files for customized GPTs in Stanford courses.
- Stored images in AWS S3 to obtain public links and generated alternative text for images in the format `[image caption](image link)`.
- Improved text extraction accuracy by 20% by leveraging OpenAI Vision Model to correct spacing and typographical errors.
- Optimized response speed by 67% by running text correction and image alternative text generation asynchronously.
- Enhanced project robustness and reliability by implementing exception handling and extensive test cases.
One of the biggest challenges was optimizing the response speed.
At first, I used Azure Computer Vision to get the caption and OCR result of each image and the OpenAI LLM `gpt-3.5-turbo` to correct the spacing and typographical errors in the text content.
However, as the number of pages in a PDF file grew, processing one file took forever, since even a single API call to OpenAI or Azure CV takes several seconds. So I realized that I needed to run these tasks asynchronously instead of one by one.
Then a new problem came up. Even though OpenAI provides asynchronous support, Azure CV does not implement it yet. Therefore, to improve the response speed, I had to choose between abandoning the original plan of adding image processing and finding another way to do it.
Fortunately, I finally found the marvin toolkit, which is based on OpenAI Vision and provides good asynchronous support for generating image captions, and it resolved this problem.
After this optimization work, the response speed improved by around 67%, which is satisfying for our current task.
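The idea of running the per-page tasks concurrently instead of one by one can be sketched with `asyncio.gather`. This is not the project's code: `correct_text` and `caption_image` are hypothetical stand-ins that sleep to simulate a slow API call, and no real OpenAI or marvin call is made.

```python
import asyncio
import time


async def correct_text(page: int) -> str:
    """Stand-in for one slow text-correction API call (hypothetical)."""
    await asyncio.sleep(0.1)  # simulates network latency
    return f"corrected text for page {page}"


async def caption_image(page: int) -> str:
    """Stand-in for one slow image-captioning API call (hypothetical)."""
    await asyncio.sleep(0.1)  # simulates network latency
    return f"caption for page {page}"


async def process_page(page: int) -> tuple[str, str]:
    # Both calls for a single page run concurrently.
    text, caption = await asyncio.gather(correct_text(page), caption_image(page))
    return text, caption


async def process_pdf(num_pages: int) -> list[tuple[str, str]]:
    # All pages run concurrently; gather returns results in input order.
    return await asyncio.gather(*(process_page(p) for p in range(num_pages)))


start = time.perf_counter()
results = asyncio.run(process_pdf(5))
elapsed = time.perf_counter() - start
print(f"{len(results)} pages processed in {elapsed:.2f}s")
```

With ten simulated calls of 0.1 s each, a sequential loop would need about a second, while the concurrent version should finish in roughly the time of a single call. Real API calls are also subject to rate limits, so the actual gain depends on the service.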
At first, I followed some tutorials on deploying a FastAPI project to AWS Lambda. However, the dependency packages of this project are quite large, and even after I tried splitting them into several layers, it was still hard to manage the dependencies with Lambda layers. So I switched to creating the Lambda function from a container image. Pushing the Docker image to an ECR repository and creating the Lambda function from it went smoothly, but then something else went wrong.
All the functionality worked perfectly except uploading images to the S3 bucket. At first, from the message `An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.` returned from the Lambda function, I could tell something was wrong with the access key. Therefore, I regenerated the access key and secret access key and swapped in the new ones. Still not working...
Then, from a Stack Overflow comment, I learned that a project deployed on AWS Lambda needs a different configuration. So I removed the access key, secret access key, and region from the code that creates the S3 client. After doing this, the error message changed to `An error occurred (AccessDenied) when calling the PutObject operation: Access Denied`. Alright, it seemed like I had gotten a little closer to success, because this error message looked like a plain permission issue.
Unfortunately, I spent a day troubleshooting this problem. I tried adding S3 full access permission to the Lambda function, to the IAM user I was operating as, to the S3 bucket... and still got code `206` from my project, which means the image upload did not work.
At last, I tried to map out how the whole system works: which part needs which part's permission, and who grants that permission. I suddenly realized that I should have granted S3FullAccess to the role that actually executes the Lambda function instead of to the IAM user I was using. Finally, the whole system worked. I have to say it is truly tricky to play with AWS permissions😟.
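The fix described above was to grant full S3 access to the Lambda execution role. As a narrower alternative (not what this project did), the role could be given a policy that only allows `PutObject` on one bucket, since that is the operation named in both error messages. The sketch below only builds the policy document as JSON; the helper name and the bucket name `my-pdf-images` are hypothetical, and other operations, if the project needs them, would have to be added to `Action`.

```python
import json


def s3_put_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that allows uploads to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: uploading objects is the only S3 call needed.
                "Action": ["s3:PutObject"],
                # Object-level actions apply to the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


policy = s3_put_policy("my-pdf-images")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

The resulting JSON would be attached to the role that executes the Lambda function, not to the IAM user used for deployment.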