Using PyMuPDF in an RAG (Retrieval-Augmented Generation) Chatbot Environment

This repository contains examples showing how PyMuPDF can be used as a data feed for RAG-based chatbots.

Examples include scripts that start chatbots - either as simple CLI programs in REPL mode or browser-based GUIs. Chatbot scripts follow this general structure:

Extract Text: Use PyMuPDF to extract text from one or more pages of one or more PDFs. Depending on the requirement, this may be all text or only the text contained in tables, the Table of Contents, etc. This is generally implemented as one or more Python functions that are called by any of the following steps, which implement the actual chatbot functionality.
Indexing the Extracted Text: Index the extracted text for efficient retrieval. This index will act as the knowledge base for the chatbot.
Query Processing: When a user asks a question, process the query to determine the key information needed for a response.
Retrieving Relevant Information: Search your indexed knowledge base for the most relevant pieces of information related to the user's query.
Generating a Response: Use a generative model to generate a response based on the retrieved information.
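The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used by the example scripts: it assumes the text has already been extracted, swaps in a simple bag-of-words cosine similarity where a real chatbot would typically use embeddings and a vector store, and stops at building the prompt instead of calling a generative model. All function names and the sample text are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(md_text):
    """Split Markdown text into chunks at blank lines (one chunk per block)."""
    return [c.strip() for c in re.split(r"\n\s*\n", md_text) if c.strip()]


def build_index(chunks):
    """Indexing step: store a term-frequency vector next to each chunk."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(index, question, top_k=2):
    """Query processing and retrieval: rank chunks by similarity to the question."""
    query_vec = Counter(tokenize(question))
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Input for the generation step: the prompt a generative model would receive."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Stand-in for text extracted with PyMuPDF (hypothetical sample content).
md_text = """# Invoice

The invoice total is 420 EUR, due on 1 March.

# Shipping

Parcels are shipped within three business days."""

question = "What is the invoice total?"
index = build_index(split_into_chunks(md_text))
top_chunks = retrieve(index, question, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The index is built once per document set, while retrieval and prompt building run for every question, which is why they are kept as separate functions.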

Installation

Notably, the "helpers" folder contains a script that converts PDF pages into text strings in Markdown format (GitHub-compatible), covering standard text as well as table-based text in a consistent, integrated view. This is particularly important in RAG environments.

The Python package pymupdf4llm on PyPI (also available under the alias pdf4llm) provides convenient access to this script:

$ pip install -U pymupdf4llm

This command will automatically install PyMuPDF if required.

Then, in your script:

import pymupdf4llm

md_text = pymupdf4llm.to_markdown("input.pdf", pages=None)

# now work with the markdown text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

Instead of the filename string used above, you can also provide a PyMuPDF Document. The pages parameter may be a list of 0-based page numbers or None (the default), which includes all pages.
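Because the pages parameter expects 0-based numbers while PDF viewers usually show 1-based ones, a small conversion helper can prevent off-by-one mistakes. The following is a sketch only: parse_page_ranges and its "1-3, 7" input syntax are hypothetical and not part of pymupdf4llm.

```python
def parse_page_ranges(spec):
    """Convert a 1-based page spec such as "1-3, 7" to a sorted list of 0-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based (as shown in PDF viewers) to 0-based page numbers.
        pages.update(range(start - 1, end))
    return sorted(pages)


pages = parse_page_ranges("1-3, 7")
print(pages)  # [0, 1, 2, 6]

# Assumed usage, following the to_markdown call shown above:
# md_text = pymupdf4llm.to_markdown("input.pdf", pages=pages)
```

The helper returns a plain list, so its result can be passed directly as the pages argument.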

Document Support

While PDF is by far the most important document format worldwide, it is worth mentioning that all examples and helper scripts work the same way, without change, for all supported file types.

So for an XPS document or an eBook, simply provide the filename, for instance "input.mobi", and everything else will work as before.

About PyMuPDF

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF's homepage is located on GitHub.

Community

Join us on Discord here: #pymupdf.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.
