pdf2md

pdf2md converts academic paper PDF files into readable text files. The application analyzes the layout of the PDF, lets you restructure paragraphs and translate the content you select, and exports the result as a text file.

Setup

We recommend setting up a virtual environment for running the application. This can be done with the following commands:

python -m venv .venv

# Activate the virtual environment depending on your platform
# Windows (cmd or PowerShell)
.venv\Scripts\activate

# Linux or macOS
source .venv/bin/activate   # or
. .venv/bin/activate

Next, install the required Python packages:

pip install -r requirements.txt

In order to run pdf2md, you need to set up a .env file. An example .env file could look like:

CACHE_DIR="./cache"
EXPORT_DIR="./export"
TEXT_FONT="Malgun Gothic"
TEXT_FONT_SIZE=11

# If you want to translate with RapidAPI DeepL API
DEEPL_RAPID_API_KEY=(your RapidAPI key)
DEEPL_RAPID_API_HOST=(your RapidAPI host)
DEEPL_RAPID_API_SRC_LANG=EN
DEEPL_RAPID_API_DST_LANG=KO

# If you want to translate with OpenAI GPT-4
OPENAI_API_KEY=(your OpenAI API key)
PROMPT_DIR="./prompt"

In Korea, we don't have access to the DeepL API yet, so we use RapidAPI, which has similar pricing terms.

To obtain a RapidAPI DeepL API key, please visit the following website: https://rapidapi.com/splintPRO/api/deepl-translator/

To obtain an OpenAI API key, please visit the following website: https://platform.openai.com/
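
For illustration, the general settings in the example .env file could be read along the following lines. This is a minimal sketch that uses only the standard library, not the project's actual loader: the names parse_env_text, Settings, and load_settings are hypothetical, and the fallback values are assumptions copied from the example file.

```python
from __future__ import annotations

from dataclasses import dataclass


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional double quotes.
        values[key.strip()] = value.strip().strip('"')
    return values


@dataclass
class Settings:
    cache_dir: str
    export_dir: str
    text_font: str
    text_font_size: int


def load_settings(env: dict[str, str]) -> Settings:
    """Build Settings from parsed values (fallbacks are assumptions)."""
    return Settings(
        cache_dir=env.get("CACHE_DIR", "./cache"),
        export_dir=env.get("EXPORT_DIR", "./export"),
        text_font=env.get("TEXT_FONT", "Malgun Gothic"),
        text_font_size=int(env.get("TEXT_FONT_SIZE", "11")),
    )
```

Calling load_settings(parse_env_text(open(".env", encoding="utf-8").read())) would then give typed access to the four general settings.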

Run

The application can then be run with:

python -m src.main --f (URL or file name)

If the URL points to an arXiv or Hugging Face paper page, the program attempts to infer the PDF URL and read from it.

https://arxiv.org/abs/1706.03762 → https://arxiv.org/pdf/1706.03762.pdf
https://huggingface.co/papers/1706.03762 → https://arxiv.org/pdf/1706.03762.pdf
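
The two mappings above can be sketched as a small URL rewrite. This is an illustrative sketch rather than the project's own code: the function name infer_pdf_url and the identifier pattern are assumptions, and anything that does not match is returned unchanged.

```python
import re
from urllib.parse import urlparse

# Assumed pattern for modern arXiv identifiers such as 1706.03762 or 1706.03762v5.
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")

# Host name -> first path segment that precedes the paper identifier.
_PAPER_PAGES = {
    "arxiv.org": "abs",
    "www.arxiv.org": "abs",
    "huggingface.co": "papers",
    "www.huggingface.co": "papers",
}


def infer_pdf_url(url: str) -> str:
    """Map an arXiv abstract page or a Hugging Face paper page to the arXiv PDF URL."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    prefix = _PAPER_PAGES.get(parsed.netloc.lower())
    if prefix is None or len(parts) != 2 or parts[0] != prefix:
        return url  # not a recognized paper page
    paper_id = parts[1]
    if not _ARXIV_ID.match(paper_id):
        return url  # identifier does not look like an arXiv ID
    return f"https://arxiv.org/pdf/{paper_id}.pdf"
```

Local file names and direct PDF links pass through untouched, so the function can be applied to any input before opening it.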

If no arguments are provided, the program will attempt to read a PDF from a URL or file path present in your clipboard.

You can check the list of available fonts in the system with the --l option.

python -m src.main --l
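
The option handling described in this section could be modeled roughly as follows. This is a hedged sketch using argparse, not the code in src.main: the names build_parser and choose_action are hypothetical, and the clipboard is represented by a caller-supplied function because how the application reads it is not documented here.

```python
import argparse
from typing import Callable, List, Optional, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m src.main")
    # --f takes a URL or a file name; it is optional.
    parser.add_argument("--f", dest="source", metavar="URL_OR_FILE", default=None)
    # --l lists the fonts available on the system.
    parser.add_argument("--l", dest="list_fonts", action="store_true")
    return parser


def choose_action(
    argv: List[str], read_clipboard: Callable[[], str]
) -> Tuple[str, Optional[str]]:
    """Return ("list_fonts", None) or ("open", source) for the given arguments."""
    args = build_parser().parse_args(argv)
    if args.list_fonts:
        return ("list_fonts", None)
    # Fall back to the clipboard only when --f was not given.
    source = args.source if args.source is not None else read_clipboard()
    return ("open", source)
```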

Usage

The user interface provides the following functions:

Safe Area: Define a safe area in the PDF to designate the primary content. The safe area can be adjusted by dragging the red rectangle and applies to the whole document.

Visibility: Toggle the visibility of elements in the translation or export. Individual elements can be toggled on and off with a click, or multiple elements can be selected by dragging.

Body: Distinguish body text from non-body elements. Images are automatically marked as non-body and cannot be changed. Use this button to exclude captions when chaining separated paragraphs.

Concat / Split: Merge multiple elements into a single paragraph by dragging, or separate them back into lines with a right click. Merged paragraphs do not include line breaks.

Join / Split: Similar to Concat/Split, but merges paragraphs with line breaks.

Order: Adjust the order of paragraphs. First, click an element to set it as the baseline, then left click another element to place it after the baseline, or right click to place it before. Use the Esc key to cancel the selection.

Chain: Link paragraphs that have been split over several blocks or pages. A single click links the paragraph to the next body paragraph without a line break, a double click links them with a line break. Chained paragraphs are processed together during translation.

Translate: Translates the selected paragraph into Korean. The application currently uses GPT-4; you can adjust the prompt by editing the prompt/translate.txt file.
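
To make the difference between Concat, Join, Order, and Chain concrete, the sketch below models them on plain strings. It is an illustration under assumptions, not the application's implementation: the Paragraph class and all function names are hypothetical, and joining merged lines with a single space is an assumption (the text above only says that no line break is kept).

```python
from dataclasses import dataclass
from typing import List, Optional


def concat(lines: List[str]) -> str:
    """Concat: merge elements into one paragraph without line breaks."""
    return " ".join(line.strip() for line in lines)


def join(paragraphs: List[str]) -> str:
    """Join: merge elements but keep a line break between them."""
    return "\n".join(p.strip() for p in paragraphs)


def place(order: List[str], baseline: str, target: str, before: bool = False) -> List[str]:
    """Order: move target right after the baseline, or right before it."""
    if baseline == target:
        return list(order)
    result = [item for item in order if item != target]
    index = result.index(baseline)
    result.insert(index if before else index + 1, target)
    return result


@dataclass
class Paragraph:
    text: str
    body: bool = True
    link: Optional[str] = None  # "inline" (single click) or "break" (double click)


def chained_units(paragraphs: List[Paragraph]) -> List[str]:
    """Chain: attach each linked paragraph to the next body paragraph."""
    units: List[str] = []
    separator: Optional[str] = None  # set when the previous body paragraph is linked
    for p in paragraphs:
        if not p.body:
            continue  # non-body elements such as captions are skipped
        if separator is not None:
            units[-1] += separator + p.text
        else:
            units.append(p.text)
        separator = {"inline": " ", "break": "\n"}.get(p.link)
    return units
```

Each string returned by chained_units corresponds to one unit that would be translated together, which is why marking captions as non-body matters before chaining.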

Roadmap

  • Extraction of images embedded in the PDF.
  • Extraction of tables in the PDF.
  • Proper parsing or image extraction of equations in the PDF.
  • Automatic identification and modification of text attributes such as title, subtitle, and body text through font analysis.
  • Export to markdown (md) files with formatting.
  • Export to MHTML files, including images, tables, and equations.

Contributors

eiaserinnys
