Readme
About
This script works with PDF files. It currently supports the following functions:
- exports the outline/TOC of a PDF to a human-readable file (see the TOC format section)
- imports a TOC into a PDF
- extracts annotations (highlights, comments, etc.) from a PDF file and formats them as org-mode
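As a rough illustration of the first two functions, the sketch below reads and writes an outline with PyMuPDF's `Document.get_toc()` and `Document.set_toc()`. Whether the script itself uses these calls is an assumption. The function names and the indented text layout are hypothetical and are not the script's actual TOC file format.

```python
def format_toc(entries, indent="    "):
    """Render [level, title, page] entries as indented text lines.

    The layout (indent per level, title followed by page) is a
    hypothetical placeholder, not the script's real TOC format.
    """
    lines = []
    for level, title, page in entries:
        lines.append(f"{indent * (level - 1)}{title} {page}")
    return "\n".join(lines)


def export_toc(pdf_path):
    """Read the outline of a PDF and return it as text."""
    import fitz  # PyMuPDF; imported lazily so format_toc works without it

    with fitz.open(pdf_path) as doc:
        # get_toc() returns a list of [level, title, page] entries (1-based).
        return format_toc(doc.get_toc())


def import_toc(pdf_path, entries, out_path):
    """Write [level, title, page] entries into a copy of the PDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        doc.set_toc(entries)
        doc.save(out_path)
```

For example, `print(format_toc([[1, "Intro", 1], [2, "Scope", 3]]))` prints two lines, with the second indented one level.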
Dependency
- Python 3
- PyMuPDF
pip3 install pymupdf
Usage
See pdfhelper.py --help for options and invocation:
python3 pdfhelper.py -h
TOC format
Sample toc file:
The sample shows three ways to customize page numbers:
- define page number: create a bookmark called 「The Five Rules」 at page 3
- set first page: create a bookmark called 「1. Toys」 at page 17 (since the first page is 16)
  - In fact, any line that matches “# xxx = number” sets the first page to number
- set page gap: create a bookmark called 「4. Numbers Games」 at page 58+(16-1)-2=71 (the 2-page gap has to be subtracted to get the correct page number)
  - useful when pages are missing
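The page arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula `printed page + (first page - 1) - gap`. The function names and the exact regular expression are assumptions, not the script's own code.

```python
import re

# Assumed reading of the “# xxx = number” rule: a line that starts
# with "#" and ends with "= <digits>".
FIRST_PAGE_RE = re.compile(r"^#.*=\s*(\d+)\s*$")


def parse_first_page(line):
    """Return the first-page number if the line sets one, else None."""
    match = FIRST_PAGE_RE.match(line.strip())
    return int(match.group(1)) if match else None


def pdf_page(printed_page, first_page=1, gap=0):
    """Map a printed page number to the page number in the PDF."""
    return printed_page + (first_page - 1) - gap


# "4. Numbers Games": printed page 58, first page 16, 2 missing pages.
print(pdf_page(58, first_page=16, gap=2))  # 71
```

With the defaults (`first_page=1`, `gap=0`) the printed page is used unchanged, which corresponds to the 「The Five Rules」 bookmark at page 3.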
Export Annotations
Currently, the following annotation types are supported:
Type | Result |
---|---|
Text | text |
Square | picture; set the zoom factor with --image-zoom and OCR the picture with --ocr-api |
Highlight | comment + text |
Underline | comment + text |
Squiggly | comment + text |
StrikeOut | comment + text |
You can customize the note format with the following options:
- --with-toc
- --toc-list-item-format
- --annot-list-item-format
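The table above can be read as a small dispatch on the annotation type. The sketch below is one hedged way to express it with PyMuPDF. The function names, the org-mode item layout, and the `image_path` parameter are hypothetical, and the real output is governed by the format options listed above.

```python
TEXT_MARKUP = {"Highlight", "Underline", "Squiggly", "StrikeOut"}


def annot_to_org(type_name, comment="", text="", image_path=""):
    """Return an org-mode list item for one annotation, or None if unsupported."""
    if type_name == "Text":
        return f"- {comment}"
    if type_name in TEXT_MARKUP:
        # comment + text; fall back to the text alone when there is no comment
        return f"- {comment}: {text}" if comment else f"- {text}"
    if type_name == "Square":
        # A picture, written as an org-mode file link.
        return f"- [[file:{image_path}]]"
    return None


def iter_annots(pdf_path):
    """Yield (page number, type name, comment, covered text) for each annotation."""
    import fitz  # PyMuPDF; imported lazily so annot_to_org works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                type_name = annot.type[1]  # e.g. "Highlight"
                comment = annot.info.get("content", "")
                # Text inside the annotation's bounding rectangle (approximate).
                text = page.get_text("text", clip=annot.rect).strip()
                yield page.number + 1, type_name, comment, text
```

Clipping by `annot.rect` is a simplification: for multi-line highlights the bounding rectangle can include neighboring words that are not actually marked.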
Changelog
- 1.3.0
  - improve feature import-toc: Support setting the first page and fixing a page gap. See the TOC format section for more info
- 1.2.0
  - new feature export-annot: Export the annotations of a PDF
- 1.1.0
  - new feature export-toc: Export the TOC of a PDF to a human-readable file. See the TOC format section for the format
  - new feature import-toc: Import a TOC into a PDF; the TOC file shares the same format as the exported one
Credits
This project is inspired by the following tools:
- 0xabu/pdfannots: extracts and formats text annotations from a PDF file. It is based on pdfminer and formats its output as Markdown text. It handles hyphenation but does not extract rectangle annotations.
- PDFPatcher (Chinese): a great PDF utility tool.