PDFSyntax

A Python library to inspect and modify the internal structure of a PDF file

Introduction

The project is focused on chapter 7 ("Syntax") of the Portable Document Format (PDF) Specification.

PDFSyntax is lightweight (no dependencies) and written from scratch in pure Python.

  1. CLI: It started as a command-line interface to inspect the internal structure of a PDF file.
  2. API: Now the internal functions are being exposed as a toolkit for PDF read/write operations.

Project status

WORK IN PROGRESS! This is ALPHA quality software. The API may change at any time. Next on the to-do list:

  • Cut & append pages
  • Lossless compression
  • More filters
  • Improve text extraction
  • Augment text extraction with layout detection

Design

PDFSyntax favors non-destructive edits allowed by the PDF Specification: by default incremental updates are added at the end of the original file.
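
As a rough illustration of what an incremental update looks like on disk, the sketch below counts `%%EOF` markers in raw PDF bytes. It relies only on the fact that each revision of a PDF file ends with its own `%%EOF` marker. `count_eof_markers` is a hypothetical helper, not part of PDFSyntax, and the count is only an approximation: a linearized file may carry an extra marker, and the marker bytes could in principle appear inside binary stream data.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers, a rough proxy for the number of revisions."""
    return pdf_bytes.count(b"%%EOF")


# Toy byte strings standing in for a real file (these are not valid PDFs).
original = b"%PDF-1.7\n...objects...\nstartxref\n123\n%%EOF\n"
updated = original + b"...new objects...\nstartxref\n456\n%%EOF\n"

print(count_eof_markers(original))  # 1
print(count_eof_markers(updated))   # 2
```

The updated data starts with the untouched original bytes, which is what makes this kind of edit non-destructive.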

It is mostly made of simple functions working on built-in types and named tuples. Pure functions return shallow copies of the Doc object structure instead of modifying it in place, which provides a form of immutability (still experimental).
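
The general idea of pure functions returning shallow copies of a named tuple can be sketched as follows. This is only an illustration of the pattern: `ToyDoc` and `add_revision` are hypothetical names and do not reflect the real fields or functions of PDFSyntax.

```python
from collections import namedtuple

# Hypothetical stand-in for a document structure; not PDFSyntax's real Doc.
ToyDoc = namedtuple("ToyDoc", ["index", "cache"])


def add_revision(doc: ToyDoc, changes: dict) -> ToyDoc:
    """Return a new ToyDoc with one more revision; the input is left untouched."""
    new_index = doc.index + [changes]  # new list that shares the old entries
    return doc._replace(index=new_index)


doc1 = ToyDoc(index=[{"1": "original object"}], cache={})
doc2 = add_revision(doc1, {"1": "modified object"})

print(len(doc1.index), len(doc2.index))  # 1 2
print(doc2.cache is doc1.cache)          # True
```

Because the copy is shallow, both versions share the unchanged parts of the structure, so keeping the old version around is cheap.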

Installation

You can install from PyPI:

pip install pdfsyntax

CLI overview

Please refer to the CLI README for details.

The general form of the CLI usage is:

python3 -m pdfsyntax COMMAND FILE

You can get quick insights on a PDF file with these commands:

  • overview outputs text data about the structure and the metadata.
  • browse outputs static html data that lets you browse the internal structure of the PDF file: the PDF source is pretty-printed and augmented with hyperlinks.
  • text outputs the extracted text while preserving its spatial layout, as if the page had been scanned.
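
For scripting, the general form above can be wrapped in a few lines of Python. This is a minimal sketch under stated assumptions: `build_cli_command` and `run_cli` are hypothetical helpers, `pdfsyntax` is assumed to be installed for the interpreter running the script, and the CLI is assumed to write its result to standard output. Only the three commands listed above are accepted.

```python
import subprocess
import sys

COMMANDS = ("overview", "browse", "text")  # commands listed in this README


def build_cli_command(command: str, pdf_path: str) -> list:
    """Build the argv list for `python -m pdfsyntax COMMAND FILE`."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return [sys.executable, "-m", "pdfsyntax", command, pdf_path]


def run_cli(command: str, pdf_path: str) -> str:
    """Run the CLI and return what it wrote to standard output (assumed)."""
    result = subprocess.run(
        build_cli_command(command, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_cli("overview", "samples/simple_text_string.pdf")` would return the overview text, provided the package and the sample file are available.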

API overview

Please refer to the API README for details.

PDFSyntax is mostly made of simple functions. Example:

>>> from pdfsyntax import readfile, metadata
>>> doc = readfile("samples/simple_text_string.pdf")
>>> metadata(doc)  # returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc.

The Doc object is probably the only dedicated class you will need to handle. It is a black box that stores all the internal states of a document:

  • content that is cached/memoized from an original file,
  • modifications that add/modify/delete content and that are tracked as incremental updates.

>>> doc
<PDF Doc with 1 revisions(s), ready to start update/revision 2, cache loaded with 0 / 7 objects>

This object also exposes the metadata function as a method, so you can get the same result with:

>>> doc.metadata()  # returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc.

Low-level functions like get_object or update_object allow you to directly access and manipulate the inner objects of the document structure. You may also use higher-level functions like rotate:

>>> from pdfsyntax import rotate, writefile
>>> doc180 = rotate(doc, 180)  # rotate pages by 180°

The original object is unchanged, and a new object is created with an incremental update (revision 2) that holds the pending rotation:

>>> doc180
<PDF Doc with 2 revisions(s), current update/revision containing 1 modifications, cache loaded with 3 / 7 objects>

You can then write the modified PDF to disk. Note that the resulting file contains a new section appended to the original content. You may cut this section to revert the change.

>>> writefile(doc180, "rotated_doc.pdf")
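
Cutting the appended section can be sketched on raw bytes. This is a simplified illustration, not a PDFSyntax function: `drop_last_revision` is a hypothetical helper that assumes every revision ends with a `%%EOF` marker and that the marker bytes do not occur inside stream data. It also normalizes the trailing line ending to a single newline, so the result is not guaranteed to be byte-identical to every original file.

```python
EOF_MARKER = b"%%EOF"


def drop_last_revision(pdf_bytes: bytes) -> bytes:
    """Truncate the data right after the second-to-last %%EOF marker.

    Returns the input unchanged when fewer than two markers are found.
    """
    last = pdf_bytes.rfind(EOF_MARKER)
    if last == -1:
        return pdf_bytes
    previous = pdf_bytes.rfind(EOF_MARKER, 0, last)
    if previous == -1:
        return pdf_bytes
    # Keep everything up to the previous marker, plus a trailing newline.
    return pdf_bytes[: previous + len(EOF_MARKER)] + b"\n"


# Toy byte strings standing in for real files (these are not valid PDFs).
original = b"%PDF-1.7\n...objects...\nstartxref\n123\n%%EOF\n"
rotated = original + b"...rotation update...\nstartxref\n456\n%%EOF\n"

print(drop_last_revision(rotated) == original)  # True
```

On a real file, work on a copy and check the result in a PDF viewer before discarding anything.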

Open-Source, not Open-Contribution yet

PDFSyntax is MIT licensed but is currently closed to contributions.

Personal note: this is a pet project of mine and my time is limited. First I need to focus on my roadmap (new features and refactoring), and then I will happily accept contributions when everything is a little more stabilised.

pdfsyntax's People

Contributors

desgeeko

pdfsyntax's Issues

readfile: TypeError: 'NoneType' object is not subscriptable

Hi! I am trying to load this PDF but I get the following error. Any ideas?

I have tested other PDFs and it happens every time.

Thanks for your work.

{
	"name": "TypeError",
	"message": "'NoneType' object is not subscriptable",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 doc = readfile(\"/Users/mascit/Downloads/pdftests tests/dino copy.pdf\")

File ~/miniconda3/envs/pdftests/lib/python3.10/site-packages/pdfsyntax/api.py:69, in readfile(filename)
     67 \"\"\"Read file and initialize doc.\"\"\"
     68 with open(filename, 'rb') as file_obj:
---> 69     doc = load(file_obj, \"SINGLE\")
     70 return doc

File ~/miniconda3/envs/pdftests/lib/python3.10/site-packages/pdfsyntax/api.py:57, in load(file_obj, mode)
     55 \"\"\"Load from file.\"\"\"
     56 fdata = bdata_provider(file_obj, mode)
---> 57 return doc_constructor(fdata)

File ~/miniconda3/envs/pdftests/lib/python3.10/site-packages/pdfsyntax/api.py:50, in doc_constructor(fdata)
     48 cache = build_cache(fdata, index)
     49 doc_initial = Doc(index, cache, data)
---> 50 doc_new_rev = commit(doc_initial)
     51 return doc_new_rev

File ~/miniconda3/envs/pdftests/lib/python3.10/site-packages/pdfsyntax/docstruct.py:332, in commit(doc)
    330 def commit(doc: Doc) -> Doc:
    331     \"\"\"Add new index for incremental update.\"\"\"
--> 332     if len(changes(doc)) == 0:
    333         return doc
    334     nb_rev = len(doc.index)

File ~/miniconda3/envs/pdftests/lib/python3.10/site-packages/pdfsyntax/docstruct.py:175, in changes(doc, rev)
    173     previous = doc.index[rev-1]
    174 for i in range(1, len(current)):
--> 175     iref = get_iref(doc, i, rev)
    176     if i > len(previous)-1:
    177         res.append((iref, 'a'))

File ~/miniconda3/envs/pdftests/lib/python3.10/site-packages/pdfsyntax/docstruct.py:151, in get_iref(doc, o_num, rev)
    149 \"\"\"Build the relevant indirect reference for o_num in a doc revision.\"\"\"
    150 current = doc.index[rev]
--> 151 o_gen = current[o_num]['o_gen']
    152 return complex(o_gen, o_num)

TypeError: 'NoneType' object is not subscriptable"
}
