
pdflexer

pdflexer is a PDF parsing library focused on efficient parsing and modification of PDF files. It is mainly targeted at users familiar with the PDF spec. It is generally very fast at what it does (e.g. splitting, merging, and text extraction show several times better performance than alternatives). The parsing logic was implemented from scratch, but some higher-level functionality (e.g. filters) was ported from the pdf.js project.

pdflexer differs from existing .NET libraries in that it:

  • Is primarily designed for PDF modification (not just reading). Any object / page read from a PDF can be modified and written to other PDFs.
  • Has a mutable model for page contents: move, delete, or modify existing text and graphics on a page (note: in active development).
  • Has lazy parsing features which allow objects to be parsed on demand, increasing performance in many cases.
  • Uses modern .NET features (nullable enabled, Span, ArrayPool, generic math).
  • Is designed for direct access to the native PDF object types. Any higher-level objects are simple wrappers around the native PDF object types (e.g. PdfPage is a wrapper around a PdfDictionary; the PdfDictionary can be modified directly for features not implemented on PdfPage).
  • Attempts to be performant / efficient. Not a ton of effort has been put in here yet, but it is a goal to keep this in mind.
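
The lazy parsing and thin-wrapper points can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not pdflexer's API: LazyObject, PageWrapper, and parse_int are hypothetical names invented for this example. Only the general idea (parse on first access, keep the native dictionary reachable) is taken from the list above.

```python
from typing import Callable, Dict


class LazyObject:
    """Keeps the raw bytes of an object and parses them only on first access."""

    def __init__(self, raw: bytes, parser: Callable[[bytes], object]) -> None:
        self._raw = raw
        self._parser = parser
        self._value: object = None
        self._parsed = False
        self.parse_count = 0  # illustration only: counts how often parsing ran

    def resolve(self) -> object:
        if not self._parsed:
            self._value = self._parser(self._raw)
            self._parsed = True
            self.parse_count += 1
        return self._value


class PageWrapper:
    """Thin wrapper over a plain dictionary (hypothetical stand-in for a page type)."""

    def __init__(self, native: Dict[str, object]) -> None:
        # The underlying dictionary stays public so callers can edit it directly.
        self.native = native

    @property
    def rotate(self) -> object:
        entry = self.native.get("Rotate")
        if entry is None:
            return 0  # a missing Rotate entry means no rotation
        if isinstance(entry, LazyObject):
            return entry.resolve()
        return entry


def parse_int(raw: bytes) -> int:
    """Hypothetical parser for an integer token."""
    return int(raw.decode("ascii").strip())


rotate_entry = LazyObject(b" 90 ", parse_int)
page = PageWrapper({"Type": "Page", "Rotate": rotate_entry})
```

In this sketch nothing is parsed until page.rotate is first read, and repeated reads reuse the cached value. A key the wrapper does not expose (for example UserUnit) can still be set through page.native, which mirrors the "modify the dictionary directly" design described above.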

State of library

Feature WIP Alpha Beta Release
Document access ✔️
General modification (non page content) ✔️
Merging / splitting ✔️
Streaming writer ✔️
Page content access ✔️
Text extraction ✔️
Image extraction ✔️
Resource dedup ✔️
Content creation ✔️
Content redaction ✔️
Mutable Content ✔️
  • Release - API stable and few breaking changes are expected. Feature has significant test coverage and has been used in real use cases on a wide variety of PDFs.
  • Beta - API stable but some breaking changes are expected. Feature has some test coverage and has been used in some real use cases.
  • Alpha - API unstable and breaking changes are expected. Feature generally functional but may lack test coverage and may not have any real use.
  • WIP - API unstable and many breaking changes are expected. Feature may have significant bugs, may lack test coverage and may not have any real use.

Major Gaps

  • Filter support is incomplete (ascii85, asciihex, ccitt, deflate, lzw, and run length are completed so far)
  • Public API cleanup / documentation. Lots of classes / properties exposed that will likely be internalized.
  • Documentation / examples
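
For readers less familiar with stream filters, the following is a minimal sketch of what two of the filters named above do when chained. It uses only the Python standard library and is not how pdflexer (or pdf.js) implements them; ascii_hex_decode, flate_decode, DECODERS, and decode_stream are hypothetical names for this illustration.

```python
import binascii
import zlib
from typing import Callable, Dict, List


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, whitespace ignored, '>' ends the data."""
    body = data.split(b">", 1)[0]      # keep everything before the end marker
    digits = b"".join(body.split())    # drop ASCII whitespace between digits
    if len(digits) % 2:
        digits += b"0"                 # odd digit count: a trailing 0 is assumed
    return binascii.unhexlify(digits)


def flate_decode(data: bytes) -> bytes:
    """Decode FlateDecode data (zlib/deflate), ignoring predictor parameters."""
    return zlib.decompress(data)


DECODERS: Dict[str, Callable[[bytes], bytes]] = {
    "ASCIIHexDecode": ascii_hex_decode,
    "FlateDecode": flate_decode,
}


def decode_stream(data: bytes, filters: List[str]) -> bytes:
    """Apply the filters in the order they are listed for the stream."""
    for name in filters:
        data = DECODERS[name](data)
    return data


# Build a sample stream: compress first, then hex-encode, so decoding
# has to undo the hex layer before the deflate layer.
content = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = binascii.hexlify(zlib.compress(content)) + b">"
decoded = decode_stream(encoded, ["ASCIIHexDecode", "FlateDecode"])
```

Here decoded equals the original content bytes. The sketch leaves out decode parameters such as predictors, which a real filter implementation has to handle.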

Examples

Some examples are available as polyglot notebooks in the /examples/ folder.

pdflexer's Projects

pdfanonymization icon pdfanonymization

Automated PDF anonymization built on Microsoft Presidio and pdflexer redaction
