Giter Club home page Giter Club logo

pdfstructure's Introduction

PDF Structural Parser

pdfstructure detects, splits and organises the documents text content into its natural structure as envisioned by the author. The document structure, or hierarchy, stores the relation between chapters, sections and their sub sections in a nested, recursive manner.

pdfstructure is in early development and built on top of pdfminer.six.

  • Paragraph extraction is performed leveraging pdfminer.high_level.extract_pages().
  • Those paragraphs are then grouped together according to some basic (extendable) heuristics.

Document Model

class StructuredDocument:
  metadata: dict
  sections: List[Section]

class Section:
  content:  TextElement
  children: List[Section]
  level:    int

class TextElement:
  text:     LTTextContainer # the extracted paragraph from pdfminer
  style:    Style

Load and parse PDF

Illustration of document structure

The following screenshot contains sections and subsections with their respective content. In that case, the structure can be easily parsed by leveraging the Font Style only.

Example PDF PDF source: github.com/TSiege

Parse PDF

    from pdfstructure.hierarchy.parser import HierarchyParser
    from pdfstructure.source import FileSource

    parser = HierarchyParser() 
    
    # specify source (that implements source.read())
    source = FileSource(path) 
     
    # analyse document and parse as nested data structure
    document = parser.parse_pdf(source)

Serialize Document to String

To export the parsed structure, use a printer implementation.

    from pdfstructure.printer import PrettyStringPrinter

    stringExporter = PrettyStringPrinter()
    prettyString = stringExporter.print(document)

Excerpt of the parsed document (serialized to string)

Parsed data: interview_cheatsheet_pretty.txt

[Search Basics]
	[Breadth First Search]
		[Definition:]
			An algorithm that searches a tree (or graph) by searching levels of the tree first, starting at the root.
			It finds every node on the same level, most often moving left to right.
			While doing this it tracks the children nodes of the nodes on the current level.
			When finished examining a level it moves to the left most node on the next level.
			The bottom-right most node is evaluated last (the node that is deepest and is farthest right of it's level).

		[What you need to know:]
			Optimal for searching a tree that is wider than it is deep.
			Uses a queue to store information about the tree while it traverses a tree.
			Because it uses a queue it is more memory intensive than depth first search.
			The queue uses more memory because it needs to stores pointers

Encode Document to JSON

    from pdfstructure.printer import JsonFilePrinter
    
    printer = JsonFilePrinter()
    file_path = Path("resources/parsed/interview_cheatsheet.json")
    
    printer.print(document, file_path=str(file_path.absolute()))

Parsed data: interview_cheatsheet.json

Excerpt of exported json

{
    "metadata": {
        "style_info": {
            "_data": {
                "7.99": 24,
                "9.6": 1,
                "6.4": 7,
                "7.47": 3,
                "12.8": 7,
                "8.53": 206,
                "10.67": 12,
                "7.25": 14
            },
            "_body_size": 8.53,
            "_min_found_size": 6.4,
            "_max_found_size": 12.8
        },
        "filename": "interview_cheatsheet.pdf"
    },
    "elements": [
     {
            "content": {
                "style": {
                    "bold": true,
                    "italic": false,
                    "font_name": ".SFNSDisplay-Semibold",
                    "mapped_font_size": "xlarge",
                    "mean_size": 12.8,
                    "max_size": 12.806323818403143
                },
                "page": 0,
                "text": "Data Structure Basics"
            },
            "children": [
                {
                    "content": {
                        "style": {
                            "bold": true,
                            "italic": false,
                            "font_name": ".SFNSDisplay-Semibold",
                            "mapped_font_size": "large",
                            "mean_size": 10.6,
                            "max_size": 10.671936515335972
                        },
                        "page": 0,
                        "text": "Array"
                    },
                    "children": [
                        {
                            "content": {
                                "style": {
                                    "bold": true,
                                    "italic": false,
                                    "font_name": ".SFNSText-Semibold",
                                    "mapped_font_size": "middle",
                                    "mean_size": 8.5,
                                    "max_size": 8.537549212268772
                                },
                                "page": 0,
                                "text": "Definition:"
                            },
         ....          

Load JSON as StructuredPdfDocument

Of course, encoded documents can be easily decoded and used for further analysis. However, detailed information like bounding boxes or coordinates for each character are not persisted.

    from pdfstructure.model.document import StructuredPdfDocument

    jsonString = json.load(file)
    document = StructuredPdfDocument.from_json(jsonString)
    
    print(document.title)
``
        $ "interview_cheatsheet.pdf"

Traverse through document structure

Having all paragraphs and sections organised as a general tree, its straight forward to iterate through the layers and search for specific elements like headlines, or extract all main headers like chapter titles.

Two document traversal generators are available that yield each section in-order or in level-order respectively.

    from pdfstructure.hierarchy.traversal import traverse_in_order

    elements_flat_in_order = [element for element in traverse_in_order(document)]

    Exemplary illustration of yield order:
        """
                         5   10
                      /   \    \
                     1     2    3
                   / | \        |
                  a  b  c       x
    
        yield order:
        - [5,1,a,b,c,2,10,3,x]
        """

TODOs

  • Detect the document layout type (Columns, Book, Magazine)

    The provided layout analysis algorithm by pdfminer.six performs well on more straightforward documents with default settings. However, more complicated layouts like scientific papers need custom LAParams settings to retrieve paragraphs in correct reading order.

  • High level diagram of algorithm workflow

  • Performance improvement in terms of speed

pdfstructure's People

Contributors

chrizh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.