pdf2docx

  • Parse text, tables and layout from PDF files with PyMuPDF
  • Generate docx with python-docx

Features

  • Parse and re-create text format
    • font style, e.g. font name, size, weight, italic and color
    • highlight, underline, strike-through in PDFs converted from docx
    • highlight, underline, strike-through applied from PDF annotations
  • Parse and re-create list style
  • Parse and re-create table
    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • cells with vertical text direction
  • Rebuild page layout in docx
    • text in horizontal direction: from left to right
    • text in vertical direction: from bottom to top
    • in-line image
    • paragraph layout: horizontal and vertical spacing

It can also be used as a tool to extract table contents, since both the content and the format/style of tables are parsed.

Limitations

  • Text-based PDF file only
  • Normal reading direction only
    • horizontal/vertical paragraph/line/word
    • no word transformation, e.g. rotation
  • No floating images
  • Tables with full borders only

Installation

From PyPI

$ pip install pdf2docx

From source code

Clone or download this project, and navigate to the root directory:

$ python setup.py install

Or install it in development mode:

$ python setup.py develop

Uninstall

$ pip uninstall pdf2docx

Usage

By range of pages

$ pdf2docx test.pdf test.docx --start=5 --end=10

By page numbers

$ pdf2docx test.pdf test.docx --pages=5,7,9

Command-line help

$ pdf2docx --help

NAME
    pdf2docx - Run the pdf2docx parser.

SYNOPSIS
    pdf2docx PDF_FILE DOCX_FILE <flags>

DESCRIPTION
    Run the pdf2docx parser.

POSITIONAL ARGUMENTS
    PDF_FILE
        PDF filename to read from
    DOCX_FILE
        DOCX filename to write to

FLAGS
    --start=START
        first page to process, starting from zero
    --end=END
        last page to process, starting from zero
    --pages=PAGES
        comma-separated page numbers to process, e.g. 5,7,9
    --debug=DEBUG
        create illustration pdf showing layouts if True, else do nothing

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS
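
When the command-line tool is called from a script, the flags listed in the help text can be assembled programmatically. The following is a minimal sketch, not part of pdf2docx: `build_command` is a hypothetical helper, and it only uses the flags shown in the help output above.

```python
def build_command(pdf_file, docx_file, start=None, end=None, pages=None, debug=False):
    """Return the pdf2docx command line as a list of arguments.

    Hypothetical helper: it only assembles the flags listed by
    `pdf2docx --help` and does not run anything itself.
    """
    cmd = ["pdf2docx", pdf_file, docx_file]
    if start is not None:
        cmd.append(f"--start={start}")
    if end is not None:
        cmd.append(f"--end={end}")
    if pages:
        # --pages takes comma-separated page numbers, e.g. 5,7,9
        cmd.append("--pages=" + ",".join(str(p) for p in pages))
    if debug:
        cmd.append("--debug=True")
    return cmd


print(build_command("test.pdf", "test.docx", start=5, end=10))
print(build_command("test.pdf", "test.docx", pages=[5, 7, 9]))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the standard library, assuming the `pdf2docx` executable is on the `PATH`.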

As a library

# Requires pdf2docx, installed with `pip install pdf2docx`
# or `python setup.py install`.

from pdf2docx.main import parse

pdf_file = '/path/to/sample.pdf'
docx_file = '/path/to/sample.docx'

# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=1)
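
The same call can be applied to a whole folder of PDF files. The sketch below is one possible way to do it, under the assumption that the converter is called as `convert(pdf_file, docx_file, **kwargs)`, as `parse` is in the example above. `convert_folder` is a hypothetical helper, not part of pdf2docx, and the converter is passed in as an argument so the helper itself has no dependency on the library.

```python
from pathlib import Path


def convert_folder(folder, convert, **kwargs):
    """Convert every *.pdf in `folder`, writing a .docx next to each one.

    `convert` is any callable taking (pdf_file, docx_file, **kwargs),
    for example `parse` from pdf2docx.main. Extra keyword arguments
    (such as start and end) are forwarded unchanged.
    Returns the list of docx paths that were requested.
    """
    outputs = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        docx_path = pdf_path.with_suffix(".docx")
        convert(str(pdf_path), str(docx_path), **kwargs)
        outputs.append(docx_path)
    return outputs
```

With pdf2docx installed, `convert_folder('/path/to/folder', parse, start=0, end=1)` would call `parse` once per PDF file, in alphabetical order.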

Or just extract the tables:

from pdf2docx.main import extract_tables

pdf_file = '/path/to/sample.pdf'

tables = extract_tables(pdf_file, start=0, end=1)
for table in tables:
    print(table)

# outputs
...
[['Input ', None, None, None, None, None], 
['Description A ', 'mm ', '30.34 ', '35.30 ', '19.30 ', '80.21 '],
['Description B ', '1.00 ', '5.95 ', '6.16 ', '16.48 ', '48.81 '],
['Description C ', '1.00 ', '0.98 ', '0.94 ', '1.03 ', '0.32 '],
['Description D ', 'kg ', '0.84 ', '0.53 ', '0.52 ', '0.33 '],
['Description E ', '1.00 ', '0.15 ', None, None, None],
['Description F ', '1.00 ', '0.86 ', '0.37 ', '0.78 ', '0.01 ']]
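
Each extracted table is a list of rows, so it can be written out with the standard `csv` module. The sketch below is an illustration only: `table_to_csv` is a hypothetical helper, and it assumes that `None` marks a cell without content of its own (for example part of a merged cell) and that the trailing spaces in the cell text are not wanted.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text.

    Assumption: None marks a cell with no content of its own, so it is
    written as an empty field; surrounding whitespace is stripped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow("" if cell is None else cell.strip() for cell in row)
    return buffer.getvalue()


# A few rows taken from the sample output above.
table = [
    ['Input ', None, None, None, None, None],
    ['Description A ', 'mm ', '30.34 ', '35.30 ', '19.30 ', '80.21 '],
    ['Description E ', '1.00 ', '0.15 ', None, None, None],
]
print(table_to_csv(table))
```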

Sample

[Image: sample_compare.png]

Contributors

  • dothinking
