pdftables - a library for extracting tables from PDF files

This Readme, and more, is available on ReadTheDocs.

This post on the ScraperWiki blog describes the algorithms used in pdftables, and something of its genesis. This README gives more technical information.

pdftables uses pdfminer to get information on the locations of text elements in a PDF document. pdfminer was chosen as a base because it provides information on the full range of page elements in PDF files, including graphical elements such as lines. The current algorithms do not use these graphical elements, but using them is planned for future work. As a pure-Python library, pdfminer is very portable. Its downside is speed: it is perhaps an order of magnitude slower than alternative C-based libraries.

Installation

You need poppler and Cairo. On Ubuntu and friends you can go:

Then we can install the pip-able requirements from the requirements.txt file:

Usage

First we get a file object to a PDF:

Then we create a PDF element from the file object:

Then we use the get_page() method to select a single page from the document:

You can also loop over all pages in the PDF using get_pages():

Now you have a TableContainer object, you can convert it to ASCII for quick previewing:

table.data is a table that has been found, in the form of a list of lists of strings (i.e. a list of rows, each containing the same number of cells).
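To make that shape concrete, here is a minimal sketch of an ASCII preview over a list of rows. It is written for Python 3 and does not import pdftables: `render_ascii` and the sample rows are hypothetical stand-ins, not the library's `to_string` function or real extracted data.

```python
def render_ascii(rows):
    """Render a list of rows (each a list of strings) as a bordered ASCII grid."""
    if not rows:
        return ""
    # Each column is as wide as its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    border = "+" + "+".join("-" * (width + 2) for width in widths) + "+"
    lines = [border]
    for row in rows:
        cells = (" " + cell.ljust(width) + " " for cell, width in zip(row, widths))
        lines.append("|" + "|".join(cells) + "|")
        lines.append(border)
    return "\n".join(lines)


# Hand-written stand-in for table.data: every row has the same number of cells.
sample_data = [
    ["Name", "Qty"],
    ["apple", "3"],
]

print(render_ascii(sample_data))
```

Because every row has the same number of cells, `zip(*rows)` transposes the table cleanly into columns, which is what makes the width calculation a one-liner.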

Command line tool

pdftables includes a command line tool for diagnostic rendering of pages and tables, called pdftables-render. This is installed if you pip install pdftables, or you manually run python setup.py.

Running pdftables-render on a PDF creates separate PNG and SVG files for each page of the specified PDF, in png/ and svg/, with three diagnostic displays per page.

Developing pdftables

Files and folders:

.
|-fixtures
| |-sample_data
|-pdftables
|-test

fixtures contains test fixtures. In particular, the sample_data directory contains PDF files, which are installed from a different repository by running the download_test_data.sh script.

We also use data from http://www.tamirhassan.com/competition/dataset-tools.html, which the download script installs as well.

pdftables contains the core code files

test contains tests

pdftables.py - this is the core of the pdftables library

counter.py - implements collections.Counter for the benefit of Python 2.6

display.py - prettily prints a table by implementing the to_string function

numpy_subset.py - partially implements numpy.diff, numpy.arange and numpy.average to avoid a large dependency on numpy.

pdf_document.py - implements PDFDocument to abstract away the underlying PDF class, and to ease any future switch from pdfminer to a different underlying PDF library
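The idea behind numpy_subset.py, reimplementing a few numerical helpers to avoid depending on numpy, can be sketched as follows. This is not the file's actual code: the function bodies below are an assumption about how `diff`, `arange` and `average` might be written in pure Python 3, and they cover only 1-D inputs and unweighted averages.

```python
import math


def diff(values):
    """Differences between consecutive elements, like numpy.diff on a 1-D input."""
    return [b - a for a, b in zip(values, values[1:])]


def arange(start, stop, step=1):
    """Evenly spaced values in the half-open interval [start, stop)."""
    if step == 0:
        raise ValueError("step must be non-zero")
    # Same length rule as numpy.arange: ceil((stop - start) / step), never negative.
    count = max(0, int(math.ceil((stop - start) / step)))
    return [start + i * step for i in range(count)]


def average(values):
    """Unweighted arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("cannot average an empty sequence")
    return sum(values) / len(values)


print(diff([1, 4, 9, 16]))    # [3, 5, 7]
print(arange(0, 10, 3))       # [0, 3, 6, 9]
print(average([1, 2, 3, 4]))  # 2.5
```

These helpers return plain lists rather than arrays, which is enough for simple gap and mean calculations but gives up numpy's vectorised arithmetic.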

Contributors

drj11, frabcus, morty, pwaller, zarino
