pdftables's Introduction

pdftables - a library for extracting tables from PDF files

pdftables uses pdfminer to get information on the locations of text elements in a PDF document.

First we get a file handle to a PDF:

import os

# PDF_TEST_FILES is the directory holding the PDFs; SelectedPDF is the file name
filepath = os.path.join(PDF_TEST_FILES, SelectedPDF)
fh = open(filepath, 'rb')

Then we use our getPDFPage function to select a single page from the document, and pass that page to pageToTables to extract the table:

pdfPage = getPDFPage(fh, pagenumber)
table, diagnosticData = pageToTables(pdfPage, extend_y=False, hints=hints, atomise=False)

Setting the optional extend_y parameter to True extends the grid used to extract the table to the full height of the page. The optional hints parameter is a two-element array of strings: the first element should contain unique text at the top of the table, and the second should contain unique text from the bottom row of the table. Setting the optional atomise parameter to True converts all the text to individual characters; this is slower but will sometimes split closely separated columns.
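To make the role of hints more concrete, the sketch below mimics the idea on plain rows of text: it keeps only the rows from the one containing the top hint down to the one containing the bottom hint. This is an illustration only and not how pdftables works internally; the helper crop_rows_by_hints and the sample rows are hypothetical.

```python
def crop_rows_by_hints(rows, hints):
    """Keep rows from the one containing hints[0] to the one containing hints[1].

    Hypothetical helper for illustration; not part of pdftables.
    """
    top_text, bottom_text = hints
    # Index of the first row in which any cell contains the top hint.
    top = next(i for i, row in enumerate(rows)
               if any(top_text in cell for cell in row))
    # Index of the first row at or after `top` that contains the bottom hint.
    bottom = next(i for i in range(top, len(rows))
                  if any(bottom_text in cell for cell in rows[i]))
    return rows[top:bottom + 1]


# Made-up page content: a title, a three-row table, and a footer.
page_rows = [
    ["Annual report"],
    ["Region", "Sales"],
    ["North", "120"],
    ["Total", "310"],
    ["Page 1 of 4"],
]
hints = ["Region", "Total"]  # unique text at the top and in the bottom row
cropped = crop_rows_by_hints(page_rows, hints)
```

The hint strings need to be unique on the page, as the text above says; otherwise a match in a title or footer would move the boundaries of the cropped region.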

table is a list of lists of strings. diagnosticData is an object containing diagnostic information which can be displayed using the plotpage function:

fig, ax1 = plotpage(diagnosticData)
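Because table is a list of lists of strings, it can be passed straight to Python's standard csv module. The following is a minimal sketch, assuming a table of that shape; table_to_csv is a hypothetical helper and sample_table is made-up data standing in for the value returned by pageToTables.

```python
import csv
import io


def table_to_csv(table):
    """Serialise a list of lists of strings to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Made-up stand-in for the `table` returned by pageToTables.
sample_table = [["Region", "Sales"], ["North", "120"], ["South", "1,900"]]
csv_text = table_to_csv(sample_table)
```

The csv writer quotes any cell that contains a comma, so values such as "1,900" survive a round trip through a spreadsheet.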
