pdftables uses pdfminer to get information on the locations of text elements in a PDF document.
First we get a file handle to a PDF:
filepath = os.path.join(PDF_TEST_FILES, SelectedPDF)
fh = open(filepath, 'rb')
Then we use our getPDFPage function to select a single page from the document, and pageToTables to extract the table from that page:
pdfPage = getPDFPage(fh, pagenumber)
table, diagnosticData = pageToTables(pdfPage, extend_y=False, hints=hints, atomise=False)
Setting the optional extend_y parameter to True extends the grid used to extract the table to the full height of the page.
The optional hints parameter is a two-element array of strings. The first element should contain unique text from the top of the table, and the second element should contain unique text from the bottom row of the table.
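As a minimal sketch, a hints value could be built as shown here. Both strings are hypothetical placeholders, not taken from any real document; replace them with text that appears only once on the page.

```python
# Hypothetical hint strings, for illustration only.
top_hint = "Summary of results"   # unique text at the top of the table
bottom_hint = "Total"             # unique text in the bottom row of the table

# Two-element array: [top text, bottom text]
hints = [top_hint, bottom_hint]
```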
Setting the optional atomise parameter to True breaks all the text into individual characters. This is slower, but will sometimes split closely separated columns.
table is a list of lists of strings. diagnosticData is an object containing diagnostic information, which can be displayed using the plotpage function:
fig, ax1 = plotpage(diagnosticData)
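Because table is a list of lists of strings, it can be passed straight to the standard library csv module. The sketch below assumes Python 3 and uses a made-up sample table in place of real pageToTables output; the helper name table_to_csv is hypothetical and not part of pdftables.

```python
import csv
import io

# Made-up sample standing in for the `table` returned by pageToTables.
table = [
    ["Year", "Sales"],
    ["2011", "1,200"],
    ["2012", "1,350"],
]


def table_to_csv(rows):
    """Serialise a list of lists of strings as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)  # one CSV line per table row
    return buffer.getvalue()


csv_text = table_to_csv(table)
print(csv_text)
```

The csv writer quotes any cell that contains a comma, so a value such as 1,200 survives a round trip through csv.reader unchanged.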