
PDFQuery

PDFQuery is a light wrapper around pdfminer, lxml and pyquery. It's designed to reliably extract data from sets of PDFs with as little code as possible.

Install with easy_install pdfquery or pip install pdfquery.

The basic idea is to transform a PDF document into an element tree so we can find items with JQuery-like selectors using pyquery. Suppose we're trying to extract a name from a set of PDFs, but all we know is that it appears underneath the words "Your first name and initial" in each PDF:

>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
>>> pdf.load()
>>> label = pdf.pq('LTTextLineHorizontal:contains("Your first name and initial")')
>>> left_corner = float(label.attr('x0'))
>>> bottom_corner = float(label.attr('y0'))
>>> name = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (left_corner, bottom_corner-30, left_corner+150, bottom_corner)).text()
>>> name
'John E.'

Note that we don't have to know where the name is on the page, or what page it's on, or how the PDF has it stored internally.

Performance Note: The initial call to pdf.load() runs very slowly, because the underlying pdfminer library has to compare every element on the page to every other element. See the Caching section to avoid this on subsequent runs.

Now let's extract and format a bunch of data all at once:

>>> pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
>>> pdf.extract( [
     ('with_parent', 'LTPage[pageid="1"]'),
     ('with_formatter', 'text'),

     ('last_name', 'LTTextLineHorizontal:in_bbox("315,680,395,700")'),
     ('spouse', 'LTTextLineHorizontal:in_bbox("170,650,220,680")'),

     ('with_parent', 'LTPage[pageid="2"]'),

     ('oath', 'LTTextLineHorizontal:contains("perjury")', lambda match: match.text()[:30]+"..."),
     ('year', 'LTTextLineHorizontal:contains("Form 1040A (")', lambda match: int(match.text()[-5:-1]))
 ])

Result:

{'last_name': 'Michaels',
 'spouse': 'Susan R.',
 'year': 2007,
 'oath': 'Under penalties of perjury, I ...'}

PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.

First pdfminer opens the document and reads its layout. You can access the pdfminer document at pdf.doc:

>>> pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
>>> pdf.doc
<pdfminer.pdfparser.PDFDocument object at 0xd95c90>
>>> pdf.doc.catalog # fetch attribute of underlying pdfminer document
{'JT': <PDFObjRef:14>, 'PageLabels': <PDFObjRef:10>, 'Type': /Catalog, 'Pages': <PDFObjRef:12>, 'Metadata': <PDFObjRef:13>}

Next the layout is turned into an lxml.etree with a pyquery wrapper. After you call pdf.load() (by far the most expensive operation in the process), you can access the etree at pdf.tree, and the pyquery wrapper at pdf.pq:

>>> pdf.load()
>>> pdf.tree
<lxml.etree._ElementTree object at 0x106a285f0>
>>> pdf.tree.write("test2.xml", pretty_print=True, encoding="utf-8")
>>> pdf.tree.xpath('//*/LTPage')
[<Element LTPage at 0x994cb0>, <Element LTPage at 0x994a58>]
>>> pdf.pq('LTPage[pageid=1] :contains("Your first name")')
[<LTTextLineHorizontal>]

You'll save some time and memory if you call load() with only the page numbers you need. For example:

>>> pdf.load(0, 2, 3, range(4,8))

Under the hood, pdf.tree is basically an XML representation of the layout tree generated by pdfminer.pdfinterp. By default the tree is processed to combine individual character nodes, remove extra spaces, and sort the tree spatially. You can always get back to the original pdfminer Layout object from an element fetched by xpath or pyquery:

>>> pdf.pq(':contains("Your first name and initial")')[0].layout
<LTTextLineHorizontal 143.651,714.694,213.083,721.661 u'Your  first  name  and  initial\n'>

PDFs are internally messy, so it's usually not helpful to find things based on document structure or element classes the way you would with HTML. Instead the most reliable selectors are the static labels on the page, which you can find by searching for their text contents, and physical location on the page. PDF coordinates are given in points (72 to the inch) starting from the bottom left corner. PDFMiner (and so PDFQuery) describes page locations in terms of bounding boxes, or bboxes. A bbox consists of four coordinates: the X and Y of the lower left corner, and the X and Y of the upper right corner.
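To make the coordinate conventions concrete, here is a hypothetical helper (not part of PDFQuery's API) that builds an :in_bbox() selector string from the four corner coordinates:

```python
def bbox_selector(x0, y0, x1, y1, tag="LTTextLineHorizontal"):
    # Coordinates are in points (72 per inch), measured from the page's
    # lower-left corner: (x0, y0) is the box's lower-left corner and
    # (x1, y1) its upper-right corner.
    return '%s:in_bbox("%s, %s, %s, %s")' % (tag, x0, y0, x1, y1)

selector = bbox_selector(315, 680, 395, 700)
# selector == 'LTTextLineHorizontal:in_bbox("315, 680, 395, 700")'
```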

If you're scraping text that's always in the same place on the page, the easiest way is to use Acrobat Pro's Measurement Tool, Photoshop, or a similar tool to measure distances (in points) from the lower left corner of the page, and use those distances to craft a selector like :in_bbox("x0,y0,x1,y1") (see below for more on in_bbox).

If you're scraping text that might be in different parts of the page, the same basic technique applies, but you'll first have to find an element with consistent text that appears a consistent distance from the text you want, and then calculate the bbox relative to that element. See the Quick Start for an example of that approach.
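The relative-offset arithmetic can be wrapped in a small helper. This is a hypothetical sketch (region_below is not a PDFQuery function; the 30- and 150-point offsets are simply the values used in the Quick Start):

```python
def region_below(label_x0, label_y0, height=30, width=150):
    # Return the bbox (x0, y0, x1, y1) of a region of the given size
    # directly below a label's lower-left corner. PDF y coordinates grow
    # upward, so "below" means subtracting from y.
    return (label_x0, label_y0 - height, label_x0 + width, label_y0)

bbox = region_below(143.651, 714.694)  # a label corner like the Quick Start's
selector = ':in_bbox("%s, %s, %s, %s")' % bbox
```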

If both of those fail, your best bet is to dump the xml using pdf.tree.write(filename, pretty_print=True), and see if you can find any other structure, tags or elements that reliably identify the part you're looking for. This is also helpful when you're trying to figure out why your selectors don't match ...

The version of pyquery returned by pdf.pq supports some PDF-specific selectors to find elements by location on the page.

  • :in_bbox("x0,y0,x1,y1"): Matches only elements that fit entirely within the given bbox.
  • :overlaps_bbox("x0,y0,x1,y1"): Matches any elements that overlap the given bbox.
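The difference between the two is strict containment versus any intersection. Here is a hedged sketch of the geometry (these helpers are illustrative, not PDFQuery's implementation), with bboxes as (x0, y0, x1, y1) tuples:

```python
def fits_in_bbox(inner, outer):
    # :in_bbox semantics: the element lies entirely within the box.
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

def overlaps_bbox(a, b):
    # :overlaps_bbox semantics: the two boxes share any area.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

label = (315, 680, 395, 700)
assert fits_in_bbox(label, (300, 650, 400, 710))      # fully contained
assert overlaps_bbox(label, (390, 690, 500, 750))     # partial overlap
assert not fits_in_bbox((390, 690, 500, 750), label)  # overlap is not containment
```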

If you need a selector that isn't supported, you can write a filtering function returning a boolean (pyquery exposes the element being tested as this):

>>> def big_elements():
...     return float(this.get('width', 0)) * float(this.get('height', 0)) > 40000
>>> pdf.pq('LTPage[page_index="1"] *').filter(big_elements)
[<LTTextBoxHorizontal>, <LTRect>, <LTRect>]

(If you come up with any particularly useful filters, patch them into pdfquery.py as selectors and submit a pull request ...)

PDFQuery accepts an optional caching argument that will store the results of PDF parsing, so subsequent runs on the same file will be much quicker. For example:

from pdfquery.cache import FileCache
pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf", parse_tree_cacher=FileCache("/tmp/"))

Often you're going to want to grab a bunch of different data from a PDF, using the same repetitive process: (1) find an element of the document using a pyquery selector or Xpath; (2) parse the resulting text; and (3) store it in a dict to be used later.

The extract method simplifies that process. Given a list of keywords and selectors:

>>> pdf.extract([
      ('last_name', ':in_bbox("315,680,395,700")'),
      ('year', ':contains("Form 1040A (")', lambda match: int(match.text()[-5:-1]))
 ])

the extract method returns a dictionary (by default) with a pyquery result set for each keyword, optionally processed through the supplied formatting function. In this example the result is:

{'last_name': [<LTTextLineHorizontal>], 'year': 2007}

(It's often helpful to start with ('with_formatter', 'text') so you get results like "Michaels" instead of [<LTTextLineHorizontal>]. See Special Keywords below for more.)

By default, extract searches the entire tree (or the part of the document loaded earlier by load(), if it was limited to particular pages). If you want to limit the search to a part of the tree that you fetched with pdf.pq() earlier, pass that in as the second parameter after the list of searches.

Notice that the 'year' example above contains an optional third parameter -- a formatting function. The formatting function is passed a pyquery match result, so lambda match: match.text() will return the text contents of the matched elements.

Instead of a string, the selector can be a filtering function returning a boolean:

>>> pdf.extract([('big', big_elements)])
{'big': [<LTPage>, <LTTextBoxHorizontal>, <LTRect>, <LTRect>, <LTPage>, <LTTextBoxHorizontal>, <LTRect>]}

(See Custom Selectors above for how to define functions like big_elements.)

extract also looks for two special keywords in the list of searches that set defaults for the searches listed afterward. Note that you can include the same special keyword more than once to change the setting, as demonstrated in the Quick Start section. The keywords are:

The with_parent keyword limits the following searches to children of the parent search. For example:

>>> pdf.extract([
     ('with_parent','LTPage[page_index="1"]'),
     ('last_name', ':in_bbox("315,680,395,700")') # only matches elements on page 1
 ])

The with_formatter keyword sets a default formatting function that will be called unless a specific one is supplied. For example:

('with_formatter', lambda match: int(match.text()))

will attempt to convert all of the following search results to integers. If you supply a string instead of a function, it will be interpreted as a method name to call on the pyquery search results. For example, the following two lines are equivalent:

('with_formatter', lambda match: match.text())
('with_formatter', 'text')

If you want to stop filtering results, you can use:

('with_formatter', None)
PDFQuery(   file,
            merge_tags=('LTChar', 'LTAnon'),
            round_floats=True,
            round_digits=3,
            input_text_formatter=None,
            normalize_spaces=True,
            resort=True,
            parse_tree_cacher=None,
            laparams={'all_texts':True, 'detect_vertical':True})

Initialization function. Usually you'll only need to pass in the file (file object or path). The rest of the arguments control preprocessing of the element tree:

  • merge_tags: consecutive runs of these elements will be merged together, with the text of following elements appended to the first element. This is useful for keeping the size of the tree down, but it might help to turn it off if you want to select individual characters regardless of their containers.
  • round_floats and round_digits: if round_floats is True, numbers will be rounded to round_digits places. This is almost always good.
  • input_text_formatter: a function that takes a string and returns a modified string, to be applied to the text content of elements.
  • normalize_spaces: if True (and input_text_formatter isn't otherwise set), sets input_text_formatter to collapse whitespace runs (regex \s+) into a single space.
  • resort: if True, elements will be sorted such that any element fully within the bounding box of another element becomes a child of that element, and elements on the same level are sorted top to bottom, left to right.
  • parse_tree_cacher: an object that knows how to save and load results of parsing a given page range from a given PDF. Pass in FileCache('/tmp/') to save caches to the filesystem.
  • laparams: parameters for the pdfminer.layout.LAParams object used to initialize pdfminer.converter.PDFPageAggregator. Can be dict, LAParams(), or None.
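To make the resort ordering concrete: siblings end up in reading order, top-to-bottom then left-to-right. Here is a hypothetical sketch of such a sort key (not PDFQuery's actual code), with bboxes as (x0, y0, x1, y1) and the y axis growing upward:

```python
def reading_order_key(bbox):
    x0, y0, x1, y1 = bbox
    # Descending top edge (y1) first, then ascending left edge (x0).
    return (-y1, x0)

siblings = [
    (100, 700, 200, 710),   # top row, right
    (50, 700, 90, 710),     # top row, left
    (50, 650, 200, 660),    # second row
]
ordered = sorted(siblings, key=reading_order_key)
# ordered: top-left, then top-right, then the second row
```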
extract(    searches,
            tree=None,
            as_dict=True)

See "Bulk Data Scraping."

  • searches: list of searches to run, each consisting of a keyword, selector, and optional formatting function.
  • tree: pyquery tree to run searches against. By default, targets entire tree loaded by pdf.load()
  • as_dict: if changed to False, will return a list instead of a dict to preserve the order of the results.
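With as_dict=False, each keyword is paired with its result in the order the searches were listed. A sketch of consuming that form (the values here are made up for illustration):

```python
# Hypothetical output of pdf.extract(searches, as_dict=False):
results = [('last_name', 'Michaels'), ('spouse', 'Susan R.'), ('year', 2007)]

keys_in_order = [key for key, value in results]   # search order is preserved
as_mapping = dict(results)                        # collapse to a dict if order no longer matters
```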
load(*page_numbers)

Initialize the pdf.tree and pdf.pq objects. This will be called implicitly by pdf.extract(), but it's more efficient to call it explicitly with just the page numbers you need. Page numbers can be any combination of integers and lists, e.g. pdf.load(0,2,3,[4,5,6],range(10,15)).

You can call pdf.load(None) if for some reason you want to initialize without loading any pages (like you are only interested in the document info).

These are mostly used internally, but might be helpful sometimes ...

get_layout(page)

Given a page number (zero-indexed) or pdfminer PDFPage object, return the LTPage layout object for that page.

get_layouts()

Return list of all layouts (equivalent to calling get_layout() for each page).

get_page(page_number)

Given a page number, return the appropriate pdfminer PDFPage object.

get_pyquery(tree=None, page_numbers=[])

Wrap a given lxml element tree in pyquery. If no tree is supplied, will generate one from given page numbers, or all page numbers.

get_tree(*page_numbers)

Generate an etree for the given page numbers. *page_numbers can be the same form as in load().

pdfquery's People

Contributors: brianglass, gmorada, jcushman, jd, mlissner, nervestaple, sergei-maertens, tudborg


pdfquery's Issues

Total number of pages

How do I find out how many pages a document has?
pages = pdf.doc.catalog['Pages'] responds with PDFObjRef:2
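A PDFObjRef is an unresolved indirect reference; pdfminer's resolve1 follows it. A hedged sketch of counting pages this way (assuming pdf is an already-constructed PDFQuery object; resolve1's import path has moved between pdfminer versions):

```python
from pdfminer.pdftypes import resolve1  # path may differ in older pdfminer releases

pages = resolve1(pdf.doc.catalog['Pages'])   # follow the PDFObjRef to the page tree root
page_count = resolve1(pages['Count'])        # the /Count entry is the total page count
```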

PDFQuery newbie - I need some explanations; sorry if this is not the right place to ask, I couldn't find one

Hello Jcushman,
I read many PDFs. I don't add annotation popups; I only highlight text in yellow. I wanted to extract this highlighted text (with Python / pdfminer / pdfquery) to index it with Whoosh for my studies. I noticed that when text is highlighted, the object created in the PDF file looks like, for example:

20 0 obj
<<
/C [1 1 0]
/F 4
/M (D:20141107203743+01'00')
/P 7 0 R
/T (pibol)
/AP <<
/N 31 0 R
/NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subj (Surligner)
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
807.385508 162.809979 807.385508]
/CreationDate (D:20141107203743+01'00')
>>
endobj

Unlike a classic annotation popup, there is no "/Contents" key here, and that is my problem. I have tried pdfminer, pyPDF, PyPDF2 and now pdfquery, but I am not a very good Python programmer and can't find a way to extract the line I want.

I have 2 questions:
Question 1 - With pdfquery I have tried this:

pdf = pdfquery.PDFQuery("c:\\myDocument.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("the line I want")')
print label

that gives me this :

<LTTextLineHorizontal bbox="[53.999, 313.813, 189.746, 324.91]" height="11.098" width="135.747" 
word_margin="0.1" x0="53.999" x1="189.746" y0="313.813" y1="324.91"><LTTextBoxHorizontal   
bbox="[53.999, 313.813, 189.746, 324.91]" height="11.098" index="10" width="135.747" 
x0="53.999" x1="189.746" y0="313.813" y1="324.91">the line I want </LTTextBoxHorizontal
</LTTextLineHorizontal>

With this I have the coordinates of my text in <LTTextLineHorizontal bbox=....
To test these coordinates I wanted to retrieve the text, and only the text, using the ('with_formatter', 'text') directive explained in your help, but how? I don't understand how to do this:

pdf.extract([('titleParagraf', ':in_bbox("53.999, 313.813, 189.746, 324.91")', ('with_formatter', 'text'))]) ??

Question 2: Is it possible with pdfquery to find text highlighted in yellow and retrieve its coordinates, so I can extract the text with pdf.extract(['aaaa', ':in_bbox(coordinatesOfTheHighlightedText)])?

I hope I am not being too boring and that my explanations are clear enough; English is not my preferred language.
Thanks for your patience, and sorry if this was not the right place to ask for help.

Pibol

KeyError: 'Resources' on some file

When opening a file through my code at this file/repo, the load fails. I don't understand why, because the PDF file seems to be correctly formatted. You can find the file here.

Traceback (most recent call last):
  File "lltToJson.py", line 187, in <module>
    occurences = getFolder()
  File "lltToJson.py", line 173, in getFolder
    occurences[identifier] += getFile(join(path,f))
  File "lltToJson.py", line 111, in getFile
    pdf.load()
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 370, in get_tree
    pages = enumerate(self.get_layouts())
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 470, in get_layouts
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 497, in _cached_pages
    self._pages += list(self._pages_iter)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 518, in get_pages
    yield PDFPage(self, pageid, tree)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 257, in __init__
    self.resources = resolve1(self.attrs['Resources'])
KeyError: 'Resources'

I literally have no idea why it does not work...

Problems running sample code

Hey, I've been trying to get the sample code working all day but I keep running into errors.
First, I tried

import pdfquery

pdf = pdfquery.PDFQuery("pdfs/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')

but it returned
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 237, in call
(not PY3k and isinstance(args[0], basestring) or
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 213, in init
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 223, in _css_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in css_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 208, in selector_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 230, in xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 260, in xpath_function
File "C:\Python27\lib\site-packages\pyquery\cssselectpatch.py", line 196, in xpath_contains_function
def xpath_gt_function(self, xpath, function):
AttributeError: 'XPathExpr' object has no attribute 'add_post_condition'

Finding that this error occurred with cssselect 0.8.0, I downgraded to 0.7.1,

but then typing in
>>> import pdfquery

pdf = pdfquery.PDFQuery("C:/Users/Adam/Documents/visual studio 2012/Projects/PDFtoPythonData/PDFtoPythonData/pdfs/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
name = pdf.pq(':in_bbox("%s,%s,%s,%s,")' % (left_corner,bottom_corner-30, left_corner+150, bottom_corner)).text()

resulted in
Traceback (most recent call last):
File "<pyshell#16>", line 1, in
name = pdf.pq(':in_bbox("%s,%s,%s,%s,")' % (left_corner,bottom_corner-30, left_corner+150, bottom_corner)).text()
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 241, in call
result = self.class(_args, parent=self, *_kwargs)
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 216, in init
xpath = self._css_to_xpath(selector)
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 226, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in css_to_xpath
for selector in selectors)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in
for selector in selectors)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 208, in selector_to_xpath
xpath = self.xpath(tree)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 230, in xpath
return method(parsed_selector)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 259, in xpath_function
"The pseudo-class :%s() is unknown" % function.name)
ExpressionError: The pseudo-class :in_bbox() is unknown

Not sure if downgrading inadvertently broke things.

Text in PDF has an extra LTTextBoxHorizontal whereas similar text elsewhere doesn't

Here's the deal. I did this on a PDF:

pdf.extract([
  ('with_parent', 'LTPage[pageid="1"]'),
  ('name', 'LTTextLineHorizontal:contains("24x7 claims assistance")')
])

I got a [<LTTextLineHorizontal>]. Let's say I assign it to a variable result. Then,

In [61]: result
Out[61]: [<LTTextLineHorizontal>]

In [62]: result[0]
Out[62]: <Element LTTextLineHorizontal at 0x10dc621b0>

In [63]: result[0][0]
Out[63]: <Element LTTextBoxHorizontal at 0x10dc62158>

In the same PDF, I do this:

pdf.extract([
  ('with_parent', 'LTPage[pageid="1"]'),
  ('name', 'LTTextLineHorizontal:contains("9810510983")')
])

Again I got a [<LTTextLineHorizontal>]. Then,

In [70]: result
Out[61]: [<LTTextLineHorizontal>]

In [71]: result[0]
Out[62]: <Element LTTextLineHorizontal at 0x10dc621b0>

In [72]: result[0][0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-64-223da891ff7c> in <module>()
----> 1 result[0][0]

lxml.etree.pyx in lxml.etree._Element.__getitem__ (src/lxml/lxml.etree.c:47744)()

IndexError: list index out of range

I went through the code. I see that you're using _clean_text() in pdfquery.py to keep the text value in the leaf node and erase the value out of its parents.

I'm sorry, but I couldn't find enough time to debug it fully. Does anyone know why this would happen?

Issue with multi page pdf

Hi there,

I am having trouble in this scenario.

The part containing the string I am matching is at the beginning of page 2; when I tried to retrieve the lines below it using the method shown in the README, I got the result from the beginning of page 1 instead.

I am pretty sure this behavior is not intentional and actually worried that I am not using the library right.

Could you take a look and let me know if I am doing it wrong?

Thanks in advance

Documentation on caching

I think there's a small mistake in the documentation for caching: FileCache is imported as 'FileCache' but called as 'pdfquery.FileCache'. This works for me:

from pdfquery.cache import FileCache
pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf", parse_tree_cacher=FileCache("/tmp/"))

(It also seems to be formatted differently from the other examples.)

pdf.pq(':in_bbox') pulling duplicate values

Running the below code on multiple pdfs, the code pulls duplicate values randomly from each box. I examined the .xml file to make sure there weren't two text boxes layered on top of each other, and found no duplicates on any page.

When I say the duplicates are created randomly, I mean that the number of duplicates, which values are duplicated, and the order in which they are pulled into text are random.

I'm curious whether you've seen this before and if there is a fix. It's possible that the pdf's themselves are the problem. Let me know if access to the XML file might help. I can probably strip the sensitive information and send.

Any help would be greatly appreciated!

An example of the text in the box is shown in the attached image. I cannot share the whole pdf due to confidentiality.

#import programs from python libraries
import xlwt
import pdfquery
import csv
import re

pages = raw_input('Please enter the number of pages in the document:    ')

#convert user input to integer
pages = int(pages)

#Path to pdf file for PDFQuery access. PDFQuery is the program that pulls in the data from the pdf
pdf = pdfquery.PDFQuery('D:\New Storage\Coding\Python Projects\Iso Pull\Lack.pdf')

#load pdf to active for PDFQuery
pdf.load(range(0,5))

#cycle through page numbers
for pagenumber in range(0,pages):

    #create a string sub to avoid messiness in the pdf.pq page number callout
    pagesub = 'LTPage[page_index="%s"]' % pagenumber

    #find text in boxes. boxes are inches*72. Lower left corner of box to upper right
    #Also, keep in mind coordinates of BOM and Iso number may need tweaking due to coordinate find

    Item = pdf.pq(pagesub + ' :in_bbox("947.52,379.44,960.48,750.16")').text()
    QTY = pdf.pq(pagesub + ' :in_bbox("960.48,379.44,987.12,750.16")').text()
    Size = pdf.pq(pagesub + ' :in_bbox("987.12,379.44,1020.24,750.16")').text()
    Sch_Minwall = pdf.pq(pagesub + ' :in_bbox("1020.24,379.44,1059.12,750.16")').text()
    Description2 = pdf.pq(pagesub + ' :in_bbox("1059.12,379.44,1203.84,750.16")').text()

'QPDFDocument' object has no attribute 'initialize' error message

I installed pdfquery using pip and also by cloning directly from GitHub, but the error persists. Whenever I try to create a PDFQuery object with pdfquery.PDFQuery("file-name"), I get the following error:

pdf = pdfquery.PDFQuery("/home/bipin/Documents/ProblemAssignment/file.pdf")
Traceback (most recent call last):
File "", line 1, in
File "/home/bipin/src/pdfquery/pdfquery/pdfquery.py", line 187, in init
doc.initialize()
AttributeError: 'QPDFDocument' object has no attribute 'initialize'

I tried a different file and searched the Internet, but could not find a solution. Please help me.

Please help with API usage

First, thank you for a great library.
My question is: how can I extract text if I know the 'figure' name? For example, I need to extract text from an XObject named pssMO3_1. I can write an xml file with a command like pdf.tree.write(fxml, pretty_print=True, encoding="utf-8"), and this file contains all the needed data under the figure name="pssMO3_1" tag:

$ cat out.xml 
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,2097.638,1629.921" rotate="0">
<figure name="pssMO3_1" bbox="1939.906,240.945,1955.906,445.945">
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,245.945,1952.732,251.281" size="5.336">B</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,251.281,1952.732,253.057" size="1.776">l</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,253.057,1952.732,257.505" size="4.448">a</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,257.505,1952.732,261.505" size="4.000">c</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,261.505,1952.732,265.505" size="4.000">k</text>
and so on...

How can I extract the text ('Black ...') using the library API?
Thanks in advance!

unicode problem when processing doc.info

When I used pdfquery to process a scholarly pdf, I found a unicode problem at line 305 of pdfquery.py. The variable 'v' is a str, but it stores escaped unicode characters. For example, v could be '\xfc'; since 'v' is a str, it is literally '\', 'x', 'f', 'c'.
Line 305,

        root.set(k, unicode(v))

would raise a UnicodeDecodeError. I suggest using

        root.set(k, v.decode('unicode-escape'))

Get pageid of a search object

Hi, I am a newbie with Python and pdfquery. I am writing a Python program to extract info from pdf files and then insert it into a Word document. I am having trouble with a particular object: "minor spill". Specifically, I am trying to scrape the content of the paragraph underneath "6.3 Methods and materials for containment and cleaning up" (the content I want is "Contain spillage, and then collect with an electrically protected vacuum cleaner or by wet-brushing and place in container for disposal according to local regulations (see section 13). Keep in suitable, closed containers for disposal.") on page 2 of the pdf file. The problem is that for this particular pdf file, my code also extracts "Product This combustible material may be burned in a chemical incinerator equipped with an afterburner and scrubber. Offer surplus and non-recyclable solutions to a licensed disposal company." on p. 5. Because I want to work with many pdf files that might have the "6.3..." content on different pages, I figure that if I can pass the pageid into the extract call it should be fine.
My question is: is there a way to get the pageid of a matched object (for example, "minor_spill" in my code)?
My code is below and I also attach the pdf file:
https://pastebin.com/rwseBSZV

Thank you very much!
PDF file:
932-66-1.pdf

Can't get coordinates.

Hello
I can't get the coordinates of my text "green-color-2-2-2". My script returns "Red green-color-2-2-2":

import pdfquery
import sys
sys.setrecursionlimit(2000)
pdfpath = sys.argv[1]
inputstr = sys.argv[2]
page = int(sys.argv[3])
pdf = pdfquery.PDFQuery(pdfpath)
pdf.load(page)
label = pdf.pq('LTTextLineHorizontal:contains("'+inputstr+'")')[0].layout
print(label)

response

<LTTextLineHorizontal 167.320,142.577,244.579,157.770 u'Red green-color-2-2-2\n'>

How can I get just the text I need?

PyQuery objects returned by items() have problems

Given a = pdf.pq('LTTextLineHorizontal').items().next()

  1. a.find(':in_bbox("x0,y0,x1,y1")') raises an ExpressionError: The pseudo-class :in_bbox() is unknown
  2. a.parent('LTPage') returns an empty list, even though a.parents().filter(lambda i, a: a.tag == 'LTPage') returns the expected parent (assume here that the LTPage is the direct parent of the element matched by a).

These two calls would have succeeded had a not been a result of the items iterator, like a = pdf.pq('LTTextLineHorizontal[index="13"]')

TypeError: object of type 'PSLiteral' has no len()

Error and stack trace superficially similar to #15

>>> pdf.load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 372, in get_tree
    v = smart_unicode_decode(v)
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 89, in smart_unicode_decode
    detected_encoding = chardet.detect(encoded_string)
  File "/usr/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 64, in feed
    aLen = len(aBuf)
TypeError: object of type 'PSLiteral' has no len()

And it's true - PSLiteral doesn't have a length. The following change at line 366 works:

if type(v) == list:
        v = unicode([smart_unicode_decode(item) for item in v])
elif hasattr(v.__class__, '__len__'):
        v = smart_unicode_decode(v)
else:
        v = smart_unicode_decode(v.name)

I don't know if it's actually the correct thing to do though. Maybe PSLiterals should just be dropped on the floor?

unable to read pdf containing Chinese

I am trying to read a pdf that contains Chinese (this one):

import pdfquery

pdf = pdfquery.PDFQuery("Table_A_17Sep2014.pdf")
pdf.load()

error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-71df8f58767e> in <module>()
      2 
      3 pdf = pdfquery.PDFQuery("Table_A_17Sep2014.pdf")
----> 4 pdf.load()

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in load(self, *page_numbers)
    319             [<LTPage>, <LTPage>]
    320         """
--> 321         self.tree = self.get_tree(*_flatten(page_numbers))
    322         self.pq = self.get_pyquery(self.tree)
    323 

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in get_tree(self, *page_numbers)
    413                     pages = enumerate(self.get_layouts())
    414                 for n, page in pages:
--> 415                     page = self._xmlize(page)
    416                     page.set('page_index', unicode(n))
    417                     page.set('page_label', self.doc.get_page_number(n))

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    467             last = None
    468             for child in node:
--> 469                 child = self._xmlize(child, root)
    470                 if self.merge_tags and child.tag in self.merge_tags:
    471                     if branch.text and child.text in branch.text:

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    467             last = None
    468             for child in node:
--> 469                 child = self._xmlize(child, root)
    470                 if self.merge_tags and child.tag in self.merge_tags:
    471                     if branch.text and child.text in branch.text:

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    467             last = None
    468             for child in node:
--> 469                 child = self._xmlize(child, root)
    470                 if self.merge_tags and child.tag in self.merge_tags:
    471                     if branch.text and child.text in branch.text:

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    448             tags.update( self._getattrs(node, 'colorspace','bits','imagemask','srcsize','stream','name','pts','linewidth') )
    449         elif type(node) == LTChar:
--> 450             tags.update( self._getattrs(node, 'fontname','adv','upright','size') )
    451         elif type(node) == LTPage:
    452             tags.update( self._getattrs(node, 'pageid','rotate') )

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _getattrs(self, obj, *attrs)
    486     def _getattrs(self, obj, *attrs):
    487         """ Return dictionary of given attrs on given object, if they exist, processing through filter_value(). """
--> 488         return dict( (attr, unicode(self._filter_value(getattr(obj, attr)))) for attr in attrs if hasattr(obj, attr))
    489 
    490     def _filter_value(self, val):

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in <genexpr>((attr,))
    486     def _getattrs(self, obj, *attrs):
    487         """ Return dictionary of given attrs on given object, if they exist, processing through filter_value(). """
--> 488         return dict( (attr, unicode(self._filter_value(getattr(obj, attr)))) for attr in attrs if hasattr(obj, attr))
    489 
    490     def _filter_value(self, val):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb7 in position 7: ordinal not in range(128)

Error with annotations

Found an issue when upgrading from pdfquery 0.2.7 to 0.4.3. It looks like support for annotations was added in 0.3.0, and this is what appears to be happening: in the _add_annots() method in pdfquery.py, an annotation object is found by pdfminer. _add_annots() retrieves this object and converts all of its information into strings (via obj_to_string()). When the method is called again, pdfminer returns a cached version of the annotation object, only this time all the information has already been converted into strings by pdfquery. This leads to an error on line 649:

annot['URI'] = resolve1(annot['A'])['URI']

The first time through _add_annots(), resolve1(annot['A']) returns a dict with 'URI' as one of the keys. The second time through, annot['A'] is the string representation of that dict (converted by obj_to_string), so the line fails.

I've attached a PDF file (annot.pdf) to show the problem. This file only has one line of text (a company's home page URL) which is being seen as an annotation.

This error has been found with:

  • pdfquery version 0.3.0, 0.4.x
  • pdfminer 20140328
  • python 2.7.1
  • Fedora Linux 23

If there's any other information that would help, let me know.

Custom selectors don't support partial functions

From my SO question on the same issue.

Background

I'm using pdfquery to parse multiple files like this one.

Problem

I'm trying to write a generalized filter function, building off of the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because custom selectors can't take extra arguments directly, I thought I could get around this by supplying a partial function using functools.partial (as seen below).

Input

import pdfquery
import functools

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = 'Path to the file mentioned above'
pdf = load_file(file_with_table)


def elements_in_range(x1_range):
    return in_range(x1_range[0], x1_range[1], float(this.get('x1',0)))

x1_part = functools.partial(elements_in_range, (95,350))

pdf.pq('LTPage[page_index="0"] *').filter(x1_part)

But when I do that I get the following attribute error;

Output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
    597                     if len(args) == 1:
--> 598                         func_globals(selector)['this'] = this
    599                     if callback(selector, i, this):

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
     28 def func_globals(f):
---> 29     return f.__globals__ if PY3k else f.func_globals
     30 

AttributeError: 'functools.partial' object has no attribute '__globals__'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-74-d75c2c19f74b> in <module>()
     15 x1_part = functools.partial(elements_in_range, (95,350))
     16 
---> 17 pdf.pq('LTPage[page_index="0"] *').filter(x1_part)

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
    600                         elements.append(this)
    601             finally:
--> 602                 f_globals = func_globals(selector)
    603                 if 'this' in f_globals:
    604                     del f_globals['this']

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
     27 
     28 def func_globals(f):
---> 29     return f.__globals__ if PY3k else f.func_globals
     30 
     31 

AttributeError: 'functools.partial' object has no attribute '__globals__'

Is there any way to get around this? Or possibly some other way to write a custom selector for pdfquery that can take arguments?
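One possible workaround (only verified here against the __globals__ check, not against pdfquery itself): build the selector with a closure instead of functools.partial, since a plain nested function always carries the __globals__ attribute that pyquery's filter() needs in order to inject this:

```python
import functools

def make_x1_selector(lo, hi):
    # A nested function has __globals__ (its module's globals), into
    # which pyquery injects `this` before each call; functools.partial
    # objects have no __globals__, hence the AttributeError above.
    def selector(i):
        # `this` is injected into the function's globals by pyquery
        return lo <= float(this.get('x1', 0)) <= hi
    return selector

x1_sel = make_x1_selector(95, 350)
# pdf.pq('LTPage[page_index="0"] *').filter(x1_sel)  # hypothetical usage
```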

@jcushman
If this is a module-level problem, how difficult would it be to fix?

Other than that I'm really enjoying pdfquery. Thanks!

is there any user manual for this

  File "abc.py", line 2, in <module>
    import pdfquery
  File "build\bdist.win32\egg\pdfquery\__init__.py", line 1, in <module>
  File "build\bdist.win32\egg\pdfquery\pdfquery.py", line 31, in <module>
  File "C:\Python27\lib\site-packages\pyquery\__init__.py", line 11, in <module>
    from .pyquery import PyQuery
  File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 9, in <module>
    from lxml import etree
ImportError: DLL load failed: %1 is not a valid Win32 application.

Where'd my links go?

I'm trying to query the links in a document on a court website, but when I look at the XML, the links seem to be gone.

For example, the document I'm working with is here:

http://apps.courts.ky.gov/supreme/casesummaries/May2015.pdf

Not far down that PDF there's a link to:

http://opinions.kycourts.net/sc/2013-SC-000610-MR.pdf

But if I look at the XML (generated with pdf.tree.write('ky.xml', pretty_print=True, encoding='utf-8')), there doesn't seem to be any links. I've posted the XML here:

https://gist.github.com/mlissner/4cb1eb36e347c2dea00a

Any ideas, or is this something pdfquery doesn't support?

Thanks! It's been interesting playing with this.

cc: @brianwc

error with load() order

Hi,
I don't know if it is a bug, but when I try this:

pdf = pdfquery.PDFQuery("d:\Travail\ myPDF.pdf")
document = pdf.load()

I get this result:

document = pdf.load()
  File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 373, in load
  File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 475, in get_tree
  File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 596, in <genexpr>
  File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 591, in get_layout
  File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 639, in _add_annots
TypeError: 'PDFObjRef' object has no attribute '__getitem__'

Bruno

'dict' object has no attribute 'resolve'

Line 123 (master): settings = nums[i+1].resolve()

I have a script which uses pdfquery to grab annotated text. The script works for some PDFs but not others. For the PDFs where it doesn't work, this line is called; for the PDFs where it does work, it is not.

Tried a bit of debugging, but I don't understand this code at all. It happened in version 0.2.3, and I upgraded to see if it would be different, but alas no. Any tips on how to debug this would be great, thanks.

NB: Replacing this line with settings = nums[i+1] stopped the errors and the script worked as expected.
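A defensive version of that line (just a sketch of the guard, not the library's actual fix) would resolve only when the object supports it:

```python
def maybe_resolve(obj):
    # pdfminer sometimes hands back an indirect reference (which needs
    # .resolve()) and sometimes an already-resolved plain dict.
    return obj.resolve() if hasattr(obj, 'resolve') else obj

class FakeRef(object):
    """Stand-in for pdfminer's PDFObjRef, for illustration only."""
    def __init__(self, target):
        self.target = target
    def resolve(self):
        return self.target
```

With this, settings = maybe_resolve(nums[i+1]) would handle both kinds of PDF.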

chardet 3.0 seems to have broken something

In pdfquery.py, in smart_unicode_decode is this:

# detect encoding
detected_encoding = chardet.detect(encoded_string)

With chardet 2.3.0, detected_encoding is {'confidence': 0.0, 'encoding': None}

With chardet 3.0.1 (newest as of time of writing this), detected_encoding is None

So it's crashing on the next line, where it does detected_encoding['encoding'].

Presumably, the fix is as simply as changing:

encoding=detected_encoding['encoding'] or 'utf8',

to

encoding=detected_encoding['encoding'] if (detected_encoding and detected_encoding['encoding']) else 'utf8',
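A guard covering both shapes of chardet output might look like this (a sketch; detected stands for the return value of chardet.detect()):

```python
def pick_encoding(detected):
    # chardet 2.x returns {'confidence': 0.0, 'encoding': None} for
    # undetectable input; chardet 3.x may return None outright.
    if detected and detected.get('encoding'):
        return detected['encoding']
    return 'utf8'
```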

Getting `TypeError: 'PDFObjRef' object is not iterable`

When trying to load a PDF, I get the following error

TypeError: 'PDFObjRef' object is not iterable

The error happens at pdfquery/pdfquery.py line 631

def _add_annots(self, layout, annots):
        """Adds annotations to the layout object
        """
        if annots: # and not isinstance(annots, PDFObjRef):
            for annot in annots:
                annot = annot.resolve()
                if annot.get('Rect') is not None:
                    annot['bbox'] = annot.pop('Rect')  # Rename key
                    annot = self._set_hwxy_attrs(annot)
                try:
                    annot['URI'] = annot['A']['URI']
                except KeyError:
                    pass
                for k, v in annot.iteritems():
                    if not isinstance(v, basestring):
                        annot[k] = unicode_decode_object(v)
                elem = parser.makeelement('Annot', annot)
                layout.add(elem)
        return layout

The error goes away by adding the second check that is commented out from the above code
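Rather than skipping indirect annotation lists entirely, resolving them first would preserve the annotations. A sketch with a stand-in PDFObjRef class:

```python
class PDFObjRef(object):
    """Stand-in for pdfminer's PDFObjRef, for illustration only."""
    def __init__(self, target):
        self._target = target
    def resolve(self):
        return self._target

def normalize_annots(annots):
    # A page's Annots entry may be a direct list, an indirect reference
    # to one, or missing altogether; return a plain list in every case.
    if isinstance(annots, PDFObjRef):
        annots = annots.resolve()
    return annots or []
```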

Initialise PDFQuery from PDF contents

Is it possible to initialise PDFQuery directly from the byte contents of a PDF.
My use case is of a server where the PDF is uploaded and saved in a database as a blob (technically in MongoDB GridFS). The content of the PDF is available to me in memory.
Currently I have created a class to act as a proxy for a file object.

class PseudoPDFFile(object):
    """
    Offers a pseudo-file interface for pdfquery to load the PDF from memory
    """
    def __init__(self, content):
        self.content = content

    def read(self):
        return self.content

Is there a way to avoid it?
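One thing the proxy class above is missing is seek(), which pdfminer needs in order to jump around the file. io.BytesIO provides the full file interface over in-memory bytes, so (assuming PDFQuery passes non-string arguments straight through to pdfminer as file objects) this should work without a custom class:

```python
import io

pdf_bytes = b"%PDF-1.4\n..."  # e.g. the blob fetched from GridFS

buf = io.BytesIO(pdf_bytes)   # supports read() AND seek(), unlike PseudoPDFFile
# pdf = pdfquery.PDFQuery(buf)  # hypothetical usage; untested here
header = buf.read(8)
buf.seek(0)  # rewind, as pdfminer will
```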

Pseudo classes not working

:first :last :even :odd :eq :lt :gt :checked :selected :file

Pseudo-classes don't work when I try to use them like this:

pdf.pq('LTTextLineHorizontal:last')

Error while loading a document

While loading a document, using PDFQuery.load(), I got the following error

    354                         objid = spec.objid
    355                     spec = dict_value(spec)
--> 356                     self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
    357             elif k == 'ColorSpace':
    358                 for (csid, spec) in dict_value(v).iteritems():

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdfinterp.pyc in get_font(self, objid, spec)
    202                     if k in spec:
    203                         subspec[k] = resolve1(spec[k])
--> 204                 font = self.get_font(None, subspec)
    205             else:
    206                 if STRICT:

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdfinterp.pyc in get_font(self, objid, spec)
    193             elif subtype in ('CIDFontType0', 'CIDFontType2'):
    194                 # CID Font
--> 195                 font = PDFCIDFont(self, spec)
    196             elif subtype == 'Type0':
    197                 # Type0 Font

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdffont.pyc in __init__(self, rsrcmgr, spec)
    663             self.fontfile = stream_value(descriptor.get('FontFile2'))
    664             ttf = TrueTypeFont(self.basefont,
--> 665                                BytesIO(self.fontfile.get_data()))
    666         self.unicode_map = None
    667         if 'ToUnicode' in spec:

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdffont.pyc in __init__(self, name, fp)
    384         (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
    385         for _ in xrange(ntables):
--> 386             (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    387             self.tables[name] = (offset, length)
    388         return

error: unpack requires a string argument of length 16

error: invalid command 'bdist_wheel'

Using python 3.5.3 on Linux I get errors about missing bdist_wheel when installing pdfquery like this:

python3 -m venv .pyvenv
source .pyvenv/bin/activate
pip install pdfquery

When I do pip install wheel before installing pdfquery, setup outputs no errors. So should wheel be added to the dependencies?

The output I get when omitting pip install wheel:

$ pip install pdfquery
Collecting pdfquery
  Using cached pdfquery-0.4.3.tar.gz
Collecting cssselect>=0.7.1 (from pdfquery)
  Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting chardet (from pdfquery)
  Using cached chardet-3.0.4-py2.py3-none-any.whl
Collecting lxml>=3.0 (from pdfquery)
  Using cached lxml-4.0.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting pdfminer.six (from pdfquery)
  Using cached pdfminer.six-20170720.tar.gz
Collecting pyquery>=1.2.2 (from pdfquery)
  Using cached pyquery-1.2.17-py2.py3-none-any.whl
Collecting roman>=1.4.0 (from pdfquery)
  Using cached roman-2.0.0.zip
Collecting six (from pdfminer.six->pdfquery)
  Using cached six-1.11.0-py2.py3-none-any.whl
Collecting pycryptodome (from pdfminer.six->pdfquery)
  Using cached pycryptodome-3.4.7.tar.gz
Building wheels for collected packages: pdfquery, pdfminer.six, roman, pycryptodome
  Running setup.py bdist_wheel for pdfquery ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pdfquery/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpgbpb2kiapip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for pdfquery
  Running setup.py clean for pdfquery
  Running setup.py bdist_wheel for pdfminer.six ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pdfminer.six/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpliwxpa75pip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for pdfminer.six
  Running setup.py clean for pdfminer.six
  Running setup.py bdist_wheel for roman ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/roman/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpf7270_kjpip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for roman
  Running setup.py clean for roman
  Running setup.py bdist_wheel for pycryptodome ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pycryptodome/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpd5nhx36kpip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for pycryptodome
  Running setup.py clean for pycryptodome
Failed to build pdfquery pdfminer.six roman pycryptodome
Installing collected packages: cssselect, chardet, lxml, six, pycryptodome, pdfminer.six, pyquery, roman, pdfquery
  Running setup.py install for pycryptodome ... done
  Running setup.py install for pdfminer.six ... done
  Running setup.py install for roman ... done
  Running setup.py install for pdfquery ... done
Successfully installed chardet-3.0.4 cssselect-1.0.1 lxml-4.0.0 pdfminer.six-20170720 pdfquery-0.4.3 pycryptodome-3.4.7 pyquery-1.2.17 roman-2.0.0 six-1.11.0

Large Memory Usage

I have a very large PDF (about 1000 pages). Since I didn't think it would be wise to load the entire PDF into memory at once, I decided to iterate over the pages, calling pdf.load() on each page individually, thinking this would load only one page at a time. However, memory usage continues to grow every time pdf.load() is called, as if the previous data is not being released. Any ideas? I'm running out of memory (16GB) after about 400 pages.

use two or more consecutive 'in_bbox'

Hi guys! o/
I want to know if there is a way to execute one 'in_bbox' followed by another 'in_bbox'. For example:

first_bbox = pdf_query.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
second_bbox = first_bbox.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
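Since :in_bbox tests against absolute page coordinates rather than coordinates relative to an earlier match, chaining two box constraints should be equivalent to a single query over the intersection of the two boxes. A sketch (intersect_bbox is a helper written here, not part of pdfquery):

```python
def intersect_bbox(a, b):
    # Each bbox is (x0, y0, x1, y1); returns the overlap, or None if
    # the two boxes don't intersect.
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)

box = intersect_bbox((0, 0, 200, 100), (50, 20, 300, 80))
# pdf_query.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % box)
```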

Save pdf

Any chance of saving a pdf in the pdf format instead of pdfxml? I would really prefer to use pdfquery in pdfparanoia rather than manual text manipulation of pdf streams.

trying to run the example sample code.

0 down vote favorite

I just installed pdfquery in my machine, and I'm trying to run the example sample code:

import pdfquery
pdf = pdfquery.PDFQuery("examples/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
name = pdf.pq(':in_bbox("%s, %s, %s, %s")' % (left_corner, bottom_corner-30, left_corner+150, bottom_corner)).text()
print name

the problem is that I get this error

Traceback (most recent call last):
  File "testePdfQuery.py", line 1, in <module>
    import pdfquery
  File "/home/ubuntu/Downloads/pdfquery-0.1.3/pdfquery/__init__.py", line 1, in <module>
    from .pdfquery import PDFQuery
  File "/home/ubuntu/Downloads/pdfquery-0.1.3/pdfquery/pdfquery.py", line 23, in <module>
    cssselect.Function._xpath_in_bbox = _xpath_in_bbox
AttributeError: 'module' object has no attribute 'Function'

any ideas how I can fix this and run the example? Thanks in advance.

lxml.etree.XPathEvalError: Invalid expression

I can't get the example from the README working.

This is what I have done:

$ sudo easy_install pip
$ sudo pip install pdfquery
$ wget https://raw.github.com/jcushman/pdfquery/master/examples/sample.pdf
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("sample.pdf")
>>> pdf.load()
>>> label = pdf.pq(':contains("Your first name and initial")')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
    result = self.__class__(*args, parent=self, **kwargs)
  File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
    for tag in elements]
  File "lxml.etree.pyx", line 1444, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:41726)
  File "xpath.pxi", line 321, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:117867)
  File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:117044)
  File "xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116913)
lxml.etree.XPathEvalError: Invalid expression

I'm using Mac OS X 10.7.4. Output from pip freeze at https://gist.github.com/3082390 if that can help in any way. (I'm not a Python guy.)

pdf.load() in pdfquery.py - 'dict' object has no attribute 'resolve'

Using Python version 3.5 (w/ Anaconda 2.4.0); sorry I don't have much more to add than a bug report. I've just been looking for something in Python 3.x to convert a PDF into text and preserving its layout (a la pdftotext from poppler)...so pdfquery is probably beyond my plaintext needs. But figured you'd be interested in knowing.

Reproducible code:

curl \
  https://static.googleusercontent.com/media/www.google.com/en//selfdrivingcar/files/reports/report-0515.pdf \
  -o g.pdf
import pdfquery
pdf = pdfquery.PDFQuery("g.pdf")
pdf.load()
AttributeError                            Traceback (most recent call last)
<ipython-input-3-4357470f507b> in <module>()
----> 1 pdf.load()

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in load(self, *page_numbers)
    381         [<LTPage>, <LTPage>]
    382         """
--> 383         self.tree = self.get_tree(*_flatten(page_numbers))
    384         self.pq = self.get_pyquery(self.tree)
    385 

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in get_tree(self, *page_numbers)
    483                 else:
    484                     pages = enumerate(self.get_layouts())
--> 485                 for n, page in pages:
    486                     page = self._xmlize(page)
    487                     page.set('page_index', obj_to_string(n))

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in <genexpr>(.0)
    604     def get_layouts(self):
    605         """ Get list of PDFMiner Layout objects for each page. """
--> 606         return (self.get_layout(page) for page in self._cached_pages())
    607 
    608     def _cached_pages(self, target_page=-1):

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in get_layout(self, page)
    599         self.interpreter.process_page(page)
    600         layout = self.device.get_result()
--> 601         layout = self._add_annots(layout, page.annots)
    602         return layout
    603 

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in _add_annots(self, layout, annots)
    642                 annots = annots.resolve()
    643             for annot in annots:
--> 644                 annot = annot.resolve()
    645                 if annot.get('Rect') is not None:
    646                     annot['bbox'] = annot.pop('Rect')  # Rename key

AttributeError: 'dict' object has no attribute 'resolve'

438 page PDF takes ~700 sec and ~4GB RAM to load

http://www.atmel.com/Images/Atmel-7766-8-bit-AVR-ATmega16U4-32U4_Datasheet.pdf

During these 700 seconds, only one core is working on 100%, would it be possible to cut the work in 8 in my case by letting the rest of the cores in on the fun?

Is this expected load time? This is using a Intel Core i7-7700K 4 cores with HT => 8 threads, 16GB ram, macOS Sierra, Python 3.6.

However.. using FileCache brings down subsequent runs to 0.8 seconds load time.

Perhaps you could provide some expected performance metrics in the README to quantify "runs very slowly"?

pdf = pdfquery.PDFQuery(args.datasheet, parse_tree_cacher=FileCache("/tmp/"))
t0 = time.time()
pdf.load()
t1 = time.time()

print("Loaded in {} seconds".format(t1-t0))

ValueError: Invalid attribute name u'AAPL:AKExtras'

Processing a PDF with annotations that have a colon in their key value gives an exception:

Traceback (most recent call last):
  File "test_ocr.py", line 633, in test_petition
    analyze = analyze_bankruptcy_petition(pdf_txt = pdf_txt, pdf_fp = file)
  File "program.py", line 255, in analyze_bankruptcy_petition
    pdfq.load(*pages_to_analyze)
  File "..\libs\pdfquery\pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "..\libs\pdfquery\pdfquery.py", line 484, in get_tree
    _flatten(page_numbers)]
  File "..\libs\pdfquery\pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "..\libs\pdfquery\pdfquery.py", line 663, in _add_annots
    elem = parser.makeelement('Annot', annot)
  File "parser.pxi", line 878, in lxml.etree._BaseParser.makeelement (src/lxml/lxml.etree.c:74798)
  File "apihelpers.pxi", line 156, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12231)
  File "apihelpers.pxi", line 144, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12106)
  File "apihelpers.pxi", line 298, in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:13603)
  File "apihelpers.pxi", line 1554, in lxml.etree._attributeValidOrRaise (src/lxml/lxml.etree.c:24197)
ValueError: Invalid attribute name u'AAPL:AKExtras'

pdf.load() ValueError on pages with unicode.

I've been trying to load up this PDF.
Pages 1 and 2 load fine, while pages 3 and 4 give:

  File "/home/reb/project/rowreader.py", line 62, in extract_rows
    self.pdf.load(page)  # page 2 in this case (which is page 3 in pdf)
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 373, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 476, in get_tree
    page = self._xmlize(page)
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 541, in _xmlize
    child = self._xmlize(child, root)
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 535, in _xmlize
    branch.text = node.get_text()
  File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:55347)
  File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:24667)
  File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:24516)
  File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32441)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

It's really peculiar; the only visible difference between pages 1-2 and 3-4 is that pages 3-4 have unicode stars. Could they be the characters that break the lxml tree load?

Documentation correction

LTPage[page_index=1] should be LTPage[page_index="1"], in the few places it is mentioned
thank you.

'PDFObjRef' object has no attribute '__getitem__'

Hello,

I'm trying to parse some PDF files using pdfquery, and it seems that for a couple of PDFs (not all of them) I receive the following error:

File "my_path/my_script.py", line 244, in set_description pdf.load()
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 373, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 475, in get_tree
    for n, page in pages:
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 596, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 591, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 639, in _add_annots
    annot['URI'] = annot['A']['URI']
TypeError: 'PDFObjRef' object has no attribute '__getitem__'

Below is a list with just a couple of PDFs that raise the above error:
http://www.genomecanada.ca/medias/pdf/en/genomesciencescentrebc.pdf
http://www.genomecanada.ca/medias/pdf/fr/genomesciencescentrebc.pdf
http://www.genomecanada.ca/medias/pdf/en/universityvictoria.pdf
http://www.genomecanada.ca/medias/pdf/fr/universityvictoria.pdf
http://www.genomecanada.ca/medias/pdf/fr/centreforappliedgenomicsogi.pdf

Maybe someone will be able to find a fix for it?

Thanks!

Syntax Error on Python 2.6.6

Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import pdfquery
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pdfquery/__init__.py", line 1, in <module>
    from .pdfquery import PDFQuery
  File "pdfquery/pdfquery.py", line 45
    _comp_bbox_keys_required = {'x0', 'x1', 'y0', 'y1'}
                               ^
SyntaxError: invalid syntax

pdf query not catching some text in page

I am using the following code

tax = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
col['tax']['start'], line, col['tax']['end'], line+10)).text()

where I am expecting to catch text like '8 G1' or '32 G1'.
Here it catches the value '32 G1' but not '8 G1';
in fact, no single-digit value is caught here.
'589 TKTT 1253925356 14APR17 FVVV D CA 4,440 3,425 8 G1 450 YQ
75 YR 3,386 1.01 39 0.00'
That is what my line in the PDF looks like.
It catches values at that position before and after this line, but not here and in situations like this one.
Please help with it.
Mayuresh A
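One likely explanation: :in_bbox only matches elements that fall entirely inside the given box, and the text box for a single digit is much narrower, so it can end up partly outside the queried region. If your pdfquery version supports it, the :overlaps_bbox selector matches anything that intersects the box instead (coordinates below are hypothetical stand-ins for col['tax']['start'], line, etc.):

```python
# Build the selector string; with a loaded PDFQuery you would then run
# tax = pdf.pq(selector).text(), which is untested here.
x0, y0, x1, y1 = 10, 20, 110, 30
selector = 'LTTextLineHorizontal:overlaps_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1)
```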

TypeError: object of type 'PDFObjRef' has no len()

I think this time it is your Python and not pdfminer. (Let's hope?) File available here

Traceback (most recent call last):
  File "lltToJson.py", line 521, in <module>
    main(sys.argv[1:])
  File "lltToJson.py", line 494, in main
    occurences = llt.getFolder()
  File "lltToJson.py", line 227, in getFolder
    occurences[identifier] += self.getFile(join(path,f))
  File "lltToJson.py", line 164, in getFile
    pdf.load()
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 365, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 89, in smart_unicode_decode
    detected_encoding = chardet.detect(encoded_string)
  File "/usr/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 64, in feed
    aLen = len(aBuf)
TypeError: object of type 'PDFObjRef' has no len()
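The metadata value handed to chardet here appears to be an unresolved pdfminer indirect reference (`PDFObjRef`) rather than a string. The usual fix is to dereference such values before treating them as bytes; pdfminer's `PDFObjRef` exposes a `resolve()` method (see also `pdfminer.pdftypes.resolve1`). A minimal sketch, using a stub class in place of a real `PDFObjRef`:

```python
# Sketch: repeatedly dereference indirect PDF object references before
# using the value as bytes. Anything exposing .resolve() is treated as a
# reference, mirroring pdfminer's PDFObjRef interface.

def deref(value):
    while hasattr(value, 'resolve'):
        value = value.resolve()
    return value

# Stand-in for a PDFObjRef (possibly chained) pointing at a bytes value:
class FakeObjRef(object):
    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target

raw = FakeObjRef(FakeObjRef(b'Some Title'))
print(deref(raw))  # b'Some Title'
```

Applying `deref` to each metadata value before the `chardet.detect(...)` call in `smart_unicode_decode` would avoid this `TypeError`.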

'PDFObjRef' object does not support indexing

import pdfquery
import sys

pdf = pdfquery.PDFQuery(sys.argv[1])
pdf.load()

Traceback (most recent call last):
  File "bin/parse_pdf.py", line 6, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
    for n, page in pages:
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 647, in _add_annots
    annot = self._set_hwxy_attrs(annot)
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 665, in _set_hwxy_attrs
    attr['x0'] = bbox[0]
TypeError: 'PDFObjRef' object does not support indexing
