jcushman / pdfquery

A fast and friendly PDF scraping library.

License: MIT License

Python 100.00%

pdfquery's Issues

Initialise PDFQuery from PDF contents

Is it possible to initialise PDFQuery directly from the byte contents of a PDF?
My use case is a server where the PDF is uploaded and saved in a database as a blob (technically in MongoDB GridFS). The content of the PDF is available to me in memory.
Currently I have created a class to act as a proxy for a file object:

class PseudoPDFFile(object):
    """
    Offers a pseudo-file interface for pdfquery to load the PDF from memory
    """
    def __init__(self, content):
        self.content = content

    def read(self):
        return self.content

Is there a way to avoid this?
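One way to avoid the hand-written proxy may be the standard library's io.BytesIO, which wraps bytes in an object that supports read, seek and tell. The sketch below shows only the wrapping. It assumes, without having confirmed it here, that PDFQuery accepts an open file object in place of a path; pdf_bytes is a placeholder for the blob fetched from the database.

```python
import io

# Placeholder for the PDF blob read from the database (hypothetical content).
pdf_bytes = b"%PDF-1.4\n% rest of the document would follow here\n"

# BytesIO gives the bytes a full file interface: read(), seek(), tell().
pdf_file = io.BytesIO(pdf_bytes)

header = pdf_file.read(8)   # parsers typically read and seek freely
pdf_file.seek(0)            # rewind before handing the object to a parser

# Assumption: PDFQuery takes a file-like object as well as a filename.
# pdf = pdfquery.PDFQuery(pdf_file)
# pdf.load()
```

Unlike the proxy class above, this object can also be rewound and read in chunks, which a PDF parser is likely to need.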

PyQuery objects returned by items() have problems

Given a = pdf.pq('LTTextLineHorizontal').items().next()

a.find(':in_bbox("x0,y0,x1,y1")') raises an ExpressionError: The pseudo-class :in_bbox() is unknown
a.parent('LTPage') returns an empty list, even though a.parents().filter(lambda i, a: a.tag == 'LTPage') returns the expected parent (assume here that the LTPage is the direct parent of the element matched by a).

These two calls would have succeeded had a not been a result of the items iterator, like a = pdf.pq('LTTextLineHorizontal[index="13"]')

Python 3 compatibility

Is there a plan to make pdfquery compatible with Python 3?

Installing pdfquery should install pdfminer.six library as a dependency

When installing pdfquery, pdfminer version 20140328 is installed as a dependency. However, pdfminer.six should be installed instead.

Error while loading a document

While loading a document using PDFQuery.load(), I got the following error:

    354                         objid = spec.objid
    355                     spec = dict_value(spec)
--> 356                     self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
    357             elif k == 'ColorSpace':
    358                 for (csid, spec) in dict_value(v).iteritems():

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdfinterp.pyc in get_font(self, objid, spec)
    202                     if k in spec:
    203                         subspec[k] = resolve1(spec[k])
--> 204                 font = self.get_font(None, subspec)
    205             else:
    206                 if STRICT:

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdfinterp.pyc in get_font(self, objid, spec)
    193             elif subtype in ('CIDFontType0', 'CIDFontType2'):
    194                 # CID Font
--> 195                 font = PDFCIDFont(self, spec)
    196             elif subtype == 'Type0':
    197                 # Type0 Font

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdffont.pyc in __init__(self, rsrcmgr, spec)
    663             self.fontfile = stream_value(descriptor.get('FontFile2'))
    664             ttf = TrueTypeFont(self.basefont,
--> 665                                BytesIO(self.fontfile.get_data()))
    666         self.unicode_map = None
    667         if 'ToUnicode' in spec:

/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdffont.pyc in __init__(self, name, fp)
    384         (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
    385         for _ in xrange(ntables):
--> 386             (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    387             self.tables[name] = (offset, length)
    388         return

error: unpack requires a string argument of length 16

Error with annotations

Found an issue when upgrading from pdfquery 0.2.7 to 0.4.3. Looks like starting in 0.3.0, support for annotations was added. This is what appears to be happening. In the _add_annots() method in pdfquery.py, an annotation object is found by pdfminer. _add_annots() retrieves this object and converts all information into strings (via obj_to_string()). This method is called again and pdfminer returns a cached version of the annotation object, only this time, all the information has been converted into strings by pdfquery. This leads to an error on line 649:

annot['URI'] = resolve1(annot['A'])['URI']

The first time through _add_annots(), resolve1(annot['A']) returns a dict with 'URI' being one of the keys. On the second time through, annot['A'] is a string representation (converted by obj_to_string) of that dict and so the line fails.
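The failure mode described above can be reproduced in isolation. This is only a sketch with a made-up cache and helper names (get_annot, stringify_in_place, stringify_copy); it is not pdfminer's or pdfquery's actual code. It shows why converting a cached object in place breaks the second lookup, and that converting a copy does not.

```python
# Hypothetical stand-in for a parser cache that hands back the same object
# on every lookup; this is not pdfminer's or pdfquery's actual code.
_cache = {"annot": {"A": {"URI": "http://example.com"}}}


def get_annot():
    return _cache["annot"]


def stringify_in_place(annot):
    """Convert every value to a string, mutating the cached dict."""
    for key in list(annot):
        annot[key] = str(annot[key])
    return annot


def stringify_copy(annot):
    """Build a new dict of strings and leave the cached dict untouched."""
    return {key: str(value) for key, value in annot.items()}


# Working on a copy: the cached object still holds a dict under "A".
stringify_copy(get_annot())
first_uri = get_annot()["A"]["URI"]

# Mutating in place: the next lookup finds a string where a dict is expected.
stringify_in_place(get_annot())
try:
    get_annot()["A"]["URI"]
    second_pass_failed = False
except TypeError:
    second_pass_failed = True
```

If the diagnosis above is right, stringifying a copy of each annotation rather than the cached object would be one possible direction for a fix.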

I've attached a PDF file (annot.pdf) to show the problem. This file only has one line of text (a company's home page URL) which is being seen as an annotation.

This error has been found with:

pdfquery version 0.3.0, 0.4.x
pdfminer 20140328
python 2.7.1
Fedora Linux 23

If there's any other information that would help, let me know.

unable to read pdf containing Chinese

I am trying to read a pdf that contains Chinese (this one):

import pdfquery

pdf = pdfquery.PDFQuery("Table_A_17Sep2014.pdf")
pdf.load()

error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-71df8f58767e> in <module>()
      2 
      3 pdf = pdfquery.PDFQuery("Table_A_17Sep2014.pdf")
----> 4 pdf.load()

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in load(self, *page_numbers)
    319             [<LTPage>, <LTPage>]
    320         """
--> 321         self.tree = self.get_tree(*_flatten(page_numbers))
    322         self.pq = self.get_pyquery(self.tree)
    323 

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in get_tree(self, *page_numbers)
    413                     pages = enumerate(self.get_layouts())
    414                 for n, page in pages:
--> 415                     page = self._xmlize(page)
    416                     page.set('page_index', unicode(n))
    417                     page.set('page_label', self.doc.get_page_number(n))

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    467             last = None
    468             for child in node:
--> 469                 child = self._xmlize(child, root)
    470                 if self.merge_tags and child.tag in self.merge_tags:
    471                     if branch.text and child.text in branch.text:

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    467             last = None
    468             for child in node:
--> 469                 child = self._xmlize(child, root)
    470                 if self.merge_tags and child.tag in self.merge_tags:
    471                     if branch.text and child.text in branch.text:

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    467             last = None
    468             for child in node:
--> 469                 child = self._xmlize(child, root)
    470                 if self.merge_tags and child.tag in self.merge_tags:
    471                     if branch.text and child.text in branch.text:

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
    448             tags.update( self._getattrs(node, 'colorspace','bits','imagemask','srcsize','stream','name','pts','linewidth') )
    449         elif type(node) == LTChar:
--> 450             tags.update( self._getattrs(node, 'fontname','adv','upright','size') )
    451         elif type(node) == LTPage:
    452             tags.update( self._getattrs(node, 'pageid','rotate') )

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _getattrs(self, obj, *attrs)
    486     def _getattrs(self, obj, *attrs):
    487         """ Return dictionary of given attrs on given object, if they exist, processing through filter_value(). """
--> 488         return dict( (attr, unicode(self._filter_value(getattr(obj, attr)))) for attr in attrs if hasattr(obj, attr))
    489 
    490     def _filter_value(self, val):

/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in <genexpr>((attr,))
    486     def _getattrs(self, obj, *attrs):
    487         """ Return dictionary of given attrs on given object, if they exist, processing through filter_value(). """
--> 488         return dict( (attr, unicode(self._filter_value(getattr(obj, attr)))) for attr in attrs if hasattr(obj, attr))
    489 
    490     def _filter_value(self, val):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb7 in position 7: ordinal not in range(128)

error: invalid command 'bdist_wheel'

Using python 3.5.3 on Linux I get errors about missing bdist_wheel when installing pdfquery like this:

python3 -m venv .pyvenv
source .pyvenv/bin/activate
pip install pdfquery

When I do pip install wheel before installing pdfquery, setup outputs no errors. So should wheel be added to the dependencies?

The output I get when omitting pip install wheel:

$ pip install pdfquery
Collecting pdfquery
  Using cached pdfquery-0.4.3.tar.gz
Collecting cssselect>=0.7.1 (from pdfquery)
  Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting chardet (from pdfquery)
  Using cached chardet-3.0.4-py2.py3-none-any.whl
Collecting lxml>=3.0 (from pdfquery)
  Using cached lxml-4.0.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting pdfminer.six (from pdfquery)
  Using cached pdfminer.six-20170720.tar.gz
Collecting pyquery>=1.2.2 (from pdfquery)
  Using cached pyquery-1.2.17-py2.py3-none-any.whl
Collecting roman>=1.4.0 (from pdfquery)
  Using cached roman-2.0.0.zip
Collecting six (from pdfminer.six->pdfquery)
  Using cached six-1.11.0-py2.py3-none-any.whl
Collecting pycryptodome (from pdfminer.six->pdfquery)
  Using cached pycryptodome-3.4.7.tar.gz
Building wheels for collected packages: pdfquery, pdfminer.six, roman, pycryptodome
  Running setup.py bdist_wheel for pdfquery ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pdfquery/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpgbpb2kiapip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for pdfquery
  Running setup.py clean for pdfquery
  Running setup.py bdist_wheel for pdfminer.six ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pdfminer.six/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpliwxpa75pip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for pdfminer.six
  Running setup.py clean for pdfminer.six
  Running setup.py bdist_wheel for roman ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/roman/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpf7270_kjpip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for roman
  Running setup.py clean for roman
  Running setup.py bdist_wheel for pycryptodome ... error
  Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pycryptodome/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpd5nhx36kpip-wheel- --python-tag cp35:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help
  
  error: invalid command 'bdist_wheel'
  
  ----------------------------------------
  Failed building wheel for pycryptodome
  Running setup.py clean for pycryptodome
Failed to build pdfquery pdfminer.six roman pycryptodome
Installing collected packages: cssselect, chardet, lxml, six, pycryptodome, pdfminer.six, pyquery, roman, pdfquery
  Running setup.py install for pycryptodome ... done
  Running setup.py install for pdfminer.six ... done
  Running setup.py install for roman ... done
  Running setup.py install for pdfquery ... done
Successfully installed chardet-3.0.4 cssselect-1.0.1 lxml-4.0.0 pdfminer.six-20170720 pdfquery-0.4.3 pycryptodome-3.4.7 pyquery-1.2.17 roman-2.0.0 six-1.11.0

pdf.pq( :inbbox) pulling duplicate values

When I run the code below on multiple PDFs, it pulls duplicate values seemingly at random from each box. I examined the .XML file to make sure there weren't two text boxes layered upon each other, and found no instances of duplicates for each page.

When I say the duplicates are created randomly, I mean that the number of duplicates, which values are duplicated, and the order in which they are pulled into text are random.

I'm curious whether you've seen this before and if there is a fix. It's possible that the PDFs themselves are the problem. Let me know if access to the XML file might help. I can probably strip the sensitive information and send it.

Any help would be greatly appreciated!

An example of the text in the box is that shown in the below image. I cannot share the whole pdf due to confidentiality.

#import programs from python libraries
import xlwt
import pdfquery
import csv
import re

pages = raw_input('Please enter the number of pages in the document:    ')

#convert user input to integer
pages = int(pages)

#Path to pdf file for PDFQuery access. PDFQuery is the program that pulls in the data from the pdf
#(raw string so the backslashes in the Windows path are never read as escape sequences)
pdf = pdfquery.PDFQuery(r'D:\New Storage\Coding\Python Projects\Iso Pull\Lack.pdf')

#load pdf to active for PDFQuery
pdf.load(range(0,5))

#cycle through page numbers
for pagenumber in range(0,pages):

    #create a string sub to avoid messiness in the pdf.pq page number callout
    pagesub = 'LTPage[page_index="%s"]' % pagenumber

    #find text in boxes. boxes are inches*72. Lower left corner of box to upper right
    #Also, keep in mind coordinates of BOM and Iso number may need tweaking due to coordinate find

    Item = pdf.pq(pagesub + ' :in_bbox("947.52,379.44,960.48,750.16")').text()
    QTY = pdf.pq(pagesub + ' :in_bbox("960.48,379.44,987.12,750.16")').text()
    Size = pdf.pq(pagesub + ' :in_bbox("987.12,379.44,1020.24,750.16")').text()
    Sch_Minwall = pdf.pq(pagesub + ' :in_bbox("1020.24,379.44,1059.12,750.16")').text()
    Description2 = pdf.pq(pagesub + ' :in_bbox("1059.12,379.44,1203.84,750.16")').text()

pdf.load() ValueError on pages with unicode

I tried to load a PDF and received the following error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
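The message refers to characters that XML 1.0 does not allow, such as NULL bytes and most control characters. As a sketch of one possible workaround, the hypothetical helper below (strip_invalid_xml_chars is not part of pdfquery) removes such characters from a string. Where it would have to be applied inside the loading code is not something this report establishes.

```python
import re

# Everything outside the XML 1.0 "Char" production: NULL bytes, most control
# characters, lone surrogates and the two non-characters U+FFFE / U+FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with all characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


cleaned = strip_invalid_xml_chars("a\x00b\x0bc\tz\n")
```

Tabs, newlines and ordinary non-ASCII text pass through unchanged; only the forbidden code points are dropped.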

'QPDFDocument' object has no attribute 'initialize' error message

I installed pdfquery using pip and also by cloning directly from GitHub, but the error persists. Whenever I try to open a PDF using pdfquery.PDFQuery("file-name"), it shows the following error:

pdf = pdfquery.PDFQuery("/home/bipin/Documents/ProblemAssignment/file.pdf")
Traceback (most recent call last):
File "", line 1, in
File "/home/bipin/src/pdfquery/pdfquery/pdfquery.py", line 187, in __init__
doc.initialize()
AttributeError: 'QPDFDocument' object has no attribute 'initialize'

I tried using a different file and searched the Internet but could not find a solution. Please help me.

TypeError: object of type 'PDFObjRef' has no len()

I think this time it is your python and not pdfminer. (Let's hope ?) File available here

Traceback (most recent call last):
  File "lltToJson.py", line 521, in <module>
    main(sys.argv[1:])
  File "lltToJson.py", line 494, in main
    occurences = llt.getFolder()
  File "lltToJson.py", line 227, in getFolder
    occurences[identifier] += self.getFile(join(path,f))
  File "lltToJson.py", line 164, in getFile
    pdf.load()
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 365, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 89, in smart_unicode_decode
    detected_encoding = chardet.detect(encoded_string)
  File "/usr/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 64, in feed
    aLen = len(aBuf)
TypeError: object of type 'PDFObjRef' has no len()

CJK languages supported?

Does pdfquery support CJK languages?

Large Memory Usage

I have a very large PDF (about 1000 pages). Since I didn't think it would be wise to load the entire PDF into memory at the same time, I decided to iterate over the pages, calling pdf.load on each page individually thinking this would only load one page in at a time. However, it seems that memory usage continues to grow every time pdf.load is called, like the previous data is not being released. Any ideas? I'm running out of memory (16GB) after about 400 pages.

trying to run the example sample code.

I just installed pdfquery on my machine, and I'm trying to run the example sample code:

import pdfquery
pdf = pdfquery.PDFQuery("examples/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
name = pdf.pq(':in_bbox("%s, %s, %s, %s")' % (left_corner, bottom_corner-30, left_corner+150, bottom_corner)).text()
print name

The problem is that I get this error:

Traceback (most recent call last):
File "testePdfQuery.py", line 1, in <module>
import pdfquery
File "/home/ubuntu/Downloads/pdfquery-0.1.3/pdfquery/__init__.py", line 1, in <module>
from .pdfquery import PDFQuery
File "/home/ubuntu/Downloads/pdfquery-0.1.3/pdfquery/pdfquery.py", line 23, in <module>
cssselect.Function._xpath_in_bbox = _xpath_in_bbox
AttributeError: 'module' object has no attribute 'Function'

Any ideas how I can fix this and run the example? Thanks in advance.

Issue with multi page pdf

Hi there,

I am having trouble in this scenario.

The string I am matching is at the beginning of page 2. When I try to retrieve the lines below it using the method shown in the README, I get the result from the beginning of page 1 instead.

I am pretty sure this behavior is not intentional, and I am actually worried that I am not using the library right.

Could you take a look and let me know if I am doing it wrong?

Thanks in advance

use two or more consecutive 'in_bbox'

Hi guys! o/
I'd like to know if there is a way to run one 'in_bbox' followed by another 'in_bbox'. For example:

first_bbox = pdf_query.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
second_bbox = first_bbox.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
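If :in_bbox keeps only elements that lie entirely inside the given box (an assumption here), then applying two boxes one after the other should select the same elements as a single query on the overlap of the two boxes. The sketch below computes that overlap and formats the selector; intersect_bbox and in_bbox_selector are hypothetical helpers, not part of pdfquery, and the coordinates are made up.

```python
def intersect_bbox(a, b):
    """Return the overlap of two (x0, y0, x1, y1) boxes, or None if disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 > x1 or y0 > y1:
        return None
    return (x0, y0, x1, y1)


def in_bbox_selector(bbox, tag="LTTextLineHorizontal"):
    """Format a selector string in the style used in the question above."""
    return '%s:in_bbox("%s, %s, %s, %s")' % ((tag,) + tuple(bbox))


first = (0, 0, 300, 200)      # made-up coordinates
second = (100, 50, 400, 150)  # made-up coordinates
combined = intersect_bbox(first, second)
selector = in_bbox_selector(combined)
```

The resulting string could then be passed to pdf_query.pq(selector) in a single call. When the boxes do not overlap, intersect_bbox returns None and there is nothing to query.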

'PDFObjRef' object has no attribute '__getitem__'

Hello,

I'm trying to parse some PDF files using pdfquery, and for a couple of PDFs (not all of them) I receive the following error:

  File "my_path/my_script.py", line 244, in set_description
    pdf.load()
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 373, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 475, in get_tree
    for n, page in pages:
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 596, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 591, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 639, in _add_annots
    annot['URI'] = annot['A']['URI']
TypeError: 'PDFObjRef' object has no attribute '__getitem__'

Below is a list of a few of the PDFs that raise the above error:
http://www.genomecanada.ca/medias/pdf/en/genomesciencescentrebc.pdf
http://www.genomecanada.ca/medias/pdf/fr/genomesciencescentrebc.pdf
http://www.genomecanada.ca/medias/pdf/en/universityvictoria.pdf
http://www.genomecanada.ca/medias/pdf/fr/universityvictoria.pdf
http://www.genomecanada.ca/medias/pdf/fr/centreforappliedgenomicsogi.pdf

Maybe someone will be able to find a fix for it?

Thanks!

is there any user manual for this

File "abc.py", line 2, in <module>
import pdfquery
File "build\bdist.win32\egg\pdfquery\__init__.py", line 1, in <module>
File "build\bdist.win32\egg\pdfquery\pdfquery.py", line 31, in <module>
File "C:\Python27\lib\site-packages\pyquery\__init__.py", line 11, in <module>
from .pyquery import PyQuery
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 9, in <module>
from lxml import etree
ImportError: DLL load failed: %1 is not a valid Win32 application.

LTTextLineHorizontal.text is null while shows in layout

solved.

pdfquery newbie: I need some explanations (sorry if this is not the right place to ask)

Hello jcushman,
I read a lot of PDF documents. I don't create popup annotations; I only highlight text in yellow. I want to extract this highlighted text (with Python, pdfminer or pdfquery) so that I can index it with Whoosh for my studies. I saw that when text is highlighted, the object created in the PDF file looks like this, for example:

20 0 obj
<<
/C [1 1 0]
/F 4
/M (D:20141107203743+01'00')
/P 7 0 R
/T (pibol)
/AP <<
/N 31 0 R
/NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subj (Surligner)
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
807.385508 162.809979 807.385508]
/CreationDate (D:20141107203743+01'00')
endobj`<<

Unlike a classic popup annotation, there is no "/Contents" key here, and that is my problem. I have tried pdfminer, pyPDF, PyPDF2 and now pdfquery, but I am not a very experienced Python programmer and can't find a way to extract the line I want.

I have two questions:
Question 1 - With pdfquery I have tried this:

pdf = pdfquery.PDFQuery("c:\\myDocument.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("the line I want")')
print label

That gives me this:

<LTTextLineHorizontal bbox="[53.999, 313.813, 189.746, 324.91]" height="11.098" width="135.747" 
word_margin="0.1" x0="53.999" x1="189.746" y0="313.813" y1="324.91"><LTTextBoxHorizontal   
bbox="[53.999, 313.813, 189.746, 324.91]" height="11.098" index="10" width="135.747" 
x0="53.999" x1="189.746" y0="313.813" y1="324.91">the line I want </LTTextBoxHorizontal>
</LTTextLineHorizontal>

This gives me the coordinates of my text in the bbox attribute of LTTextLineHorizontal.
To test these coordinates, I wanted to retrieve the text, and only the text, using the ('with_formatter', 'text') option explained in your documentation, but how? I don't understand how to do this:

pdf.extract([('titleParagraf', ':in_bbox("53.999, 313.813, 189.746, 324.91")',('with_formatter', 'text')) ]) ??

Question 2 - Is it possible with pdfquery to find text highlighted in yellow and retrieve its coordinates, so that I can extract the text with something like pdf.extract([('aaaa', ':in_bbox(coordinatesOfTheHighlightedText)')])?

I hope I have not been too boring and that my explanations are clear enough. English is not my first language.
Thanks for your patience, and sorry if this is not the right place to ask for help.

Pibol

How to release a file lock on a pdf file (Windows), a.k.a. how to properly close a pdf after querying it ?

How can I run the following code without getting a WinError exception telling me that I cannot remove the pdf file because it is being used by another process (pdfquery):

import os
import pdfquery

filename = 'C:/Documents and Settings/Administrator/document_idc_1.pdf'
pdf = pdfquery.PDFQuery(filename)
pdf.load(1)

os.remove(filename)
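One approach that may help is to open the file yourself, so that you control when the handle is closed, and to delete the file only afterwards. The sketch below assumes, without confirming it here, that PDFQuery accepts an open file object and keeps no other handle on the file. process_then_delete and read_header are hypothetical helpers, and the demonstration uses a throwaway temporary file rather than a real PDF.

```python
import os
import tempfile


def process_then_delete(path, process):
    """Open path ourselves, run process(fp), close the handle, then delete."""
    with open(path, "rb") as fp:
        result = process(fp)
    # The with-block has closed the handle, so no lock of ours remains.
    os.remove(path)
    return result


# Hypothetical stand-in for code such as:
#     pdf = pdfquery.PDFQuery(fp); pdf.load(1)
def read_header(fp):
    return fp.read(8)


fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as out:
    out.write(b"%PDF-1.4\n")

header = process_then_delete(tmp_path, read_header)
```

Anything that reads from the file lazily has to be finished inside process, because the handle is closed as soon as the with-block ends.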

Problems running sample code

Hey, I've been trying to get the sample code working all day but I keep running into errors.
First, I tried

import pdfquery

pdf = pdfquery.PDFQuery("pdfs/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')

but it returned:
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 237, in __call__
(not PY3k and isinstance(args[0], basestring) or
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 213, in __init__
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 223, in _css_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in css_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 208, in selector_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 230, in xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 260, in xpath_function
File "C:\Python27\lib\site-packages\pyquery\cssselectpatch.py", line 196, in xpath_contains_function
def xpath_gt_function(self, xpath, function):
AttributeError: 'XPathExpr' object has no attribute 'add_post_condition'

After finding that this error occurs with cssselect 0.8.0, I downgraded to 0.7.1,

but then typing in
>>> import pdfquery

pdf = pdfquery.PDFQuery("C:/Users/Adam/Documents/visual studio 2012/Projects/PDFtoPythonData/PDFtoPythonData/pdfs/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
name = pdf.pq(':in_bbox("%s,%s,%s,%s,")' % (left_corner,bottom_corner-30, left_corner+150, bottom_corner)).text()

resulted in:
Traceback (most recent call last):
File "<pyshell#16>", line 1, in
name = pdf.pq(':in_bbox("%s,%s,%s,%s,")' % (left_corner,bottom_corner-30, left_corner+150, bottom_corner)).text()
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 241, in __call__
result = self.__class__(*args, parent=self, **kwargs)
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 216, in __init__
xpath = self._css_to_xpath(selector)
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 226, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in css_to_xpath
for selector in selectors)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in
for selector in selectors)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 208, in selector_to_xpath
xpath = self.xpath(tree)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 230, in xpath
return method(parsed_selector)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 259, in xpath_function
"The pseudo-class :%s() is unknown" % function.name)
ExpressionError: The pseudo-class :in_bbox() is unknown

Not sure if downgrading inadvertently broke things.

Pseudo classes not working

:first :last :even :odd :eq :lt :gt :checked :selected :file

These pseudo-classes do not work when I try to use them like this:

pdf.pq('LTTextLineHorizontal:last')

Can't get coordinates.

Hello
I can't get coordinates for my text "green-color-2-2-2". My Script returns "Red green-color-2-2-2"

import pdfquery
import sys
sys.setrecursionlimit(2000)
pdfpath = sys.argv[1]
inputstr = sys.argv[2]
page = int(sys.argv[3])
pdf = pdfquery.PDFQuery(pdfpath)
pdf.load(page)
label = pdf.pq('LTTextLineHorizontal:contains("'+inputstr+'")')[0].layout
print(label)

response

<LTTextLineHorizontal 167.320,142.577,244.579,157.770 u'Red green-color-2-2-2\n'>

How to get the text I need?
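The selector matches the whole text line, so the box that comes back covers "Red" as well. If per-character boxes are available (an assumption; the data below is made up and each entry must be exactly one character), the box of just the wanted substring can be computed as the union of its characters' boxes. substring_bbox is a hypothetical helper, not part of pdfquery.

```python
def substring_bbox(chars, needle):
    """Union box of the first occurrence of needle in a list of
    (character, (x0, y0, x1, y1)) pairs, or None if it does not occur."""
    if not needle:
        return None
    text = "".join(ch for ch, _ in chars)
    start = text.find(needle)
    if start < 0:
        return None
    boxes = [box for _, box in chars[start:start + len(needle)]]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Made-up character boxes for the text "Red go".
line = [
    ("R", (0, 0, 10, 12)),
    ("e", (10, 0, 18, 12)),
    ("d", (18, 0, 26, 12)),
    (" ", (26, 0, 30, 12)),
    ("g", (30, 0, 38, 12)),
    ("o", (38, 0, 46, 12)),
]
bbox = substring_bbox(line, "go")
```

Here bbox covers only "go" and leaves out the box of "Red".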

Please help with API usage

First, thank you for a great library.
My question is how I can extract text if I know the 'figure' name. For example, I need to extract text from the XObject named pssMO3_1. I can create an XML file with a command like pdf.tree.write(fxml, pretty_print=True, encoding="utf-8"), and this file will contain all the needed data under the figure name="pssMO3_1" tag:

$ cat out.xml 
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,2097.638,1629.921" rotate="0">
<figure name="pssMO3_1" bbox="1939.906,240.945,1955.906,445.945">
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,245.945,1952.732,251.281" size="5.336">B</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,251.281,1952.732,253.057" size="1.776">l</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,253.057,1952.732,257.505" size="4.448">a</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,257.505,1952.732,261.505" size="4.000">c</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,261.505,1952.732,265.505" size="4.000">k</text>
and so on...

How can I extract the text ('Black ...') using the library API?
Thanks in advance!
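As a fallback that does not answer the API question directly, the XML dump itself can be read with the standard library. The sketch below works on a shortened copy of the dump shown above and joins the character elements found under the named figure. figure_text is a hypothetical helper, and the tag names are those of the dump, which may differ from the tags in pdfquery's own tree.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the dump shown above.
SAMPLE = """<pages>
<page id="1" bbox="0.000,0.000,2097.638,1629.921" rotate="0">
<figure name="pssMO3_1" bbox="1939.906,240.945,1955.906,445.945">
<text font="JDMBFC+ArialUnicodeMS" size="5.336">B</text>
<text font="JDMBFC+ArialUnicodeMS" size="1.776">l</text>
<text font="JDMBFC+ArialUnicodeMS" size="4.448">a</text>
<text font="JDMBFC+ArialUnicodeMS" size="4.000">c</text>
<text font="JDMBFC+ArialUnicodeMS" size="4.000">k</text>
</figure>
</page>
</pages>"""


def figure_text(xml_string, figure_name):
    """Join the text of all <text> elements under the named <figure>."""
    root = ET.fromstring(xml_string)
    for figure in root.iter("figure"):
        if figure.get("name") == figure_name:
            return "".join(t.text or "" for t in figure.iter("text"))
    return None


word = figure_text(SAMPLE, "pssMO3_1")
```

For this sample, word is "Black". A figure name that does not occur yields None.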

Custom selectors don't support partial functions

From my SO question on the same issue.

Background

I'm using pdfquery to parse multiple files like this one.

Problem

I'm trying to write a generalized filter function, building off of the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because this is referenced inside the function, I thought I could get around the problem by supplying a partial function using functools.partial (as seen below).

Input

import pdfquery
import functools

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = 'Path to the file mentioned above'
pdf = load_file(file_with_table)


def elements_in_range(x1_range):
    return in_range(x1_range[0], x1_range[1], float(this.get('x1',0)))

x1_part = functools.partial(elements_in_range, (95,350))

pdf.pq('LTPage[page_index="0"] *').filter(x1_part)

But when I do that I get the following attribute error:

Output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
    597                     if len(args) == 1:
--> 598                         func_globals(selector)['this'] = this
    599                     if callback(selector, i, this):

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
     28 def func_globals(f):
---> 29     return f.__globals__ if PY3k else f.func_globals
     30 

AttributeError: 'functools.partial' object has no attribute '__globals__'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-74-d75c2c19f74b> in <module>()
     15 x1_part = functools.partial(elements_in_range, (95,350))
     16 
---> 17 pdf.pq('LTPage[page_index="0"] *').filter(x1_part)

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
    600                         elements.append(this)
    601             finally:
--> 602                 f_globals = func_globals(selector)
    603                 if 'this' in f_globals:
    604                     del f_globals['this']

C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
     27 
     28 def func_globals(f):
---> 29     return f.__globals__ if PY3k else f.func_globals
     30 
     31 

AttributeError: 'functools.partial' object has no attribute '__globals__'

Is there any way to get around this? Or possibly some other way to write a custom selector for pdfquery that can take arguments?

@jcushman
If this is a module-level problem, how difficult would it be to fix?

Other than that I'm really enjoying pdfquery. Thanks!
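One possible workaround for the `functools.partial` error above, as a minimal sketch: the traceback shows pyquery reading `__globals__` from the callback, which a plain nested function has and a partial object does not. The factory name `make_x1_filter` is hypothetical, and the sketch assumes pyquery passes the index and the current element positionally when the callback declares two parameters. A stdlib element stands in for a pdfquery element so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET


def make_x1_filter(low, high):
    """Return a filter callback that keeps elements whose x1 lies in [low, high]."""
    def x1_in_range(i, this):
        # Assumption: pyquery calls a two-parameter callback as callback(index, element).
        return low <= float(this.get('x1', 0)) <= high
    return x1_in_range


# Stand-in element for demonstration; with pdfquery the call would look like
#   pdf.pq('LTPage[page_index="0"] *').filter(make_x1_filter(95, 350))
sample = ET.Element('LTTextLineHorizontal', {'x1': '120.5'})
keep = make_x1_filter(95, 350)
print(keep(0, sample))
```

Because the range is captured by the closure, no global `this` and no `functools.partial` are needed.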

error with load() order

Hi,
I don't know if this is a bug, but when I try this:
`pdf = pdfquery.PDFQuery("d:\Travail\ myPDF.pdf")

document = pdf.load()`

I have this result:
` document = pdf.load()
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 373, in load

File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 475, in get_tree
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 596, in
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 591, in get_layout
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 639, in _add_annots
TypeError: 'PDFObjRef' object has no attribute '__getitem__'`

Bruno

Where'd my links go?

I'm trying to query the links in a document on a court website, but when I look at the XML, the links seem to be gone.

For example, the document I'm working with is here:

http://apps.courts.ky.gov/supreme/casesummaries/May2015.pdf

Not far down that PDF there's a link to:

http://opinions.kycourts.net/sc/2013-SC-000610-MR.pdf

But if I look at the XML (generated with pdf.tree.write('ky.xml', pretty_print=True, encoding='utf-8')), there don't seem to be any links. I've posted the XML here:

https://gist.github.com/mlissner/4cb1eb36e347c2dea00a

Any ideas, or is this something pdfquery doesn't support?

Thanks! It's been interesting playing with this.

cc: @brianwc

TypeError: object of type 'PSLiteral' has no len()

Error and stack trace superficially similar to #15

>>> pdf.load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 372, in get_tree
    v = smart_unicode_decode(v)
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 89, in smart_unicode_decode
    detected_encoding = chardet.detect(encoded_string)
  File "/usr/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 64, in feed
    aLen = len(aBuf)
TypeError: object of type 'PSLiteral' has no len()

And it's true - PSLiteral doesn't have a length. The following change at line 366 works:

if type(v) == list:
    v = unicode([smart_unicode_decode(item) for item in v])
elif hasattr(v.__class__, '__len__'):
    v = smart_unicode_decode(v)
else:
    v = smart_unicode_decode(v.name)

I don't know if it's actually the correct thing to do though. Maybe PSLiterals should just be dropped on the floor?

pdf.load() ValueError on pages with unicode.

I've been trying to load this PDF.
Pages 1 and 2 load fine, whereas pages 3 and 4 give:

  File "/home/reb/project/rowreader.py", line 62, in extract_rows
    self.pdf.load(page)  # page 2 in this case (which is page 3 in pdf)
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 373, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 476, in get_tree
    page = self._xmlize(page)
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 541, in _xmlize
    child = self._xmlize(child, root)
  File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 535, in _xmlize
    branch.text = node.get_text()
  File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:55347)
  File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:24667)
  File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:24516)
  File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32441)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

It's really peculiar: the only visible difference between pages 1-2 and 3-4 is that pages 3-4 contain Unicode star characters. Could those be the characters that break loading into the lxml tree?
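The ValueError text names NULL bytes and control characters rather than non-ASCII characters, so one way to test the theory is to strip only the characters that XML 1.0 forbids and see whether the stars survive. This is a minimal sketch: the helper name `strip_xml_illegal` is hypothetical, and where it would be applied (for example around the value returned by `node.get_text()`) is an assumption, not something confirmed in this thread.

```python
import re

# Characters outside the ranges that XML 1.0 allows in text.
_XML_ILLEGAL = re.compile(
    "[^\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_xml_illegal(text):
    """Remove characters that an XML tree would reject in element text."""
    return _XML_ILLEGAL.sub("", text)


# A NULL byte and a vertical tab are removed; the star (U+2605) is kept.
cleaned = strip_xml_illegal("row\x00 one \u2605\x0b")
print(repr(cleaned))
```

If the load succeeds after such a cleanup, the stars themselves were probably not the cause.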

unicode problem when processing doc.info

When I used pdfquery to process a scholarly PDF, I found a unicode problem at line 305 of pdfquery.py. The variable 'v' is of type str but holds a unicode character. For example, v could be '\xfc'. Since 'v' is a str, it is literally '\', 'x', 'f', 'c'.
Line 305,

        root.set(k, unicode(v))

raises a 'UnicodeDecodeError'. I suggest using

        root.set(k, v.decode('unicode-escape'))

KeyError: 'Resources' on some file

I get the error below when opening a file through my code at this file/repo.
I don't understand it, because the PDF file seems to be correctly formatted. You can find the file here.

Traceback (most recent call last):
  File "lltToJson.py", line 187, in <module>
    occurences = getFolder()
  File "lltToJson.py", line 173, in getFolder
    occurences[identifier] += getFile(join(path,f))
  File "lltToJson.py", line 111, in getFile
    pdf.load()
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 370, in get_tree
    pages = enumerate(self.get_layouts())
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 470, in get_layouts
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 497, in _cached_pages
    self._pages += list(self._pages_iter)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 518, in get_pages
    yield PDFPage(self, pageid, tree)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 257, in __init__
    self.resources = resolve1(self.attrs['Resources'])
KeyError: 'Resources'

I have literally no idea why it does not work...

pdf query not catching some text in page

I am using the following code

tax = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
col['tax']['start'], line, col['tax']['end'], line+10)).text()

I expect this to catch text such as '8 G1' or '32 G1'.
It catches the value '32 G1' but not '8 G1'.
In fact, no single-digit value is caught here.
This is the line in my PDF:
'589 TKTT 1253925356 14APR17 FVVV D CA 4,440 3,425 8 G1 450 YQ
75 YR 3,386 1.01 39 0.00'
It catches the values at the positions before and after this one, but not this one, nor others in similar situations.
Please help.
Mayuresh A
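One thing that may be worth checking here, offered as an unconfirmed guess: `:in_bbox` only matches elements that fit entirely inside the box, so a text line whose own box sticks out slightly is skipped, whereas pdfquery's `:overlaps_bbox` selector also matches partial overlaps. The sketch below only illustrates that difference with plain tuples; the function names and coordinates are made up for the example and are not pdfquery's implementation.

```python
def in_bbox(inner, outer):
    """True when inner (x0, y0, x1, y1) lies entirely inside outer."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def overlaps_bbox(a, b):
    """True when the two boxes share any area."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]


query = (400.0, 500.0, 450.0, 510.0)
text_line = (395.0, 501.0, 430.0, 509.0)  # starts 5 points left of the query box
print(in_bbox(text_line, query), overlaps_bbox(text_line, query))
```

Dumping the tree to XML and comparing the x0/x1 attributes of the missed line against the column boundaries would show whether this is what happens.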

Syntax Error on Python 2.6.6

Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pdfquery/__init__.py", line 1, in <module>
    from .pdfquery import PDFQuery
  File "pdfquery/pdfquery.py", line 45
    _comp_bbox_keys_required = {'x0', 'x1', 'y0', 'y1'}
                                    ^
SyntaxError: invalid syntax

Documentation on caching

I think there's a small mistake in the documentation for caching: FileCache is imported as 'FileCache' but called as 'pdfquery.FileCache'. This works for me:

import pdfquery
from pdfquery.cache import FileCache
pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf", parse_tree_cacher=FileCache("/tmp/"))

(It also seems to be formatted differently from the other examples.)

'PDFObjRef' object does not support indexing

`import pdfquery
import sys

pdf = pdfquery.PDFQuery(sys.argv[1])
pdf.load()`

Traceback (most recent call last):
  File "bin/parse_pdf.py", line 6, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
    for n, page in pages:
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 647, in _add_annots
    annot = self._set_hwxy_attrs(annot)
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 665, in _set_hwxy_attrs
    attr['x0'] = bbox[0]
TypeError: 'PDFObjRef' object does not support indexing

pdf.load() in pdfquery.py - 'dict' object has no attribute 'resolve'

Using Python version 3.5 (w/ Anaconda 2.4.0); sorry I don't have much more to add than a bug report. I've just been looking for something in Python 3.x to convert a PDF into text and preserving its layout (a la pdftotext from poppler)...so pdfquery is probably beyond my plaintext needs. But figured you'd be interested in knowing.

Reproducible code:

curl \
  https://static.googleusercontent.com/media/www.google.com/en//selfdrivingcar/files/reports/report-0515.pdf \
  -o g.pdf

import pdfquery
pdf = pdfquery.PDFQuery("g.pdf")
pdf.load()

AttributeError                            Traceback (most recent call last)
<ipython-input-3-4357470f507b> in <module>()
----> 1 pdf.load()

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in load(self, *page_numbers)
    381         [<LTPage>, <LTPage>]
    382         """
--> 383         self.tree = self.get_tree(*_flatten(page_numbers))
    384         self.pq = self.get_pyquery(self.tree)
    385 

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in get_tree(self, *page_numbers)
    483                 else:
    484                     pages = enumerate(self.get_layouts())
--> 485                 for n, page in pages:
    486                     page = self._xmlize(page)
    487                     page.set('page_index', obj_to_string(n))

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in <genexpr>(.0)
    604     def get_layouts(self):
    605         """ Get list of PDFMiner Layout objects for each page. """
--> 606         return (self.get_layout(page) for page in self._cached_pages())
    607 
    608     def _cached_pages(self, target_page=-1):

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in get_layout(self, page)
    599         self.interpreter.process_page(page)
    600         layout = self.device.get_result()
--> 601         layout = self._add_annots(layout, page.annots)
    602         return layout
    603 

/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in _add_annots(self, layout, annots)
    642                 annots = annots.resolve()
    643             for annot in annots:
--> 644                 annot = annot.resolve()
    645                 if annot.get('Rect') is not None:
    646                     annot['bbox'] = annot.pop('Rect')  # Rename key

AttributeError: 'dict' object has no attribute 'resolve'
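The traceback suggests that the annotation list can contain plain dicts as well as indirect references, so calling `.resolve()` unconditionally fails on the dicts. A defensive pattern, sketched here with a hypothetical `FakeRef` stand-in so it runs without a PDF, is to resolve only objects that actually offer `resolve`. The helper names are illustrative; pdfminer's own `resolve1` function in `pdfminer.pdftypes` plays a similar role for real `PDFObjRef` objects.

```python
class FakeRef(object):
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_all(obj):
    """Follow .resolve() until a plain (non-reference) object is reached."""
    while hasattr(obj, 'resolve'):
        obj = obj.resolve()
    return obj


def normalize_annots(annots):
    """Return annotations as a list of plain objects, whether or not they are references."""
    annots = resolve_all(annots) or []
    return [resolve_all(annot) for annot in annots]


# Mixed input: a reference to a list holding one dict and one reference to a dict.
mixed = FakeRef([{'Subtype': 'Link'}, FakeRef({'Subtype': 'Text'})])
print(normalize_annots(mixed))
```

The same guard would also cover the case where the whole annotation array, or a value such as the bounding box, arrives as a reference rather than a list.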

lxml.etree.XPathEvalError: Invalid expression

I can't get the example from the README working.

This is what I have done:

$ sudo easy_install pip
$ sudo pip install pdfquery
$ wget https://raw.github.com/jcushman/pdfquery/master/examples/sample.pdf
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("sample.pdf")
>>> pdf.load()
>>> label = pdf.pq(':contains("Your first name and initial")')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
    result = self.__class__(*args, parent=self, **kwargs)
  File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
    for tag in elements]
  File "lxml.etree.pyx", line 1444, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:41726)
  File "xpath.pxi", line 321, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:117867)
  File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:117044)
  File "xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116913)
lxml.etree.XPathEvalError: Invalid expression

I'm using Mac OS X 10.7.4. Output from pip freeze at https://gist.github.com/3082390 if that can help in any way. (I'm not a Python guy.)

Save pdf

Any chance of saving a pdf in the pdf format instead of pdfxml? I would really prefer to use pdfquery in pdfparanoia rather than manual text manipulation of pdf streams.

chardet 3.0 seems to have broken something

In pdfquery.py, in smart_unicode_decode is this:

# detect encoding
detected_encoding = chardet.detect(encoded_string)

With chardet 2.3.0, detected_encoding is {'confidence': 0.0, 'encoding': None}

With chardet 3.0.1 (newest as of time of writing this), detected_encoding is None

So it's crashing on the next line, where it does detected_encoding['encoding'].

Presumably, the fix is as simple as changing:

encoding=detected_encoding['encoding'] or 'utf8',

to:

encoding=detected_encoding['encoding'] if (detected_encoding and detected_encoding['encoding']) else 'utf8',
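The same guard can be written as a small helper that tolerates both result shapes reported above. This is a sketch only; the function name `pick_encoding` is hypothetical and not part of pdfquery or chardet.

```python
def pick_encoding(detected, default='utf8'):
    """Return the detected encoding, or a default when detection yields nothing."""
    if detected and detected.get('encoding'):
        return detected['encoding']
    return default


print(pick_encoding({'confidence': 0.0, 'encoding': None}))  # shape reported for chardet 2.3.0
print(pick_encoding(None))                                   # shape reported for chardet 3.0.1
print(pick_encoding({'confidence': 0.9, 'encoding': 'ISO-8859-1'}))
```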

Get pageid of a search object

Hi, I am a newbie with Python and pdfquery. I am writing a Python program to extract information from PDF files and then insert it into a Word document. I am having trouble with one particular object: "minor spill". Specifically, I am trying to scrape the content of the paragraph underneath "6.3 Methods and materials for containment and cleaning up", on page 2 of the PDF file. The content I want is "Contain spillage, and then collect with an electrically protected vacuum cleaner or by wet-brushing and place in container for disposal according to local regulations (see section 13). Keep in suitable, closed containers for disposal." The problem is that for this particular PDF file, my code also extracts "Product This combustible material may be burned in a chemical incinerator equipped with an afterburner and scrubber. Offer surplus and non-recyclable solutions to a licensed disposal company." on p. 5. Because I want to work with many PDF files that might have the "6.3..." content on different pages, I figure that if I can pass the pageid to the extract call, it should be fine.
My question is: is there a way to get the pageid of an object (for example, "minor_spill" in my code)?
My code is below, and I have also attached the PDF file:
https://pastebin.com/rwseBSZV

Thank you very much!
PDF file:
932-66-1.pdf
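One way to find the page of a matched element is to walk up the tree to its LTPage ancestor and read that element's attributes. The sketch below uses a tiny hand-built tree and the standard library so it runs on its own; the helper name `page_index_of` is hypothetical, and the tag and attribute names are taken from other issues on this page rather than from the attached PDF.

```python
import xml.etree.ElementTree as ET


def page_index_of(root, element):
    """Walk up from element to its LTPage ancestor and return that page's page_index."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None and node.tag != 'LTPage':
        node = parent_of.get(node)
    return None if node is None else node.get('page_index')


# Tiny hand-built tree standing in for pdf.tree.
root = ET.fromstring(
    '<pdfxml>'
    '<LTPage page_index="0"><LTTextLineHorizontal>intro</LTTextLineHorizontal></LTPage>'
    '<LTPage page_index="1"><LTTextBoxHorizontal>'
    '<LTTextLineHorizontal>6.3 Methods and materials</LTTextLineHorizontal>'
    '</LTTextBoxHorizontal></LTPage>'
    '</pdfxml>'
)
target = [el for el in root.iter('LTTextLineHorizontal') if '6.3' in (el.text or '')][0]
print(page_index_of(root, target))
```

Since pdfquery's tree is built with lxml, `element.iterancestors('LTPage')` should reach the same ancestor without the parent map.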

cssselect/parser.py SelectorSyntaxError on sample code

Running the sample code, I'm getting "SelectorSyntaxError: Expected string or ident" from cssselect/parser.py.

Any clue what this could be?

Text in PDF has an extra LTTextBoxHorizontal whereas similar text elsewhere doesn't

Here's the deal. I did this on a PDF:

pdf.extract([
  ('with_parent', 'LTPage[pageid="1"]'),
  ('name', 'LTTextLineHorizontal:contains("24x7 claims assistance")')
])

I got a [<LTTextLineHorizontal>]. Let's say I assign it to a variable result. Then,

In [61]: result
Out[61]: [<LTTextLineHorizontal>]

In [62]: result[0]
Out[62]: <Element LTTextLineHorizontal at 0x10dc621b0>

In [63]: result[0][0]
Out[63]: <Element LTTextBoxHorizontal at 0x10dc62158>

In the same PDF, I do this:

pdf.extract([
  ('with_parent', 'LTPage[pageid="1"]'),
  ('name', 'LTTextLineHorizontal:contains("9810510983")')
])

Again I got a [<LTTextLineHorizontal>]. Then,

In [70]: result
Out[61]: [<LTTextLineHorizontal>]

In [71]: result[0]
Out[62]: <Element LTTextLineHorizontal at 0x10dc621b0>

In [72]: result[0][0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-64-223da891ff7c> in <module>()
----> 1 result[0][0]

lxml.etree.pyx in lxml.etree._Element.__getitem__ (src/lxml/lxml.etree.c:47744)()

IndexError: list index out of range

I went through the code. I see that you're using _clean_text() in pdfquery.py to keep the text value in the leaf node and erase the value out of its parents.

I'm sorry, but I couldn't find enough time to debug it fully. Does anyone know why this would happen?

Getting `TypeError: 'PDFObjRef' object is not iterable`

When trying to load a PDF, I get the following error

TypeError: 'PDFObjRef' object is not iterable

The error happens at pdfquery/pdfquery.py line 631

def _add_annots(self, layout, annots):
        """Adds annotations to the layout object
        """
        if annots: # and not isinstance(annots, PDFObjRef):
            for annot in annots:
                annot = annot.resolve()
                if annot.get('Rect') is not None:
                    annot['bbox'] = annot.pop('Rect')  # Rename key
                    annot = self._set_hwxy_attrs(annot)
                try:
                    annot['URI'] = annot['A']['URI']
                except KeyError:
                    pass
                for k, v in annot.iteritems():
                    if not isinstance(v, basestring):
                        annot[k] = unicode_decode_object(v)
                elem = parser.makeelement('Annot', annot)
                layout.add(elem)
        return layout

The error goes away when I enable the second check that is commented out in the code above.

'dict' object has no attribute 'resolve'

L123 (master) settings = nums[i+1].resolve().

I have a script that uses pdfquery to grab annotated text. The script works for some PDFs but not others. For the PDF where it fails, this line is executed; for the PDF where it works, it is not.

I tried a bit of debugging, but I don't understand this code at all. It happened in version 0.2.3, and I upgraded to see if it would be different, but alas no. Any tips on how to debug this would be great, thanks.

NB: Replacing this line with settings = nums[i+1] stopped the errors and the script worked as expected.

ValueError: Invalid attribute name u'AAPL:AKExtras'

Processing a PDF with annotations that have a colon in their key value gives an exception:

Traceback (most recent call last):
  File "test_ocr.py", line 633, in test_petition
    analyze = analyze_bankruptcy_petition(pdf_txt = pdf_txt, pdf_fp = file)
  File "program.py", line 255, in analyze_bankruptcy_petition
    pdfq.load(*pages_to_analyze)
  File "..\libs\pdfquery\pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "..\libs\pdfquery\pdfquery.py", line 484, in get_tree
    _flatten(page_numbers)]
  File "..\libs\pdfquery\pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "..\libs\pdfquery\pdfquery.py", line 663, in _add_annots
    elem = parser.makeelement('Annot', annot)
  File "parser.pxi", line 878, in lxml.etree._BaseParser.makeelement (src/lxml/lxml.etree.c:74798)
  File "apihelpers.pxi", line 156, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12231)
  File "apihelpers.pxi", line 144, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12106)
  File "apihelpers.pxi", line 298, in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:13603)
  File "apihelpers.pxi", line 1554, in lxml.etree._attributeValidOrRaise (src/lxml/lxml.etree.c:24197)
ValueError: Invalid attribute name u'AAPL:AKExtras'
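A colon in an XML attribute name is read as a namespace prefix separator, which is why a key such as AAPL:AKExtras is rejected. One possible workaround, sketched here with a hypothetical helper name, is to rewrite annotation keys into a conservative ASCII form before they become attributes; whether renaming or dropping such keys is the better fix for pdfquery is left open.

```python
import re


def sanitize_attr_name(name):
    """Turn an arbitrary annotation key into a name usable as an XML attribute."""
    # Replace anything outside a conservative ASCII subset, including ':'.
    safe = re.sub(r'[^A-Za-z0-9_.-]', '_', name)
    # XML names may not start with a digit, '.' or '-'.
    if not safe or not re.match(r'[A-Za-z_]', safe):
        safe = '_' + safe
    return safe


annot = {'AAPL:AKExtras': 'x', 'Subtype': 'Link'}
cleaned = dict((sanitize_attr_name(k), v) for k, v in annot.items())
print(sorted(cleaned))
```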

Documentation correction

LTPage[page_index=1] should be LTPage[page_index="1"], in the few places it is mentioned
thank you.

Total number of pages

How can I find out how many pages a document has?
pages = pdf.doc.catalog['Pages'] returns PDFObjRef:2
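The PDFObjRef:2 value is an indirect reference to the page tree root, and in the PDF format that root dictionary carries a Count entry with the total number of pages. The sketch below shows the idea with a hypothetical `StubRef` stand-in so it runs without a PDF; with a real document the reference would need to be resolved first (for example with pdfminer's `resolve1`), and the page count value used here is only sample data.

```python
class StubRef(object):
    """Hypothetical stand-in for the reference stored under catalog['Pages']."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def page_count(catalog):
    """Resolve the page tree root and read its Count entry."""
    pages = catalog['Pages']
    while hasattr(pages, 'resolve'):
        pages = pages.resolve()
    return pages['Count']


# Sample catalog; the Count value is illustrative only.
catalog = {'Pages': StubRef({'Type': 'Pages', 'Count': 438, 'Kids': []})}
print(page_count(catalog))
```

After `pdf.load()`, counting the LTPage elements in the parsed tree is another option, although that only covers the pages that were actually loaded.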

438 page PDF takes ~700 sec and ~4GB RAM to load

http://www.atmel.com/Images/Atmel-7766-8-bit-AVR-ATmega16U4-32U4_Datasheet.pdf

During these 700 seconds, only one core is working at 100%. Would it be possible to split the work eight ways in my case by letting the rest of the cores in on the fun?

Is this the expected load time? This is on an Intel Core i7-7700K (4 cores with HT, so 8 threads), 16 GB RAM, macOS Sierra, Python 3.6.

However, using FileCache brings subsequent runs down to a load time of 0.8 seconds.

Perhaps you could provide some expected performance metrics in the README to quantify "runs very slowly"?

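import time
import pdfquery
from pdfquery.cache import FileCache

# args is parsed elsewhere in the script (not shown)
pdf = pdfquery.PDFQuery(args.datasheet, parse_tree_cacher=FileCache("/tmp/"))
t0 = time.time()
pdf.load()
t1 = time.time()

print("Loaded in {} seconds".format(t1-t0))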
