jcushman / pdfquery
A fast and friendly PDF scraping library.
License: MIT License
Is it possible to initialise PDFQuery directly from the byte contents of a PDF?
My use case is of a server where the PDF is uploaded and saved in a database as a blob (technically in MongoDB GridFS). The content of the PDF is available to me in memory.
Currently I have created a class to act as a proxy for a file object.
class PseudoPDFFile(object):
    """
    Offers a pseudo file interface for pdfquery to load the PDF from memory
    """
    def __init__(self, content):
        self.content = content

    def read(self):
        return self.content
Is there a way to avoid it?
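One possible way to avoid a hand-written proxy class is the standard library's io.BytesIO, which wraps bytes in a seekable file-like object. The sketch below uses only the standard library; the placeholder bytes and the helper name as_file_object are made up for illustration, and whether PDFQuery accepts such an object in place of a path is an assumption noted in the comments.

```python
import io


def as_file_object(content):
    """Wrap raw PDF bytes in a seekable, file-like object (hypothetical helper)."""
    return io.BytesIO(content)


# Placeholder bytes standing in for the blob fetched from the database.
pdf_bytes = b"%PDF-1.4\nplaceholder body"

fp = as_file_object(pdf_bytes)
header = fp.read(8)   # file-style read works
fp.seek(0)            # seeking works too, which parsers usually need

# Assumption: pdfquery.PDFQuery(fp) would accept this object instead of a filename.
```

Unlike the proxy class above, BytesIO also provides seek() and tell(), which a PDF parser is likely to call.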
Given a = pdf.pq('LTTextLineHorizontal').items().next(), calling a.find(':in_bbox("x0,y0,x1,y1")') raises an ExpressionError: The pseudo-class :in_bbox() is unknown.
a.parent('LTPage') returns an empty list, even though a.parents().filter(lambda i, a: a.tag == 'LTPage') returns the expected parent (assume here that the LTPage is the direct parent of the element matched by a).
These two calls would have succeeded had a not been a result of the items iterator, like a = pdf.pq('LTTextLineHorizontal[index="13"]').
Is there a plan to make pdfquery compatible with Python 3?
When installing pdfquery, pdfminer version 2014038 is installed as a dependency. However, the six version of pdfminer should be installed.
While loading a document using PDFQuery.load(), I got the following error:
354 objid = spec.objid
355 spec = dict_value(spec)
--> 356 self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
357 elif k == 'ColorSpace':
358 for (csid, spec) in dict_value(v).iteritems():
/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdfinterp.pyc in get_font(self, objid, spec)
202 if k in spec:
203 subspec[k] = resolve1(spec[k])
--> 204 font = self.get_font(None, subspec)
205 else:
206 if STRICT:
/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdfinterp.pyc in get_font(self, objid, spec)
193 elif subtype in ('CIDFontType0', 'CIDFontType2'):
194 # CID Font
--> 195 font = PDFCIDFont(self, spec)
196 elif subtype == 'Type0':
197 # Type0 Font
/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdffont.pyc in __init__(self, rsrcmgr, spec)
663 self.fontfile = stream_value(descriptor.get('FontFile2'))
664 ttf = TrueTypeFont(self.basefont,
--> 665 BytesIO(self.fontfile.get_data()))
666 self.unicode_map = None
667 if 'ToUnicode' in spec:
/home/glyfix/projects/ENV/glyfix/local/lib/python2.7/site-packages/pdfminer/pdffont.pyc in __init__(self, name, fp)
384 (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
385 for _ in xrange(ntables):
--> 386 (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
387 self.tables[name] = (offset, length)
388 return
error: unpack requires a string argument of length 16
Found an issue when upgrading from pdfquery 0.2.7 to 0.4.3. Looks like starting in 0.3.0, support for annotations was added. This is what appears to be happening. In the _add_annots() method in pdfquery.py, an annotation object is found by pdfminer. _add_annots() retrieves this object and converts all information into strings (via obj_to_string()). This method is called again and pdfminer returns a cached version of the annotation object, only this time, all the information has been converted into strings by pdfquery. This leads to an error on line 649:
annot['URI'] = resolve1(annot['A'])['URI']
The first time through _add_annots(), resolve1(annot['A']) returns a dict with 'URI' being one of the keys. On the second time through, annot['A'] is a string representation (converted by obj_to_string) of that dict and so the line fails.
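The failure mode described here can be illustrated without pdfquery. The sketch below is a stand-in only: the _cache dict, get_cached and the two stringify helpers are hypothetical names, not pdfminer or pdfquery internals. It shows how converting a cached object in place corrupts it for the next caller, while converting a copy does not.

```python
# Stand-in for a parser-level object cache (hypothetical, not pdfminer's real structure).
_cache = {20: {"Subtype": "Link", "A": {"URI": "http://example.com"}}}


def get_cached(objid):
    """Return the cached object itself, as a caching parser might."""
    return _cache[objid]


def stringify_in_place(annot):
    """Buggy pattern: overwrites the cached values with their string forms."""
    for key in list(annot):
        annot[key] = str(annot[key])
    return annot


def stringify_copy(annot):
    """Safer pattern: build a new dict and leave the cached object untouched."""
    return {key: str(value) for key, value in annot.items()}


first = stringify_copy(get_cached(20))
still_dict = isinstance(get_cached(20)["A"], dict)   # cache unchanged

stringify_in_place(get_cached(20))
now_str = isinstance(get_cached(20)["A"], str)       # cache corrupted for later calls
```

After the in-place version runs, a second lookup of annot['A'] yields a string, so indexing it with 'URI' fails in the way the report describes.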
I've attached a PDF file (annot.pdf) to show the problem. This file only has one line of text (a company's home page URL) which is being seen as an annotation.
This error has been found with:
If there's any other information that would help, let me know.
I am trying to read a pdf that contains Chinese (this one):
import pdfquery
pdf = pdfquery.PDFQuery("Table_A_17Sep2014.pdf")
pdf.load()
error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-4-71df8f58767e> in <module>()
2
3 pdf = pdfquery.PDFQuery("Table_A_17Sep2014.pdf")
----> 4 pdf.load()
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in load(self, *page_numbers)
319 [<LTPage>, <LTPage>]
320 """
--> 321 self.tree = self.get_tree(*_flatten(page_numbers))
322 self.pq = self.get_pyquery(self.tree)
323
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in get_tree(self, *page_numbers)
413 pages = enumerate(self.get_layouts())
414 for n, page in pages:
--> 415 page = self._xmlize(page)
416 page.set('page_index', unicode(n))
417 page.set('page_label', self.doc.get_page_number(n))
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
467 last = None
468 for child in node:
--> 469 child = self._xmlize(child, root)
470 if self.merge_tags and child.tag in self.merge_tags:
471 if branch.text and child.text in branch.text:
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
467 last = None
468 for child in node:
--> 469 child = self._xmlize(child, root)
470 if self.merge_tags and child.tag in self.merge_tags:
471 if branch.text and child.text in branch.text:
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
467 last = None
468 for child in node:
--> 469 child = self._xmlize(child, root)
470 if self.merge_tags and child.tag in self.merge_tags:
471 if branch.text and child.text in branch.text:
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _xmlize(self, node, root)
448 tags.update( self._getattrs(node, 'colorspace','bits','imagemask','srcsize','stream','name','pts','linewidth') )
449 elif type(node) == LTChar:
--> 450 tags.update( self._getattrs(node, 'fontname','adv','upright','size') )
451 elif type(node) == LTPage:
452 tags.update( self._getattrs(node, 'pageid','rotate') )
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in _getattrs(self, obj, *attrs)
486 def _getattrs(self, obj, *attrs):
487 """ Return dictionary of given attrs on given object, if they exist, processing through filter_value(). """
--> 488 return dict( (attr, unicode(self._filter_value(getattr(obj, attr)))) for attr in attrs if hasattr(obj, attr))
489
490 def _filter_value(self, val):
/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.pyc in <genexpr>((attr,))
486 def _getattrs(self, obj, *attrs):
487 """ Return dictionary of given attrs on given object, if they exist, processing through filter_value(). """
--> 488 return dict( (attr, unicode(self._filter_value(getattr(obj, attr)))) for attr in attrs if hasattr(obj, attr))
489
490 def _filter_value(self, val):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb7 in position 7: ordinal not in range(128)
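The message suggests that a byte string containing a non-ASCII byte was converted with the default ASCII codec. The sketch below reproduces that kind of failure in isolation (written for Python 3, with a made-up byte string; the real attribute value and its correct encoding are not known from the traceback).

```python
raw = b"ABCDEFG\xb7"          # hypothetical value; byte 0xb7 sits at index 7

try:
    raw.decode("ascii")        # what an implicit ASCII conversion amounts to
    message = None
    position = None
except UnicodeDecodeError as exc:
    message = str(exc)
    position = exc.start

# Decoding with an explicit codec avoids the failure; latin-1 maps every byte,
# though the correct codec for a real font name may differ.
text = raw.decode("latin-1")
```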
Using Python 3.5.3 on Linux I get errors about missing bdist_wheel when installing pdfquery like this:
python3 -m venv .pyvenv
source .pyvenv/bin/activate
pip install pdfquery
When I do pip install wheel before installing pdfquery, setup outputs no errors. So should wheel be added to the dependencies?
The output I get when omitting pip install wheel:
$ pip install pdfquery
Collecting pdfquery
Using cached pdfquery-0.4.3.tar.gz
Collecting cssselect>=0.7.1 (from pdfquery)
Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting chardet (from pdfquery)
Using cached chardet-3.0.4-py2.py3-none-any.whl
Collecting lxml>=3.0 (from pdfquery)
Using cached lxml-4.0.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting pdfminer.six (from pdfquery)
Using cached pdfminer.six-20170720.tar.gz
Collecting pyquery>=1.2.2 (from pdfquery)
Using cached pyquery-1.2.17-py2.py3-none-any.whl
Collecting roman>=1.4.0 (from pdfquery)
Using cached roman-2.0.0.zip
Collecting six (from pdfminer.six->pdfquery)
Using cached six-1.11.0-py2.py3-none-any.whl
Collecting pycryptodome (from pdfminer.six->pdfquery)
Using cached pycryptodome-3.4.7.tar.gz
Building wheels for collected packages: pdfquery, pdfminer.six, roman, pycryptodome
Running setup.py bdist_wheel for pdfquery ... error
Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pdfquery/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpgbpb2kiapip-wheel- --python-tag cp35:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
Failed building wheel for pdfquery
Running setup.py clean for pdfquery
Running setup.py bdist_wheel for pdfminer.six ... error
Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pdfminer.six/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpliwxpa75pip-wheel- --python-tag cp35:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
Failed building wheel for pdfminer.six
Running setup.py clean for pdfminer.six
Running setup.py bdist_wheel for roman ... error
Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/roman/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpf7270_kjpip-wheel- --python-tag cp35:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
Failed building wheel for roman
Running setup.py clean for roman
Running setup.py bdist_wheel for pycryptodome ... error
Complete output from command /XYZ/.pyvenv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-ahjiqctx/pycryptodome/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpd5nhx36kpip-wheel- --python-tag cp35:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
Failed building wheel for pycryptodome
Running setup.py clean for pycryptodome
Failed to build pdfquery pdfminer.six roman pycryptodome
Installing collected packages: cssselect, chardet, lxml, six, pycryptodome, pdfminer.six, pyquery, roman, pdfquery
Running setup.py install for pycryptodome ... done
Running setup.py install for pdfminer.six ... done
Running setup.py install for roman ... done
Running setup.py install for pdfquery ... done
Successfully installed chardet-3.0.4 cssselect-1.0.1 lxml-4.0.0 pdfminer.six-20170720 pdfquery-0.4.3 pycryptodome-3.4.7 pyquery-1.2.17 roman-2.0.0 six-1.11.0
When I run the code below on multiple PDFs, it pulls duplicate values randomly from each box. I examined the .XML file to make sure there weren't two text boxes layered upon each other, and found no instances of duplicates for each page.
When I say the duplicates are created randomly, I mean that the number of duplicates, which values are duplicated, and the order in which they are pulled into text are random.
I'm curious whether you've seen this before and if there is a fix. It's possible that the PDFs themselves are the problem. Let me know if access to the XML file might help. I can probably strip the sensitive information and send it.
Any help would be greatly appreciated!
An example of the text in the box is shown in the image below. I cannot share the whole PDF due to confidentiality.
# import modules from Python libraries
import xlwt
import pdfquery
import csv
import re

pages = raw_input('Please enter the number of pages in the document: ')
# convert user input to integer
pages = int(pages)
# Path to pdf file for PDFQuery access. PDFQuery is the program that pulls in the data from the pdf
# (raw string, so the backslashes in the Windows path are not treated as escapes)
pdf = pdfquery.PDFQuery(r'D:\New Storage\Coding\Python Projects\Iso Pull\Lack.pdf')
# load pdf to active for PDFQuery
pdf.load(range(0,5))
# cycle through page numbers
for pagenumber in range(0,pages):
    # create a string sub to avoid messiness in the pdf.pq page number callout
    pagesub = 'LTPage[page_index="%s"]' % pagenumber
    # find text in boxes. boxes are inches*72. Lower left corner of box to upper right
    # Also, keep in mind coordinates of BOM and Iso number may need tweaking due to coordinate find
    Item = pdf.pq(pagesub + ' :in_bbox("947.52,379.44,960.48,750.16")').text()
    QTY = pdf.pq(pagesub + ' :in_bbox("960.48,379.44,987.12,750.16")').text()
    Size = pdf.pq(pagesub + ' :in_bbox("987.12,379.44,1020.24,750.16")').text()
    Sch_Minwall = pdf.pq(pagesub + ' :in_bbox("1020.24,379.44,1059.12,750.16")').text()
    Description2 = pdf.pq(pagesub + ' :in_bbox("1059.12,379.44,1203.84,750.16")').text()
I tried to load a PDF and received the following error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I installed pdfquery using pip and also by cloning directly from GitHub, but the error persists. Whenever I try to open a PDF using pdfquery.PDFQuery("file-name"), it shows the following error:
pdf = pdfquery.PDFQuery("/home/bipin/Documents/ProblemAssignment/file.pdf")
Traceback (most recent call last):
File "", line 1, in
File "/home/bipin/src/pdfquery/pdfquery/pdfquery.py", line 187, in init
doc.initialize()
AttributeError: 'QPDFDocument' object has no attribute 'initialize'
I tried a different file and searched the Internet but could not find a solution. Please help me.
I think this time it is your python and not pdfminer. (Let's hope ?) File available here
Traceback (most recent call last):
File "lltToJson.py", line 521, in <module>
main(sys.argv[1:])
File "lltToJson.py", line 494, in main
occurences = llt.getFolder()
File "lltToJson.py", line 227, in getFolder
occurences[identifier] += self.getFile(join(path,f))
File "lltToJson.py", line 164, in getFile
pdf.load()
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 365, in get_tree
root.set(k, smart_unicode_decode(v))
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 89, in smart_unicode_decode
detected_encoding = chardet.detect(encoded_string)
File "/usr/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
u.feed(aBuf)
File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 64, in feed
aLen = len(aBuf)
TypeError: object of type 'PDFObjRef' has no len()
Does pdfquery have CJK language support?
I have a very large PDF (about 1000 pages). Since I didn't think it would be wise to load the entire PDF into memory at the same time, I decided to iterate over the pages, calling pdf.load on each page individually thinking this would only load one page in at a time. However, it seems that memory usage continues to grow every time pdf.load is called, like the previous data is not being released. Any ideas? I'm running out of memory (16GB) after about 400 pages.
I just installed pdfquery in my machine, and I'm trying to run the example sample code:
import pdfquery
pdf = pdfquery.PDFQuery("examples/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
name = pdf.pq(':in_bbox("%s, %s, %s, %s")' % (left_corner, bottom_corner-30, left_corner+150, bottom_corner)).text()
print name
The problem is that I get this error:
Traceback (most recent call last):
File "testePdfQuery.py", line 1, in <module>
import pdfquery
File "/home/ubuntu/Downloads/pdfquery-0.1.3/pdfquery/__init__.py", line 1, in <module>
from .pdfquery import PDFQuery
File "/home/ubuntu/Downloads/pdfquery-0.1.3/pdfquery/pdfquery.py", line 23, in <module>
cssselect.Function._xpath_in_bbox = _xpath_in_bbox
AttributeError: 'module' object has no attribute 'Function'
Any ideas how I can fix this and run the example? Thanks in advance.
Hi there,
I am having trouble in this scenario.
The string I am matching is at the beginning of page 2. When I try to retrieve the lines below it using the method shown in the README, I get results from the beginning of page 1 instead.
I am fairly sure this behavior is not intentional, and I am worried that I am not using the library correctly.
Could you take a look and let me know if I am doing it wrong?
Thanks in advance
Hi guys! o/
I want to know if there is a way to run one 'in_bbox' query followed by another 'in_bbox' query on its result. For example:
first_bbox = pdf_query.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
second_bbox = first_bbox.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
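If :in_bbox matches elements that fit entirely inside the given box (an assumption about its semantics), then applying two boxes in sequence should select the same elements as a single query against the overlap of the two boxes. The sketch below checks that idea on plain coordinate tuples; contains and intersection are hypothetical helpers, not part of pdfquery.

```python
def contains(box, el):
    """True if el's bbox (x0, y0, x1, y1) lies fully inside box."""
    return box[0] <= el[0] and box[1] <= el[1] and el[2] <= box[2] and el[3] <= box[3]


def intersection(a, b):
    """Overlap of two boxes, or None if they do not overlap."""
    box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box if box[0] <= box[2] and box[1] <= box[3] else None


outer = (0, 0, 100, 100)
inner = (50, 50, 200, 200)
elements = [(60, 60, 90, 90), (10, 10, 40, 40), (60, 60, 150, 150)]

# Two filters applied one after the other ...
chained = [el for el in elements if contains(outer, el) and contains(inner, el)]
# ... versus one filter against the overlap of both boxes.
combined = [el for el in elements if contains(intersection(outer, inner), el)]
```

Under that assumption, computing the overlap first and issuing one in_bbox query may be simpler than chaining two queries.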
Hello,
I'm trying to parse some PDF files using pdfquery, and for a couple of PDFs (not all of them) I receive the following error:
File "my_path/my_script.py", line 244, in set_description pdf.load()
File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 373, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 475, in get_tree
for n, page in pages:
File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 596, in <genexpr>
return (self.get_layout(page) for page in self._cached_pages())
File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 591, in get_layout
layout = self._add_annots(layout, page.annots)
File "/my_path/.virtualenvs/dev/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 639, in _add_annots
annot['URI'] = annot['A']['URI']
TypeError: 'PDFObjRef' object has no attribute '__getitem__'
Below is a list of a few PDFs that raise the above error:
http://www.genomecanada.ca/medias/pdf/en/genomesciencescentrebc.pdf
http://www.genomecanada.ca/medias/pdf/fr/genomesciencescentrebc.pdf
http://www.genomecanada.ca/medias/pdf/en/universityvictoria.pdf
http://www.genomecanada.ca/medias/pdf/fr/universityvictoria.pdf
http://www.genomecanada.ca/medias/pdf/fr/centreforappliedgenomicsogi.pdf
Maybe someone will be able to find a fix for it?
Thanks!
File "abc.py", line 2, in <module>
import pdfquery
File "build\bdist.win32\egg\pdfquery\__init__.py", line 1, in <module>
File "build\bdist.win32\egg\pdfquery\pdfquery.py", line 31, in <module>
File "C:\Python27\lib\site-packages\pyquery\__init__.py", line 11, in <module>
from .pyquery import PyQuery
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 9, in <module>
from lxml import etree
ImportError: DLL load failed: %1 is not a valid Win32 application.
solved.
Hello Jcushman,
I read many PDF texts. I don't create pop-up annotations; I only highlight text in yellow. I want to extract this highlighted text (with Python / pdfMiner / pdfquery) in order to index it with Whoosh for my studies. I saw that when text is highlighted, the object created in the PDF file is, for example:
20 0 obj
<<
/C [1 1 0]
/F 4
/M (D:20141107203743+01'00')
/P 7 0 R
/T (pibol)
/AP <<
/N 31 0 R
/NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subj (Surligner)
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
807.385508 162.809979 807.385508]
/CreationDate (D:20141107203743+01'00')
endobj`<<
Unlike a classic pop-up annotation, there is no "/Contents" key here, and that is my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now pdfQuery, but I am not a very experienced Python programmer and can't find a way to extract the line I want.
I have 2 questions :
Question 1 - With pdfQuery I have tried this :
pdf = pdfquery.PDFQuery("c:\\myDocument.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("the line I want")')
print label
that gives me this :
<LTTextLineHorizontal bbox="[53.999, 313.813, 189.746, 324.91]" height="11.098" width="135.747"
word_margin="0.1" x0="53.999" x1="189.746" y0="313.813" y1="324.91"><LTTextBoxHorizontal
bbox="[53.999, 313.813, 189.746, 324.91]" height="11.098" index="10" width="135.747"
x0="53.999" x1="189.746" y0="313.813" y1="324.91">the line I want </LTTextBoxHorizontal>
</LTTextLineHorizontal>
This gives me the coordinates of my text in <LTTextLineHorizontal bbox....
To test these coordinates I wanted to retrieve the text, and only the text, with the ('with_formatter', 'text') instruction explained in your documentation, but how? I don't understand how to do this:
pdf.extract([('titleParagraf', ':in_bbox("53.999, 313.813, 189.746, 324.91")',('with_formatter', 'text')) ]) ??
Question 2: Is it possible with pdfQuery to find text highlighted in yellow and retrieve its coordinates, in order to extract the text with pdf.extract(['aaaa',':inbbox(coordonatesOf theHighlitedText)])?
I hope I am not being too tedious and that my explanations are clear enough. English is not my first language.
Thanks for your patience, and sorry if this is not the right place to ask for help.
Pibol
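For the second question, one possible starting point is the /QuadPoints array of the highlight annotation shown above: it lists (x, y) pairs for the corners of the highlighted region, so a bounding box can be derived from the minimum and maximum values. The sketch below does that arithmetic on the numbers from the object dump; quadpoints_to_bbox is a hypothetical helper, and reading the array out of the PDF is not shown.

```python
def quadpoints_to_bbox(quad_points):
    """Collapse a flat [x, y, x, y, ...] QuadPoints list into (x0, y0, x1, y1)."""
    xs = quad_points[0::2]
    ys = quad_points[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


# Values copied from the /QuadPoints entry in the object dump above.
quad = [114.570002, 816.770274, 162.809979, 816.770274,
        114.570002, 807.385508, 162.809979, 807.385508]

bbox = quadpoints_to_bbox(quad)

# Selector string in the same form as the in_bbox examples in this thread.
selector = ':in_bbox("%s,%s,%s,%s")' % bbox
```

The text line laid out by the parser may extend slightly beyond the highlight rectangle, so the box may need some padding before an in_bbox query matches anything.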
How can I run the following code without getting a WinError exception telling me that I cannot remove the pdf file because it is being used by another process (pdfquery):
import os
import pdfquery
filename = 'C:/Documents and Settings/Administrator/document_idc_1.pdf'
pdf = pdfquery.PDFQuery(filename)
pdf.load(1)
os.remove(filename)
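On Windows a file generally cannot be deleted while a handle to it is still open. A minimal sketch of one way around that is to open the file yourself and close it before deleting. It assumes PDFQuery accepts an already-open file object instead of a path; that is an assumption, so the pdfquery calls are left as comments and the sketch runs with the standard library only, using a temporary placeholder file.

```python
import os
import tempfile

# Create a throwaway file standing in for the real PDF.
fd, path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"%PDF-1.4\n")
os.close(fd)

with open(path, "rb") as fp:
    # Assumption: pdf = pdfquery.PDFQuery(fp); pdf.load(1) would go here,
    # with all querying finished before the block ends.
    data = fp.read()

# The handle is closed once the with-block exits, so removal can proceed.
os.remove(path)
removed = not os.path.exists(path)
```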
Hey, I've been trying to get the sample code working all day but I keep running into errors.
First, I tried
import pdfquery
pdf = pdfquery.PDFQuery("pdfs/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
but it returned
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 237, in call
(not PY3k and isinstance(args[0], basestring) or
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 213, in init
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 223, in _css_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in css_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 208, in selector_to_xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 230, in xpath
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 260, in xpath_function
File "C:\Python27\lib\site-packages\pyquery\cssselectpatch.py", line 196, in xpath_contains_function
def xpath_gt_function(self, xpath, function):
AttributeError: 'XPathExpr' object has no attribute 'add_post_condition'
Finding that this error was with cssselect 0.8.0, I downgraded to 0.7.1,
but then typing in
>>> import pdfquery
pdf = pdfquery.PDFQuery("C:/Users/Adam/Documents/visual studio 2012/Projects/PDFtoPythonData/PDFtoPythonData/pdfs/sample.pdf")
pdf.load()
label = pdf.pq(':contains("Your first name and initial")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
name = pdf.pq(':in_bbox("%s,%s,%s,%s,")' % (left_corner,bottom_corner-30, left_corner+150, bottom_corner)).text()
resulted in
Traceback (most recent call last):
File "<pyshell#16>", line 1, in
name = pdf.pq(':in_bbox("%s,%s,%s,%s,")' % (left_corner,bottom_corner-30, left_corner+150, bottom_corner)).text()
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 241, in call
result = self.class(_args, parent=self, *_kwargs)
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 216, in init
xpath = self._css_to_xpath(selector)
File "C:\Python27\lib\site-packages\pyquery\pyquery.py", line 226, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in css_to_xpath
for selector in selectors)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 188, in
for selector in selectors)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 208, in selector_to_xpath
xpath = self.xpath(tree)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 230, in xpath
return method(parsed_selector)
File "C:\Python27\lib\site-packages\cssselect\xpath.py", line 259, in xpath_function
"The pseudo-class :%s() is unknown" % function.name)
ExpressionError: The pseudo-class :in_bbox() is unknown
Not sure if downgrading inadvertently broke things.
:first :last :even :odd :eq :lt :gt :checked :selected :file
These pseudo-classes do not work when I try to use them like this:
pdf.pq('LTTextLineHorizontal:last')
Hello
I can't get the coordinates for my text "green-color-2-2-2". My script returns "Red green-color-2-2-2" instead.
import pdfquery
import sys
sys.setrecursionlimit(2000)
pdfpath = sys.argv[1]
inputstr = sys.argv[2]
page = int(sys.argv[3])
pdf = pdfquery.PDFQuery(pdfpath)
pdf.load(page)
label = pdf.pq('LTTextLineHorizontal:contains("'+inputstr+'")')[0].layout
print(label)
response
<LTTextLineHorizontal 167.320,142.577,244.579,157.770 u'Red green-color-2-2-2\n'>
How can I get only the text I need?
First, thank you for a great library.
My question is how I can extract text if I know the 'figure' name. For example, I need to extract text from the XObject named pssMO3_1. I can make an XML file with a command like pdf.tree.write(fxml, pretty_print=True, encoding="utf-8"), and this file will contain all the needed data under the figure name="pssMO3_1" tag:
$ cat out.xml
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,2097.638,1629.921" rotate="0">
<figure name="pssMO3_1" bbox="1939.906,240.945,1955.906,445.945">
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,245.945,1952.732,251.281" size="5.336">B</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,251.281,1952.732,253.057" size="1.776">l</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,253.057,1952.732,257.505" size="4.448">a</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,257.505,1952.732,261.505" size="4.000">c</text>
<text font="JDMBFC+ArialUnicodeMS" bbox="1941.468,261.505,1952.732,265.505" size="4.000">k</text>
and so on...
How can I extract the text ('Black ...') using the library API?
Thanks in advance!
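As a fallback that works on the XML dump itself rather than through pdfquery's selector API, the per-character text nodes under the named figure can be joined with the standard library's ElementTree. The sketch below uses a shortened copy of the dump above, with the closing tags added so that it parses; the bbox attributes are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the dump shown above, closed off so it is well-formed.
xml_dump = """
<pages>
<page id="1" bbox="0.000,0.000,2097.638,1629.921" rotate="0">
<figure name="pssMO3_1" bbox="1939.906,240.945,1955.906,445.945">
<text font="JDMBFC+ArialUnicodeMS" size="5.336">B</text>
<text font="JDMBFC+ArialUnicodeMS" size="1.776">l</text>
<text font="JDMBFC+ArialUnicodeMS" size="4.448">a</text>
<text font="JDMBFC+ArialUnicodeMS" size="4.000">c</text>
<text font="JDMBFC+ArialUnicodeMS" size="4.000">k</text>
</figure>
</page>
</pages>
"""

root = ET.fromstring(xml_dump)

# Locate the figure by its name attribute, then join its character nodes in order.
figure = root.find(".//figure[@name='pssMO3_1']")
text = "".join(node.text or "" for node in figure.iter("text"))
```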
From my SO question on the same issue.
I'm using pdfquery to parse multiple files like this one.
I'm trying to write a generalized filter function, building off of the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because this is referenced, I thought I could get around it by supplying a partial function using functools.partial (as seen below):
import pdfquery
import functools

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = 'Path to the file mentioned above'
pdf = load_file(file_with_table)

def elements_in_range(x1_range):
    return in_range(x1_range[0], x1_range[1], float(this.get('x1',0)))

x1_part = functools.partial(elements_in_range, (95,350))
pdf.pq('LTPage[page_index="0"] *').filter(x1_part)
But when I do that I get the following attribute error;
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
597 if len(args) == 1:
--> 598 func_globals(selector)['this'] = this
599 if callback(selector, i, this):
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
28 def func_globals(f):
---> 29 return f.__globals__ if PY3k else f.func_globals
30
AttributeError: 'functools.partial' object has no attribute '__globals__'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-74-d75c2c19f74b> in <module>()
15 x1_part = functools.partial(elements_in_range, (95,350))
16
---> 17 pdf.pq('LTPage[page_index="0"] *').filter(x1_part)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
600 elements.append(this)
601 finally:
--> 602 f_globals = func_globals(selector)
603 if 'this' in f_globals:
604 del f_globals['this']
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
27
28 def func_globals(f):
---> 29 return f.__globals__ if PY3k else f.func_globals
30
31
AttributeError: 'functools.partial' object has no attribute '__globals__'
Is there any way to get around this? Or possibly some other way to write a custom selector for pdfquery that can take arguments?
@jcushman
If this is module level problem how difficult would it be to fix?
Other than that I'm really enjoying pdfquery. Thanks!
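The traceback shows pyquery reading f.__globals__ from the callback, which a functools.partial object does not have but an ordinary function does. One possible workaround, sketched below, is a factory that returns a plain nested function closed over the range. The name make_x1_filter is hypothetical, the inclusive bounds are a guess at what in_range does, and the sketch assumes pyquery passes (index, element) to two-argument callbacks.

```python
import functools


def make_x1_filter(low, high):
    """Return a two-argument callback (index, element) closed over the range."""
    def x1_in_range(index, element):
        return low <= float(element.get('x1', 0)) <= high
    return x1_in_range


x1_filter = make_x1_filter(95, 350)

# A plain function has __globals__; a functools.partial object does not.
closure_ok = hasattr(x1_filter, '__globals__')
partial_ok = hasattr(functools.partial(make_x1_filter, 95), '__globals__')

# Plain dicts stand in for lxml elements here; both expose .get().
samples = [{'x1': '120.5'}, {'x1': '400'}, {}]
matches = [x1_filter(i, el) for i, el in enumerate(samples)]
```

Because the element arrives as an explicit argument, the callback no longer depends on the injected this global at all.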
Hi,
I don't know if this is a bug, but when I try this
`pdf = pdfquery.PDFQuery("d:\Travail\ myPDF.pdf")
document = pdf.load()`
I have this result:
` document = pdf.load()
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 373, in load
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 475, in get_tree
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 596, in <genexpr>
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 591, in get_layout
File "build\bdist.win-amd64\egg\pdfquery\pdfquery.py", line 639, in _add_annots
TypeError: 'PDFObjRef' object has no attribute '__getitem__'`
Bruno
I'm trying to query the links in a document on a court website, but when I look at the XML, the links seem to be gone.
For example, the document I'm working with is here:
http://apps.courts.ky.gov/supreme/casesummaries/May2015.pdf
Not far down that PDF there's a link to:
http://opinions.kycourts.net/sc/2013-SC-000610-MR.pdf
But if I look at the XML (generated with pdf.tree.write('ky.xml', pretty_print=True, encoding='utf-8')), there don't seem to be any links. I've posted the XML here:
https://gist.github.com/mlissner/4cb1eb36e347c2dea00a
Any ideas, or is this something pdfquery doesn't support?
Thanks! It's been interesting playing with this.
cc: @brianwc
Error and stack trace superficially similar to #15
>>> pdf.load()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 372, in get_tree
v = smart_unicode_decode(v)
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 89, in smart_unicode_decode
detected_encoding = chardet.detect(encoded_string)
File "/usr/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
u.feed(aBuf)
File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 64, in feed
aLen = len(aBuf)
TypeError: object of type 'PSLiteral' has no len()
And it's true - PSLiteral doesn't have a length. The following change at line 366 works:
if type(v) == list:
    v = unicode([smart_unicode_decode(item) for item in v])
elif hasattr(v.__class__, '__len__'):
    v = smart_unicode_decode(v)
else:
    v = smart_unicode_decode(v.name)
I don't know if it's actually the correct thing to do though. Maybe PSLiterals should just be dropped on the floor?
I've been trying to load this PDF.
Pages 1 and 2 load fine, whereas pages 3 and 4 give:
File "/home/reb/project/rowreader.py", line 62, in extract_rows
self.pdf.load(page) # page 2 in this case (which is page 3 in pdf)
File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 373, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 476, in get_tree
page = self._xmlize(page)
File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 541, in _xmlize
child = self._xmlize(child, root)
File "/home/reb/projects/venv/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 535, in _xmlize
branch.text = node.get_text()
File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:55347)
File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:24667)
File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src/lxml/lxml.etree.c:24516)
File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32441)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
It's really peculiar: the only visible difference between pages 1-2 and 3-4 is that pages 3-4 contain Unicode stars. Could those be the characters that break the lxml tree load?
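A star such as U+2605 is itself a legal XML character, so the culprit may instead be a stray control character in the extracted text. A minimal sketch of a filter that could be applied before assigning text to an element follows. `strip_xml_illegal` is a hypothetical helper, not part of pdfquery or lxml, and it only removes the C0 control characters that XML 1.0 disallows.

```python
import re

# C0 control characters that XML 1.0 forbids (tab, LF and CR stay allowed).
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_illegal(text):
    """Return text with XML-incompatible control characters removed."""
    return _XML_ILLEGAL.sub("", text)


cleaned = strip_xml_illegal("a\x00b\x0bc\n\u2605")
```

Here `cleaned` keeps the newline and the star and drops only the NULL byte and the vertical tab.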
When I used pdfquery to process a scholarly PDF, I found a Unicode problem at line 305 of pdfquery.py. The variable 'v' is a str, but it stores a Unicode character. For example, v could be '\xfc'. Since 'v' is a str, it is literally '\', 'x', 'f', 'c'.
Line 305,
root.set(k, unicode(v))
would get a 'UnicodeDecodeError'. I suggest to use
root.set(k, v.decode('unicode-escape'))
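For comparison, here is a Python 3 sketch of a defensive decode for the case where the value arrives as raw bytes. The Latin-1 fallback is an assumption about the data, not something this report confirms, and `decode_attr` is a hypothetical helper name.

```python
def decode_attr(raw):
    """Decode bytes as UTF-8 when possible, otherwise fall back to Latin-1."""
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")
```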
Opening a file through my code (at this file/repo) fails with the traceback below.
I don't understand why, because the PDF file seems to be correctly formatted. You can find the file here
Traceback (most recent call last):
File "lltToJson.py", line 187, in <module>
occurences = getFolder()
File "lltToJson.py", line 173, in getFolder
occurences[identifier] += getFile(join(path,f))
File "lltToJson.py", line 111, in getFile
pdf.load()
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 288, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 370, in get_tree
pages = enumerate(self.get_layouts())
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 470, in get_layouts
return (self.get_layout(page) for page in self._cached_pages())
File "/usr/local/lib/python2.7/dist-packages/pdfquery/pdfquery.py", line 497, in _cached_pages
self._pages += list(self._pages_iter)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 518, in get_pages
yield PDFPage(self, pageid, tree)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 257, in __init__
self.resources = resolve1(self.attrs['Resources'])
KeyError: 'Resources'
I have literally no idea why it does not work...
I am using the following code
tax = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
col['tax']['start'], line, col['tax']['end'], line+10)).text()
where I expect to catch text such as '8 G1' or '32 G1'.
It catches the value '32 G1' but not '8 G1'.
In fact, no single-digit value is caught here.
'589 TKTT 1253925356 14APR17 FVVV D CA 4,440 3,425 8 G1 450 YQ
75 YR 3,386 1.01 39 0.00'
The above is the line as it appears in my PDF.
It catches values at that position on the lines before and after, but not on this line or in similar cases.
Please help with this.
Mayuresh A
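One possible explanation, unconfirmed for this PDF: `:in_bbox` only matches elements that fit entirely inside the box, so a text line whose own box extends slightly past the column boundary is skipped. The sketch below shows the difference between containment and overlap with plain `(x0, y0, x1, y1)` tuples. The coordinates are invented for illustration, and the two functions are not pdfquery's implementation; they only mirror the behaviour described for the `:in_bbox` and `:overlaps_bbox` selectors.

```python
def in_bbox(elem, box):
    """True if elem lies entirely inside box (both are x0, y0, x1, y1)."""
    ex0, ey0, ex1, ey1 = elem
    bx0, by0, bx1, by1 = box
    return bx0 <= ex0 and by0 <= ey0 and ex1 <= bx1 and ey1 <= by1


def overlaps_bbox(elem, box):
    """True if elem and box share any area or edge."""
    ex0, ey0, ex1, ey1 = elem
    bx0, by0, bx1, by1 = box
    return ex0 <= bx1 and bx0 <= ex1 and ey0 <= by1 and by0 <= ey1


column = (400, 100, 430, 110)    # made-up query box
inside = (405, 101, 420, 109)    # fits completely
sticking_out = (398, 101, 425, 109)  # starts 2 units left of the column

results = {
    "inside": (in_bbox(inside, column), overlaps_bbox(inside, column)),
    "sticking_out": (in_bbox(sticking_out, column),
                     overlaps_bbox(sticking_out, column)),
}
```

If overlap matching is acceptable for the column in question, trying `:overlaps_bbox` in place of `:in_bbox` may be worth a test.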
Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import pdfquery
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pdfquery/__init__.py", line 1, in <module>
from .pdfquery import PDFQuery
File "pdfquery/pdfquery.py", line 45
_comp_bbox_keys_required = {'x0', 'x1', 'y0', 'y1'}
^
SyntaxError: invalid syntax
I think there's a small mistake in the documentation for caching: FileCache is imported as 'FileCache' but called as 'pdfquery.FileCache'. This works for me:
from pdfquery.cache import FileCache
pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf", parse_tree_cacher=FileCache("/tmp/"))
(It also seems to be formatted differently from the other examples.)
import pdfquery
import sys
pdf = pdfquery.PDFQuery(sys.argv[1])
pdf.load()
Traceback (most recent call last):
File "bin/parse_pdf.py", line 6, in <module>
pdf.load()
File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 385, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
for n, page in pages:
File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
return (self.get_layout(page) for page in self._cached_pages())
File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 603, in get_layout
layout = self._add_annots(layout, page.annots)
File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 647, in _add_annots
annot = self._set_hwxy_attrs(annot)
File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 665, in _set_hwxy_attrs
attr['x0'] = bbox[0]
TypeError: 'PDFObjRef' object does not support indexing
Using Python version 3.5 (w/ Anaconda 2.4.0); sorry I don't have much more to add than a bug report. I've just been looking for something in Python 3.x to convert a PDF into text and preserving its layout (a la pdftotext from poppler)...so pdfquery is probably beyond my plaintext needs. But figured you'd be interested in knowing.
Reproducible code:
curl \
https://static.googleusercontent.com/media/www.google.com/en//selfdrivingcar/files/reports/report-0515.pdf \
-o g.pdf
import pdfquery
pdf = pdfquery.PDFQuery("g.pdf")
pdf.load()
AttributeError Traceback (most recent call last)
<ipython-input-3-4357470f507b> in <module>()
----> 1 pdf.load()
/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in load(self, *page_numbers)
381 [<LTPage>, <LTPage>]
382 """
--> 383 self.tree = self.get_tree(*_flatten(page_numbers))
384 self.pq = self.get_pyquery(self.tree)
385
/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in get_tree(self, *page_numbers)
483 else:
484 pages = enumerate(self.get_layouts())
--> 485 for n, page in pages:
486 page = self._xmlize(page)
487 page.set('page_index', obj_to_string(n))
/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in <genexpr>(.0)
604 def get_layouts(self):
605 """ Get list of PDFMiner Layout objects for each page. """
--> 606 return (self.get_layout(page) for page in self._cached_pages())
607
608 def _cached_pages(self, target_page=-1):
/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in get_layout(self, page)
599 self.interpreter.process_page(page)
600 layout = self.device.get_result()
--> 601 layout = self._add_annots(layout, page.annots)
602 return layout
603
/Users/dtown/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/pdfquery/pdfquery.py in _add_annots(self, layout, annots)
642 annots = annots.resolve()
643 for annot in annots:
--> 644 annot = annot.resolve()
645 if annot.get('Rect') is not None:
646 annot['bbox'] = annot.pop('Rect') # Rename key
AttributeError: 'dict' object has no attribute 'resolve'
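The traceback suggests that annotations can arrive either as indirect references or as already-resolved dicts. A minimal sketch of a tolerant loop follows. `FakeRef` is a hypothetical stand-in for a reference object with a `resolve()` method, and `resolve_all` and `iter_annots` are illustrative names, not pdfquery functions. pdfminer's own `resolve1` helper plays a similar role for real references.

```python
class FakeRef:
    """Hypothetical stand-in for an indirect reference with resolve()."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def resolve_all(obj, max_depth=10):
    """Follow resolve() until a plain (non-reference) object is reached."""
    for _ in range(max_depth):
        if not hasattr(obj, "resolve"):
            return obj
        obj = obj.resolve()
    raise ValueError("reference chain too deep")


def iter_annots(annots):
    """Yield each annotation as a plain object, inline or referenced."""
    # The list itself may be a reference, and so may each entry.
    for annot in resolve_all(annots) or []:
        yield resolve_all(annot)
```

Because `resolve_all` checks for the method before calling it, a plain dict passes through unchanged instead of raising `AttributeError`.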
I can't get the example from the README working.
This is what I have done:
$ sudo easy_install pip
$ sudo pip install pdfquery
$ wget https://raw.github.com/jcushman/pdfquery/master/examples/sample.pdf
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("sample.pdf")
>>> pdf.load()
>>> label = pdf.pq(':contains("Your first name and initial")')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
result = self.__class__(*args, parent=self, **kwargs)
File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
for tag in elements]
File "lxml.etree.pyx", line 1444, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:41726)
File "xpath.pxi", line 321, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:117867)
File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:117044)
File "xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116913)
lxml.etree.XPathEvalError: Invalid expression
I'm using Mac OS X 10.7.4. Output from pip freeze is at https://gist.github.com/3082390 if that can help in any way. (I'm not a Python guy.)
Any chance of saving a pdf in the pdf format instead of pdfxml? I would really prefer to use pdfquery in pdfparanoia rather than manual text manipulation of pdf streams.
In pdfquery.py, in smart_unicode_decode, is this:
# detect encoding
detected_encoding = chardet.detect(encoded_string)
With chardet 2.3.0, detected_encoding is {'confidence': 0.0, 'encoding': None}
With chardet 3.0.1 (the newest at the time of writing), detected_encoding is None
So it crashes on the next line, where it does detected_encoding['encoding'].
Presumably, the fix is as simple as changing:
encoding=detected_encoding['encoding'] or 'utf8',
to
encoding=detected_encoding['encoding'] if (detected_encoding and detected_encoding['encoding']) else 'utf8',
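The same guard can be pulled into a small helper so that both return shapes are handled in one place. This is a sketch only: `pick_encoding` is a hypothetical name, and the two result shapes are the ones reported above rather than anything checked against chardet's documentation.

```python
def pick_encoding(detected, default="utf8"):
    """Return the detected encoding, or default when detection gave nothing."""
    if not detected:
        # Covers a None result as well as an empty dict.
        return default
    return detected.get("encoding") or default
```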
Hi, I am a newbie with Python and pdfquery. I am writing a Python program to extract information from PDF files and insert it into a Word document. I am having trouble with one particular object: "minor spill". Specifically, I am trying to scrape the content of the paragraph underneath "6.3 Methods and materials for containment and cleaning up" on page 2 of the PDF file. The content I want is "Contain spillage, and then collect with an electrically protected vacuum cleaner or by wet-brushing and place in container for disposal according to local regulations (see section 13). Keep in suitable, closed containers for disposal." The problem is that for this particular PDF file, my code also extracts "Product This combustible material may be burned in a chemical incinerator equipped with an afterburner and scrubber. Offer surplus and non-recyclable solutions to a licensed disposal company." on p. 5. Because I want to work with many PDF files that might have the "6.3..." content on different pages, I figure that if I can pass the pageid to the extraction, it should be fine.
My question is: is there a way to get the pageid of an object (for example, "minor_spill" in my code)?
My code is below and I also attach the pdf file:
https://pastebin.com/rwseBSZV
Thank you very much!
PDF file:
932-66-1.pdf
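One way to find the page an element belongs to is to walk up its ancestors until an `LTPage` is reached and then read that element's attributes. The sketch below uses only the standard library and a made-up XML snippet in the shape of pdfquery's tree. The sample content and the `page_of` helper are assumptions for illustration. On the real lxml tree, `element.iterancestors('LTPage')` would be the more direct route.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of a pdfquery tree (not from the attached PDF).
SAMPLE = """
<pdfxml>
  <LTPage page_index="1" pageid="2">
    <LTTextLineHorizontal>6.3 Methods and materials</LTTextLineHorizontal>
  </LTPage>
  <LTPage page_index="4" pageid="5">
    <LTTextBoxHorizontal>
      <LTTextLineHorizontal>Product</LTTextLineHorizontal>
    </LTTextBoxHorizontal>
  </LTPage>
</pdfxml>
"""


def page_of(root, element, page_tag="LTPage"):
    """Return the enclosing page element of element, or None if there is none."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None and node.tag != page_tag:
        node = parents.get(node)
    return node


root = ET.fromstring(SAMPLE)
lines = root.findall(".//LTTextLineHorizontal")
page_ids = [page_of(root, line).get("pageid") for line in lines]
```

With the page known, matches from unwanted pages can be filtered out after extraction.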
Running the sample code, I'm getting "SelectorSyntaxError: Expected string or ident" from cssselector/parser.py.
Any clue what this could be?
Here's the deal. I did this on a PDF:
pdf.extract([
('with_parent', 'LTPage[pageid="1"]'),
('name', 'LTTextLineHorizontal:contains("24x7 claims assistance")')
])
I got a [<LTTextLineHorizontal>]. Let's say I assign it to a variable result. Then,
In [61]: result
Out[61]: [<LTTextLineHorizontal>]
In [62]: result[0]
Out[62]: <Element LTTextLineHorizontal at 0x10dc621b0>
In [63]: result[0][0]
Out[63]: <Element LTTextBoxHorizontal at 0x10dc62158>
In the same PDF, I do this:
pdf.extract([
('with_parent', 'LTPage[pageid="1"]'),
('name', 'LTTextLineHorizontal:contains("9810510983")')
])
Again I got a [<LTTextLineHorizontal>]. Then,
In [70]: result
Out[61]: [<LTTextLineHorizontal>]
In [71]: result[0]
Out[62]: <Element LTTextLineHorizontal at 0x10dc621b0>
In [72]: result[0][0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-64-223da891ff7c> in <module>()
----> 1 result[0][0]
lxml.etree.pyx in lxml.etree._Element.__getitem__ (src/lxml/lxml.etree.c:47744)()
IndexError: list index out of range
I went through the code. I see that you're using _clean_text() in pdfquery.py to keep the text value in the leaf node and erase the value from its parents.
I'm sorry, but I couldn't find enough time to debug it fully. Does anyone know why this would happen?
When trying to load a PDF, I get the following error
TypeError: 'PDFObjRef' object is not iterable
The error happens at pdfquery/pdfquery.py line 631
def _add_annots(self, layout, annots):
"""Adds annotations to the layout object
"""
if annots: # and not isinstance(annots, PDFObjRef):
for annot in annots:
annot = annot.resolve()
if annot.get('Rect') is not None:
annot['bbox'] = annot.pop('Rect') # Rename key
annot = self._set_hwxy_attrs(annot)
try:
annot['URI'] = annot['A']['URI']
except KeyError:
pass
for k, v in annot.iteritems():
if not isinstance(v, basestring):
annot[k] = unicode_decode_object(v)
elem = parser.makeelement('Annot', annot)
layout.add(elem)
return layout
The error goes away when the second check, which is commented out in the code above, is added.
L123 (master): settings = nums[i+1].resolve()
I have a script which uses pdfquery to grab annotated text. This script works for some PDFs, but not others. For the PDF where it doesn't work, this line is called; for the PDF where it does work, it is not.
Tried a bit of debugging, but don't understand this code at all. It happened in version 0.2.3 and I upgraded to see if it would be different, but alas no. Any tips on how to debug this would be great, thanks.
NB: Replacing this line with settings = nums[i+1] stopped the errors and the script worked as expected.
Processing a PDF with annotations that have a colon in their key value gives an exception:
Traceback (most recent call last):
File "test_ocr.py", line 633, in test_petition
analyze = analyze_bankruptcy_petition(pdf_txt = pdf_txt, pdf_fp = file)
File "program.py", line 255, in analyze_bankruptcy_petition
pdfq.load(*pages_to_analyze)
File "..\libs\pdfquery\pdfquery.py", line 385, in load
self.tree = self.get_tree(*_flatten(page_numbers))
File "..\libs\pdfquery\pdfquery.py", line 484, in get_tree
_flatten(page_numbers)]
File "..\libs\pdfquery\pdfquery.py", line 603, in get_layout
layout = self._add_annots(layout, page.annots)
File "..\libs\pdfquery\pdfquery.py", line 663, in _add_annots
elem = parser.makeelement('Annot', annot)
File "parser.pxi", line 878, in lxml.etree._BaseParser.makeelement (src/lxml/lxml.etree.c:74798)
File "apihelpers.pxi", line 156, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12231)
File "apihelpers.pxi", line 144, in lxml.etree._makeElement (src/lxml/lxml.etree.c:12106)
File "apihelpers.pxi", line 298, in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:13603)
File "apihelpers.pxi", line 1554, in lxml.etree._attributeValidOrRaise (src/lxml/lxml.etree.c:24197)
ValueError: Invalid attribute name u'AAPL:AKExtras'
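A colon in an XML name is reserved for namespace prefixes, which is presumably why lxml rejects `AAPL:AKExtras` as an attribute name. One possible workaround is to sanitise the keys before building the element. The sketch below is an assumption about how that could look; `safe_attr_name` and `safe_attrs` are hypothetical helpers, and replacing characters with underscores could in principle make two different keys collide.

```python
import re


def safe_attr_name(key):
    """Map an arbitrary annotation key to a conservative XML attribute name."""
    # Keep ASCII letters, digits, underscore, dot and hyphen; replace the rest.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", key)
    # XML names may not start with a digit, dot or hyphen.
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


def safe_attrs(annot):
    """Return a copy of annot with every key sanitised."""
    return {safe_attr_name(k): v for k, v in annot.items()}
```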
LTPage[page_index=1] should be LTPage[page_index="1"], in the few places it is mentioned
thank you.
How can I find out how many pages a document has?
pages = pdf.doc.catalog['Pages'] returns PDFObjRef:2
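The `PDFObjRef` is an indirect reference to the root of the page tree. In the PDF format, that root dictionary carries a `Count` entry with the number of pages, so resolving the reference and reading `Count` should give the answer. The sketch below demonstrates the idea with a stand-in: `FakeRef` and the sample catalog are hypothetical, and `page_count` is not a pdfquery function.

```python
class FakeRef:
    """Hypothetical stand-in for an indirect reference with resolve()."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def page_count(catalog):
    """Read the page count from a document catalog dictionary."""
    pages = catalog["Pages"]
    if hasattr(pages, "resolve"):
        # Follow the indirect reference to the page tree root.
        pages = pages.resolve()
    return pages["Count"]


sample_catalog = {"Type": "Catalog",
                  "Pages": FakeRef({"Type": "Pages", "Count": 12})}
total = page_count(sample_catalog)
```

On a real document the equivalent would presumably be `pdf.doc.catalog['Pages'].resolve()['Count']`, though that line is untested here. After `pdf.load()`, counting the `LTPage` elements in the tree is another option.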
http://www.atmel.com/Images/Atmel-7766-8-bit-AVR-ATmega16U4-32U4_Datasheet.pdf
During these 700 seconds, only one core is working on 100%, would it be possible to cut the work in 8 in my case by letting the rest of the cores in on the fun?
Is this the expected load time? This is using an Intel Core i7-7700K (4 cores with HT => 8 threads), 16GB RAM, macOS Sierra, Python 3.6.
However.. using FileCache brings down subsequent runs to 0.8 seconds load time.
Perhaps you could provide some expected performance metrics in the README to quantify "runs very slowly"?
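import time
import pdfquery
from pdfquery.cache import FileCache
# args.datasheet is the path to the PDF (argument parsing not shown)
pdf = pdfquery.PDFQuery(args.datasheet, parse_tree_cacher=FileCache("/tmp/"))
t0 = time.time()
pdf.load()
t1 = time.time()
print("Loaded in {} seconds".format(t1-t0))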