Comments (5)
Hmm, sorry to hear that. I can confirm that the example you gave works for
me (on Windows 7), so this is some kind of dependency problem.
Looking at your pip freeze, the only difference I see is I'm
running lxml==2.3 instead of lxml==2.3.4 . So you could try downgrading to
lxml 2.3 and see if that fixes it -- that would be good to know, and would
get things moving for you.
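For anyone comparing two environments the same way, here is a rough sketch of diffing two `pip freeze` listings. Only the two lxml versions come from this thread; the other pin is a made-up placeholder, and `parse_freeze` and `diff_freezes` are hypothetical helper names, not part of pip.

```python
def parse_freeze(text):
    """Map package name -> pinned version from `pip freeze` style text."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a name==version pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def diff_freezes(mine, theirs):
    """Return {package: (my_version, their_version)} for every mismatch."""
    a, b = parse_freeze(mine), parse_freeze(theirs)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Illustrative listings: "examplepkg" is a placeholder, not a real dependency.
MINE = "lxml==2.3\nexamplepkg==1.0\n"
THEIRS = "lxml==2.3.4\nexamplepkg==1.0\n"

diff = diff_freezes(MINE, THEIRS)
print(diff)
```

A package that appears in only one listing shows up with `None` on the other side, which is usually the first thing worth checking.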
I'll see if I can reproduce your bug when I'm in front of my Mac. In the
meantime, if downgrading lxml doesn't work you might be able to get what
you need by accessing the lxml etree directly instead of going through
pyquery. Once you call pdf.load(), pdf.tree should be an etree object that
you can search as described here:
http://lxml.de/tutorial.html#elementpath. You can also dump its
contents to a text file with pdf.tree.write().
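A minimal sketch of that fallback, under some loud assumptions: it uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree` (lxml's etree is designed to be largely compatible with that API), and the tiny XML document below is a hypothetical miniature of what `pdf.tree` might hold, not real pdfquery output. `find_by_text` is an illustrative helper name.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Hypothetical miniature of a parsed PDF tree; real trees are much larger.
SAMPLE_XML = (
    "<pdfxml>"
    "<LTPage page_index='0'>"
    "<LTTextLineHorizontal>Your first name and initial</LTTextLineHorizontal>"
    "<LTTextLineHorizontal>Last name</LTTextLineHorizontal>"
    "</LTPage>"
    "</pdfxml>"
)

tree = ET.ElementTree(ET.fromstring(SAMPLE_XML))  # plays the role of pdf.tree


def find_by_text(tree, needle):
    """Return every element whose own text contains `needle`."""
    return [el for el in tree.iter() if el.text and needle in el.text]


matches = find_by_text(tree, "Your first name and initial")
print([el.tag for el in matches])

# Dump the whole tree for inspection; with a real tree you would pass a
# filename instead of an in-memory buffer.
buffer = io.BytesIO()
tree.write(buffer)
dumped = buffer.getvalue()
```

With a real `pdf.tree`, the same `iter()` loop avoids pyquery entirely, which helps isolate whether the selector translation or the tree itself is at fault.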
On Tue, Jul 10, 2012 at 5:56 AM, Patrik Ragnarsson wrote:
I can't get the example from the README working.
This is what I have done:
$ sudo easy_install pip
$ sudo pip install pdfquery
$ wget https://raw.github.com/jcushman/pdfquery/master/examples/sample.pdf
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("sample.pdf")
>>> pdf.load()
>>> label = pdf.pq(':contains("Your first name and initial")')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
result = self.__class__(*args, parent=self, **kwargs)
File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
for tag in elements]
File "lxml.etree.pyx", line 1444, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:41726)
File "xpath.pxi", line 321, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:117867)
File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:117044)
File "xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116913)
lxml.etree.XPathEvalError: Invalid expression
I'm using Mac OS X 10.7.4. Output from pip freeze is at https://gist.github.com/3082390 if that can help in any way. (I'm not a Python guy.)
Uninstalling lxml just to be sure.
$ sudo pip uninstall lxml
Uninstalling lxml:
/Library/Python/2.7/site-packages/lxml
/Library/Python/2.7/site-packages/lxml-2.3.4-py2.7.egg-info
Proceed (y/n)? y
Successfully uninstalled lxml
$ sudo pip install lxml==2.3
Downloading/unpacking lxml==2.3
Downloading lxml-2.3.tar.gz (3.2Mb): 3.2Mb downloaded
Running setup.py egg_info for package lxml
Building lxml version 2.3.
Building without Cython.
Using build configuration of libxslt 1.1.24
warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
Running setup.py install for lxml
Building lxml version 2.3.
Building without Cython.
Using build configuration of libxslt 1.1.24
building 'lxml.etree' extension
llvm-gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/usr/include/libxml2 -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
llvm-gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.7-intel-2.7/lxml/etree.so
building 'lxml.objectify' extension
llvm-gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/usr/include/libxml2 -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.objectify.c -o build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.objectify.o -w -flat_namespace
llvm-gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.objectify.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.7-intel-2.7/lxml/objectify.so
Successfully installed lxml
Cleaning up...
Trying again:
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("sample.pdf")
>>> pdf.load()
>>> label = pdf.pq(':contains("Your first name and initial")')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
result = self.__class__(*args, parent=self, **kwargs)
File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
for tag in elements]
File "lxml.etree.pyx", line 1459, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:40530)
File "xpath.pxi", line 324, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:113864)
File "xpath.pxi", line 242, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:113063)
File "xpath.pxi", line 228, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:112935)
lxml.etree.XPathEvalError: Invalid expression
Output from pdf.tree.write() is here: https://gist.github.com/3098670
I still haven't had time to try this on a Mac, but if you feel like digging
deeper, what's happening here is pyquery is translating the jquery selector
into an xpath, and lxml is choking on it, so it's gotta be one of those two
libraries. Here's what it looks like when I run the same search directly
with lxml:
>>> pdf.tree.xpath("//*[contains(text(), 'Your first name and initial')]")
[<Element LTTextLineHorizontal at 0x2c63b40>]
If that doesn't work for you, it's definitely a problem with your lxml
library (since it looks like your pdf.tree is being generated properly). If
it does, the next step would be to throw in some debugging statements near
'File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223' and
try to figure out what xpath is being generated. (It won't necessarily be
the same as mine -- not sure how pyquery translates that stuff exactly --
but it'll either be valid xpath or not.)
Sorry for the hassle -- the whole point of this library is to put a
friendly shine on a messy problem, so I'd love to get it sorted out.
Yeah, parsing PDFs is a pain. I'm not sure whether I'll use pdfquery in my project; for now I'm using pdftohtml -xml from poppler. I'd still like to help you solve this problem, though my responses may be a bit slow.
Your xpath query works fine for me:
>>> pdf.tree.xpath("//*[contains(text(), 'Your first name and initial')]")
[<Element LTTextLineHorizontal at 0x10d901e30>]
I added print xpath on line 222 in /Library/Python/2.7/site-packages/pyquery/pyquery.py and tried the example again:
>>> label = pdf.pq(':contains("Your first name and initial")')
descendant-or-self::*[contains(text(), '[<STRING 'Your first name and initial' at 10>]')]
...
I'm not sure what to make of it. Does that look correct?
Oh, yeah -- no, it should be 'Your first name and initial' instead of '[<STRING 'Your first name and initial' at 10>]'.
I narrowed this down to a regression in pyquery 1.2.1, which I reported here:
https://bitbucket.org/olauzanne/pyquery/issue/52/problem-with-contains-selector-121
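To make the failure mode concrete, here is a toy reconstruction of how a token's repr can leak into a generated expression. It is not pyquery's or cssselect's actual code: `StringToken`, `contains_xpath_buggy`, and `contains_xpath_fixed` are hypothetical names, and only the two resulting strings are taken from this thread.

```python
class StringToken(object):
    """Hypothetical stand-in for a parsed selector token."""

    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

    def __repr__(self):
        return "<STRING %r at %d>" % (self.value, self.pos)


TEMPLATE = "descendant-or-self::*[contains(text(), '%s')]"


def contains_xpath_buggy(arguments):
    # Formats the whole token list, so its repr ends up inside the XPath.
    return TEMPLATE % (arguments,)


def contains_xpath_fixed(arguments):
    # Uses the string value carried by the first token instead.
    return TEMPLATE % (arguments[0].value,)


tokens = [StringToken("Your first name and initial", 10)]
buggy = contains_xpath_buggy(tokens)
fixed = contains_xpath_fixed(tokens)
print(buggy)
print(fixed)
```

The buggy string nests unescaped single quotes inside a single-quoted XPath literal, which is consistent with lxml rejecting it as an invalid expression.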
Until that's fixed, you can either
(1) Comment out the xpath_contains_function function from /Library/Python/2.7/site-packages/pyquery/cssselectpatch.py, or
(2) Downgrade pyquery to 1.1.1 (this might make sense until they sort out any related regressions)
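Before applying either workaround it can help to confirm which pyquery is actually installed. A sketch, assuming Python 3.8+ for `importlib.metadata` (the sessions in this thread ran Python 2.7, where this module is not available); `AFFECTED_VERSIONS` lists only the one release named above, and both helper names are hypothetical.

```python
from importlib import metadata

# Only the release reported in this thread; other releases are not assessed.
AFFECTED_VERSIONS = {"1.2.1"}


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_workaround(version):
    """True if `version` is the release with the :contains regression."""
    return version in AFFECTED_VERSIONS


version = installed_version("pyquery")
if version is None:
    print("pyquery is not installed")
elif needs_workaround(version):
    print("pyquery %s is affected; consider pinning pyquery==1.1.1" % version)
else:
    print("pyquery %s is not the release reported in this thread" % version)
```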
Thanks for the great debugging info -- it was a huge help.
--Jack