
Comments (5)

jcushman commented on July 29, 2024

Hmm, sorry to hear that. I can confirm that the example you gave works for
me (on Windows 7), so this is some kind of dependency problem.

Looking at your pip freeze, the only difference I see is that I'm
running lxml==2.3 instead of lxml==2.3.4. So you could try downgrading to
lxml 2.3 and see if that fixes it -- that would be good to know, and would
get things moving for you.

I'll see if I can reproduce your bug when I'm in front of my Mac. In the
meantime, if downgrading lxml doesn't work you might be able to get what
you need by accessing the lxml etree directly instead of going through
pyquery. Once you call pdf.load(), pdf.tree should be an etree object that
you can search as described here:
http://lxml.de/tutorial.html#elementpath. You can also dump its
contents to a text file with pdf.tree.write().
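A minimal sketch of that workaround, searching the element tree directly rather than through pyquery. It assumes a small hand-built stand-in tree (the element names and text are illustrative, not read from sample.pdf) and uses the standard library's xml.etree.ElementTree, whose findall() and write() calls lxml's etree also provides. With pdfquery you would work on pdf.tree instead of the `tree` built here.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for pdf.tree: a tiny hand-built document (hypothetical content).
root = ET.fromstring(
    "<pdfxml>"
    "<LTPage>"
    "<LTTextLineHorizontal>Your first name and initial</LTTextLineHorizontal>"
    "<LTTextLineHorizontal>Last name</LTTextLineHorizontal>"
    "</LTPage>"
    "</pdfxml>"
)
tree = ET.ElementTree(root)

# ElementPath search: every LTTextLineHorizontal element at any depth.
lines = tree.findall(".//LTTextLineHorizontal")
texts = [line.text for line in lines]

# Dump the tree for inspection; a file name can be passed instead of a buffer.
buffer = io.BytesIO()
tree.write(buffer)
dumped = buffer.getvalue().decode("utf-8")
```

Passing a file name such as "dump.xml" to write() (the name is only an example) produces a text file you can open in an editor.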

On Tue, Jul 10, 2012 at 5:56 AM, Patrik Ragnarsson wrote:

I can't get the example from the README working.

This is what I have done:

    $ sudo easy_install pip
    $ sudo pip install pdfquery
    $ wget https://raw.github.com/jcushman/pdfquery/master/examples/sample.pdf
    $ python
    Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pdfquery
    >>> pdf = pdfquery.PDFQuery("sample.pdf")
    >>> pdf.load()
    >>> label = pdf.pq(':contains("Your first name and initial")')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
        result = self.__class__(*args, parent=self, **kwargs)
      File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
        for tag in elements]
      File "lxml.etree.pyx", line 1444, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:41726)
      File "xpath.pxi", line 321, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:117867)
      File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:117044)
      File "xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116913)
    lxml.etree.XPathEvalError: Invalid expression

I'm using Mac OS X 10.7.4. Output from pip freeze at
https://gist.github.com/3082390 if that can help in any way. (I'm not a
Python guy.)



from pdfquery.

dentarg commented on July 29, 2024

Uninstalling lxml just to be sure.

$ sudo pip uninstall lxml
Uninstalling lxml:
    /Library/Python/2.7/site-packages/lxml
    /Library/Python/2.7/site-packages/lxml-2.3.4-py2.7.egg-info
Proceed (y/n)? y
    Successfully uninstalled lxml
$ sudo pip install lxml==2.3
Downloading/unpacking lxml==2.3
    Downloading lxml-2.3.tar.gz (3.2Mb): 3.2Mb downloaded
    Running setup.py egg_info for package lxml
        Building lxml version 2.3.
        Building without Cython.
        Using build configuration of libxslt 1.1.24

        warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
    Running setup.py install for lxml
        Building lxml version 2.3.
        Building without Cython.
        Using build configuration of libxslt 1.1.24
        building 'lxml.etree' extension
        llvm-gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/usr/include/libxml2 -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
        llvm-gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.7-intel-2.7/lxml/etree.so
        building 'lxml.objectify' extension
        llvm-gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/usr/include/libxml2 -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.objectify.c -o build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.objectify.o -w -flat_namespace
        llvm-gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/src/lxml/lxml.objectify.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.7-intel-2.7/lxml/objectify.so

Successfully installed lxml
Cleaning up...

Trying again:

$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfquery
>>> pdf = pdfquery.PDFQuery("sample.pdf")
>>> pdf.load()
>>> label = pdf.pq(':contains("Your first name and initial")')
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 247, in __call__
        result = self.__class__(*args, parent=self, **kwargs)
    File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223, in __init__
        for tag in elements]
    File "lxml.etree.pyx", line 1459, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:40530)
    File "xpath.pxi", line 324, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:113864)
    File "xpath.pxi", line 242, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:113063)
    File "xpath.pxi", line 228, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:112935)
lxml.etree.XPathEvalError: Invalid expression

Output from pdf.tree.write() here: https://gist.github.com/3098670


jcushman commented on July 29, 2024

I still haven't had time to try this on a Mac, but if you feel like digging
deeper, what's happening here is pyquery is translating the jquery selector
into an xpath, and lxml is choking on it, so it's gotta be one of those two
libraries. Here's what it looks like when I run the same search directly
with lxml:

pdf.tree.xpath("//*[contains(text(), 'Your first name and initial')]")
[<Element LTTextLineHorizontal at 0x2c63b40>]

If that doesn't work for you, it's definitely a problem with your lxml
library (since it looks like your pdf.tree is being generated properly). If
it does, the next step would be to throw in some debugging statements near
'File "/Library/Python/2.7/site-packages/pyquery/pyquery.py", line 223' and
try to figure out what xpath is being generated. (It won't necessarily be
the same as mine -- not sure how pyquery translates that stuff exactly --
but it'll either be valid xpath or not.)
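As an engine-independent cross-check of the search above, here is a rough plain-Python counterpart to the contains(text(), ...) query. It is only a sketch: the helper name `elements_containing` and the stand-in tree are hypothetical, it uses the standard library's xml.etree.ElementTree rather than lxml, and it looks only at each element's leading text, so it approximates the XPath rather than reproducing it exactly.

```python
import xml.etree.ElementTree as ET


def elements_containing(root, needle):
    """Return elements whose leading text contains `needle` (hypothetical helper)."""
    return [el for el in root.iter() if el.text and needle in el.text]


# Stand-in for pdf.tree (hypothetical content, not read from sample.pdf).
root = ET.fromstring(
    "<pdfxml>"
    "<LTPage>"
    "<LTTextLineHorizontal>Your first name and initial</LTTextLineHorizontal>"
    "<LTTextLineHorizontal>Last name</LTTextLineHorizontal>"
    "</LTPage>"
    "</pdfxml>"
)

matches = elements_containing(root, "Your first name and initial")
tags = [el.tag for el in matches]
```

If a walk like this finds the element but the pyquery selector still raises, the tree is probably fine and the generated XPath is the more likely suspect.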

Sorry for the hassle -- the whole point of this library is to put a
friendly shine on a messy problem, so I'd love to get it sorted out.


dentarg commented on July 29, 2024

Yeah, parsing PDFs is a pain. I'm not sure whether I'm going to use pdfquery in my project; for now I'm using pdftohtml -xml from poppler. I'd still like to help you solve this problem, though my responses may be a bit slow.

Your xpath query works fine for me:

>>> pdf.tree.xpath("//*[contains(text(), 'Your first name and initial')]")
[<Element LTTextLineHorizontal at 0x10d901e30>]

I added print xpath on line 222 in /Library/Python/2.7/site-packages/pyquery/pyquery.py and tried the example again:

>>> label = pdf.pq(':contains("Your first name and initial")')
descendant-or-self::*[contains(text(), '[<STRING 'Your first name and initial' at 10>]')]
...

I'm not sure what to make of it. Does it look correct?


jcushman commented on July 29, 2024

Oh, yeah -- no, it should be 'Your first name and initial' instead of '[<STRING 'Your first name and initial' at 10>]'.
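To illustrate why that expression is invalid, here is a small sketch of the kind of mistake that would produce it. I don't know what pyquery 1.2.1 actually does internally, so the `StringToken` class, the template, and both functions are hypothetical stand-ins chosen only to reproduce the printed output: formatting a list of token objects instead of the token's string value leaks the list's repr, with its own quote characters, into the XPath string literal.

```python
class StringToken(object):
    """Hypothetical stand-in for a parsed selector string token."""

    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

    def __repr__(self):
        return "<STRING %r at %d>" % (self.value, self.pos)


TEMPLATE = "descendant-or-self::*[contains(text(), '%s')]"


def buggy_xpath(tokens):
    # Formats the token list itself, so its repr leaks into the expression.
    return TEMPLATE % (tokens,)


def fixed_xpath(tokens):
    # Formats the token's string value, which is what the selector meant.
    return TEMPLATE % (tokens[0].value,)


tokens = [StringToken("Your first name and initial", 10)]
broken = buggy_xpath(tokens)
working = fixed_xpath(tokens)
```

In the broken string, the inner single quotes end the XPath string literal early, which is consistent with lxml rejecting it as an invalid expression.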

I narrowed this down to a regression in pyquery 1.2.1, which I reported here:

https://bitbucket.org/olauzanne/pyquery/issue/52/problem-with-contains-selector-121

Until that's fixed, you can either

(1) Comment out the xpath_contains_function function from /Library/Python/2.7/site-packages/pyquery/cssselectpatch.py, or
(2) Downgrade pyquery to 1.1.1 (this might make sense until they sort out any related regressions)

Thanks for the great debugging info -- it was a huge help.

--Jack
