
Comments (17)

xuewei4d avatar xuewei4d commented on July 29, 2024

Another problem is Line 358 in pdfquery.py

branch.text = node.get_text()

I suggest removing illegal XML characters here.
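
A minimal sketch of what stripping illegal XML characters could look like, assuming the text is already a Unicode string. The helper name strip_illegal_xml_chars is hypothetical and not part of pdfquery; the character ranges follow the XML 1.0 definition of a legal character.

```python
import re

# Code points that XML 1.0 forbids: C0 controls other than tab, LF and CR,
# the surrogate range, and the noncharacters U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text):
    """Return text with the characters that XML 1.0 forbids removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Hypothetical input: NUL and vertical tab are dropped, the newline is kept.
cleaned = strip_illegal_xml_chars("Inspection\x00 Report\x0b v2.2\n")
print(repr(cleaned))
```

Dropping the characters silently is only one option; replacing them with a visible marker would make the loss easier to spot.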

from pdfquery.

russkel avatar russkel commented on July 29, 2024

What's the go with this issue? I just ran into it as well.

Changing the line to unicode-escape leads to issues such as:

Traceback (most recent call last):
  File "/Users/russ/PycharmProjects/solar_inspection_report/si_rename.py", line 11, in <module>
    pdf.load(2)  # load only the 2nd page to save CPU time
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 230, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 307, in get_tree
    root.set(k, v.decode('unicode-escape'))
  File "lxml.etree.pyx", line 746, in lxml.etree._Element.set (src/lxml/lxml.etree.c:42970)
  File "apihelpers.pxi", line 547, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:19025)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters


xuewei4d avatar xuewei4d commented on July 29, 2024

@russkel I suggest debugging around line 305, for example by printing the values out. Maybe the key k in your PDF file also contains illegal characters.


jcushman avatar jcushman commented on July 29, 2024

Unicode issues are tricky! If you can point me to a PDF that causes the issue I may be able to debug. Even better if you can give me a patch/pull request that fixes the issue with the PDF you point me to ...

Thanks,
Jack


russkel avatar russkel commented on July 29, 2024

Sorry for being so unhelpful but the PDFs I have aren't really suitable for
disclosure. I'll have a quick look and see what's up with the docinfo.



russkel avatar russkel commented on July 29, 2024
ipdb> self.doc.info
[{'Producer': 'Microsoft Reporting Services PDF Rendering Extension 10.0.0.0', 'Creator': 'Microsoft Reporting Services 10.0.0.0', 'Author': '', 'Title': '\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n\x00 \x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002', 'CreationDate': "D:20140527120206+10'00'", 'Subject': ''}]

It appears there are some nasty characters in this PDF's docinfo.

Opening the file in Acrobat and saving it seemed to clean it up. Corrupted PDF?
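
A quick check of what the unicode-escape attempt does to this Title, written for Python 3 for illustration (the thread itself runs on Python 2.7). The byte string is copied from the docinfo output above; the rest is an illustrative sketch, and the NUL characters it finds are most likely what lxml objects to.

```python
# Title bytes as shown in self.doc.info above.
title = (
    b"\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n\x00 "
    b"\x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002"
)

# With no backslashes in the input, unicode-escape maps each byte to the
# code point of the same value, so the zero bytes survive as NUL characters.
escaped = title.decode("unicode_escape")

print("\x00" in escaped)          # True if NUL characters are present
print(len(title), len(escaped))   # one character per input byte
```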


xuewei4d avatar xuewei4d commented on July 29, 2024

@russkel
I think root.set(k, v.decode('unicode-escape')) works fine for the nasty Title, which comes out as þÿInspection Report v2.2
A random guess: it might be due to the empty Subject, as suggested by the lxml message "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters".

@jcushman
Thanks for your response. Here are my thoughts.
When text extracted from PDFs is stored in XML, illegal characters in the text should be converted to Unicode or discarded. I found two places that might need checking.
I made some modifications, but they are really brute-force, not user-friendly, and not guaranteed to be 100% correct. I will create a branch in my repository. Hope that helps.


jcushman avatar jcushman commented on July 29, 2024

OK, I just pushed a change to master that seems to fix russkel's problem. Russkel's Title string turns out to be encoded in UTF-16. (Presumably re-saving it 'fixed' the problem by causing it to be encoded in ASCII, since it doesn't actually have any non-ASCII characters.) I added a smart_unicode_decode function that inspects the string with chardet and decodes with the detected encoding, so russkel's Title is correctly decoded.
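
To illustrate the UTF-16 point with russkel's Title bytes, here is a small Python 3 sketch. It uses the standard utf-16 codec directly rather than chardet, so it is not the smart_unicode_decode function itself.

```python
# Title bytes copied from the docinfo output earlier in the thread.
title = (
    b"\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n\x00 "
    b"\x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002"
)

# The utf-16 codec reads the leading byte order mark (FE FF means
# big-endian), consumes it, and decodes the remaining byte pairs.
decoded = title.decode("utf-16")
print(decoded)
```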

xuewei4d, can you see if my change works with your files as well? It should yield better results than .decode('unicode-escape') in general. I assume we'll need to use smart_unicode_decode at Line 358 as well, and maybe elsewhere, but I'm hesitant to mess with it too much without having a test file.

(Probably the PDF has the correct encoding stored somewhere in it as well, so we don't have to try to guess the encoding.)
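
On not having to guess: PDF text strings that are not in PDFDocEncoding are written as UTF-16BE with a leading byte order mark, so the BOM itself is a usable signal. A rough sketch under that assumption, not pdfquery's smart_unicode_decode: the name decode_pdf_text_string is hypothetical, and Latin-1 is used only as an approximation of PDFDocEncoding.

```python
import codecs


def decode_pdf_text_string(raw):
    """Decode a docinfo byte string using its byte order mark, if any."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Skip the two BOM bytes and decode the rest as big-endian UTF-16.
        body = raw[len(codecs.BOM_UTF16_BE):]
        return body.decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 stands in for PDFDocEncoding, which it only
    # approximates.
    return raw.decode("latin-1")


print(decode_pdf_text_string(b"\xfe\xff\x00v\x002\x00.\x002"))
print(decode_pdf_text_string(b"Microsoft Reporting Services 10.0.0.0"))
print(repr(decode_pdf_text_string(b"")))
```

An empty value such as the Author and Subject fields above falls through to the second branch and comes back as an empty string.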

Thanks,
Jack


russkel avatar russkel commented on July 29, 2024

Just tested master and am getting this error processing the PDF:

Traceback (most recent call last):
  File "testpdfquery.py", line 3, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 254, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 331, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 87, in smart_unicode_decode
    decoded_string = unicode(encoded_string, encoding=detected_encoding['encoding'], errors='replace')
TypeError: unicode() argument 2 must be string, not None


jcushman avatar jcushman commented on July 29, 2024

Hmm, OK. I pushed something to default to 'utf8' if no encoding is detected. Try now?
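
The fallback described here could look roughly like the following in Python 3. This is a sketch rather than the actual commit: the function name decode_with_fallback is hypothetical, and the dicts only mimic the shape of the detection result used in the traceback (a mapping with an 'encoding' key).

```python
def decode_with_fallback(encoded_string, detected):
    """Decode bytes with the detected encoding, or utf8 if none was found."""
    # detected["encoding"] is None when detection fails, so fall back.
    encoding = detected.get("encoding") or "utf8"
    return encoded_string.decode(encoding, errors="replace")


# Hypothetical detection results, written by hand for the example.
print(decode_with_fallback(b"Inspection", {"encoding": None, "confidence": 0.0}))
print(decode_with_fallback(b"caf\xe9", {"encoding": "latin-1", "confidence": 0.7}))
```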


xuewei4d avatar xuewei4d commented on July 29, 2024

I probably misunderstood what my snippet does. The way @jcushman handles the unicode problem is right, despite the minor problem pointed out by @russkel.


russkel avatar russkel commented on July 29, 2024

Different error now:

Traceback (most recent call last):
  File "testpdfquery.py", line 3, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 254, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 331, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 90, in smart_unicode_decode
    if decoded_string[0] in bom_headers:
IndexError: string index out of range


russkel avatar russkel commented on July 29, 2024

Okay, this error occurs because:

ipdb> encoded_string
''
ipdb> chardet.detect(encoded_string)
{'confidence': 0.0, 'encoding': None}

I believe this means we just need to check for strings of len == 0.

I changed line 90 to:

if len(decoded_string) > 0 and decoded_string[0] in bom_headers:

And everything appears to be working fine.
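
An equivalent guard that avoids indexing altogether is str.startswith, which returns False on an empty string instead of raising. A sketch only: the real contents of bom_headers and what pdfquery does after the check are not shown in this thread, so the tuple below is a placeholder and stripping the character is an assumption.

```python
# Placeholder: the real bom_headers in pdfquery is not shown in the thread.
bom_headers = ("\ufeff", "\ufffe")


def strip_leading_bom(decoded_string):
    """Drop one leading BOM character, tolerating empty input."""
    # startswith accepts a tuple of prefixes and is safe on an empty string.
    if decoded_string.startswith(bom_headers):
        return decoded_string[1:]
    return decoded_string


print(repr(strip_leading_bom("")))
print(repr(strip_leading_bom("\ufeffInspection Report v2.2")))
```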


russkel avatar russkel commented on July 29, 2024

I feel like an idiot for not thinking of this before: I made a PDF with Acrobat that has unicode in the title.
https://www.dropbox.com/s/sb0s4dbcigqvq2r/unicode_docinfo.pdf
This should be suitable for your tests.


jcushman avatar jcushman commented on July 29, 2024

You're absolutely right -- we just needed to check for empty strings. I pushed a change to do that, and also added a test to the test suite with your unicode-title PDF. Seems to work?


russkel avatar russkel commented on July 29, 2024

Looking all good this side.

Thanks for sorting that out and for the library.


jcushman avatar jcushman commented on July 29, 2024

OK, I just pushed v. 0.2.4 to PyPI with better unicode handling for doc.info.

@xuewei4d, if you have a test case for the problem you had with branch.text = node.get_text() (either a full PDF to test, or the value of some intermediate variable that can be used to reproduce the error), maybe open a new bug?

Thanks,
Jack

