When I use pdfquery to process a scholar PDF, I found a unicode problem in Line 305, p

unicode problem when processing doc.info about pdfquery HOT 17 CLOSED

xuewei4d commented on July 29, 2024

unicode problem when processing doc.info

from pdfquery.

Comments (17)

xuewei4d commented on July 29, 2024

Another problem is Line 358 in pdfquery.py

branch.text = node.get_text()

I suggest removing illegal XML characters here.
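A minimal sketch of what removing illegal XML characters could look like. This is not pdfquery code: `strip_illegal_xml_chars` is a hypothetical helper, and it is written for Python 3 strings, whereas this thread is on Python 2.7. The character ranges follow the XML 1.0 definition of a legal character.

```python
import re

# Code points XML 1.0 does not allow in a document:
# C0 controls other than tab/LF/CR, surrogates, and U+FFFE / U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text):
    """Return text with every XML-illegal character removed (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Usage idea (assumption): branch.text = strip_illegal_xml_chars(node.get_text())
```

Dropping characters silently loses information, so replacing them with a visible placeholder may be preferable in some cases.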

russkel commented on July 29, 2024

What's the go with this issue? I just ran into it as well.

Changing the line to unicode-escape leads to issues such as:

Traceback (most recent call last):
  File "/Users/russ/PycharmProjects/solar_inspection_report/si_rename.py", line 11, in <module>
    pdf.load(2)  # load only the 2nd page to save CPU time
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 230, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 307, in get_tree
    root.set(k, v.decode('unicode-escape'))
  File "lxml.etree.pyx", line 746, in lxml.etree._Element.set (src/lxml/lxml.etree.c:42970)
  File "apihelpers.pxi", line 547, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:19025)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

xuewei4d commented on July 29, 2024

@russkel I suggest debugging around Line 305, for example by printing the keys and values. Maybe the key k in your PDF file also contains illegal characters.
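One way to do that kind of debugging, as a rough sketch: walk the docinfo dict and list the entries that contain NUL or other control bytes. `find_suspect_entries` is a hypothetical name, not part of pdfquery, and the sketch assumes Python 3, where the values are typically bytes.

```python
def find_suspect_entries(info):
    """Return (key, repr(value)) pairs whose key or value holds control bytes."""

    def has_control(data):
        # Only text-like values are inspected; anything else is skipped.
        if isinstance(data, str):
            data = data.encode("utf-8", errors="replace")
        if not isinstance(data, bytes):
            return False
        # Tab, LF and CR are legal in XML; other bytes below 0x20 are not.
        return any(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in data)

    return [
        (key, repr(value))
        for key, value in info.items()
        if has_control(key) or has_control(value)
    ]


if __name__ == "__main__":
    sample = {"Title": b"\xfe\xff\x00I\x00n", "Subject": b""}
    for key, shown in find_suspect_entries(sample):
        print(key, shown)
```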

jcushman commented on July 29, 2024

Unicode issues are tricky! If you can point me to a PDF that causes the issue I may be able to debug. Even better if you can give me a patch/pull request that fixes the issue with the PDF you point me to ...

Thanks,
Jack

russkel commented on July 29, 2024

Sorry for being so unhelpful but the PDFs I have aren't really suitable for
disclosure. I'll have a quick look and see what's up with the docinfo.

russkel commented on July 29, 2024

ipdb> self.doc.info
[{'Producer': 'Microsoft Reporting Services PDF Rendering Extension 10.0.0.0', 'Creator': 'Microsoft Reporting Services 10.0.0.0', 'Author': '', 'Title': '\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n\x00 \x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002', 'CreationDate': "D:20140527120206+10'00'", 'Subject': ''}]

It appears there are some nasty characters in this PDF's docinfo.

Opening the file in Acrobat and saving it seemed to clean it up. Corrupted PDF?

xuewei4d commented on July 29, 2024

@russkel
I think root.set(k, v.decode('unicode-escape')) works fine for the nasty Title, which is þÿInspection Report v2.2
Random guess: it might be due to the empty Subject, as suggested by the lxml message "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters".

@jcushman
Thanks for your response. Here are my thoughts.
When text extracted from PDFs is stored in XML, illegal characters in the text should be converted to unicode or discarded. I found two places which might need some checking.
I made some modifications, but they are really brute-force, not friendly, and not guaranteed to be 100% correct. I will create a branch in my repository. Hope that helps.

jcushman commented on July 29, 2024

OK, I just pushed a change to master that seems to fix russkel's problem. Russkel's Title string turns out to be encoded in UTF-16. (Presumably re-saving it 'fixed' the problem by causing it to be encoded in ASCII, since it doesn't actually have any non-ASCII characters.) I added a smart_unicode_decode function that inspects the string with chardet and decodes with the detected encoding, so russkel's Title is correctly decoded.
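To illustrate the UTF-16 point with the Title bytes from the ipdb output above, here is a small Python 3 sketch that uses only the standard library. It is not the `smart_unicode_decode` implementation, just a check of what the two decodings produce.

```python
import codecs

# Title value copied from the doc.info dump earlier in this thread.
title = (
    b"\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n\x00 "
    b"\x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002"
)

# unicode-escape maps each plain byte to one code point, so the NUL bytes
# survive the decode, and NULs are what lxml refuses to store.
escaped = title.decode("unicode-escape")
has_nul = "\x00" in escaped  # True

# The leading FE FF is the UTF-16 big-endian byte order mark.
starts_with_bom = title.startswith(codecs.BOM_UTF16_BE)  # True

# Decoding as UTF-16 consumes the BOM and pairs up the bytes correctly.
decoded = title.decode("utf-16")
print(decoded)  # Inspection Report v2.2
```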

xuewei4d, can you see if my change works with your files as well? It should yield better results than .decode('unicode-escape') in general. I assume we'll need to use smart_unicode_decode at Line 358 as well, and maybe elsewhere, but I'm hesitant to mess with it too much without having a test file.

(Probably the PDF has the correct encoding stored somewhere in it as well, so we don't have to try to guess the encoding.)
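On not having to guess: as far as I know, the PDF specification says a text string is either in PDFDocEncoding or in UTF-16BE with a leading FE FF byte order mark. Below is a rough Python 3 sketch under that assumption. `decode_pdf_text_string` is a hypothetical name, it is not what pdfquery does (the change described above uses chardet), and latin-1 is only an approximation of PDFDocEncoding.

```python
import codecs


def decode_pdf_text_string(raw):
    """Decode a PDF text string by checking for the UTF-16BE byte order mark."""
    bom = codecs.BOM_UTF16_BE  # b"\xfe\xff"
    if raw.startswith(bom):
        # Skip the BOM and decode the remainder as big-endian UTF-16.
        return raw[len(bom):].decode("utf-16-be", errors="replace")
    # Assumption: latin-1 stands in for PDFDocEncoding; the two differ
    # for a number of byte values, so this is not exact.
    return raw.decode("latin-1")
```

This also handles the empty string, since `b"".startswith(...)` is simply False.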

Thanks,
Jack

russkel commented on July 29, 2024

Just tested master and am getting this error processing the PDF:

Traceback (most recent call last):
  File "testpdfquery.py", line 3, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 254, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 331, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 87, in smart_unicode_decode
    decoded_string = unicode(encoded_string, encoding=detected_encoding['encoding'], errors='replace')
TypeError: unicode() argument 2 must be string, not None

jcushman commented on July 29, 2024

Hmm, OK. I pushed something to default to 'utf8' if no encoding is detected. Try now?
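A sketch of that kind of fallback, for illustration only. `decode_with_fallback` is a hypothetical name, and `detection` stands for a dict shaped like the result shown by chardet.detect in this thread (an 'encoding' key that may be None). It is written for Python 3 and is not the actual pdfquery code.

```python
def decode_with_fallback(raw, detection, default="utf8"):
    """Decode raw bytes with the detected encoding, or default if none was found."""
    # A detector that cannot decide reports the encoding as None.
    encoding = detection.get("encoding") or default
    return raw.decode(encoding, errors="replace")
```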

xuewei4d commented on July 29, 2024

I probably misunderstood what my snippet does. The way @jcushman handles the unicode problem is right, despite the minor problem pointed out by @russkel.

russkel commented on July 29, 2024

Different error now:

Traceback (most recent call last):
  File "testpdfquery.py", line 3, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 254, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 331, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 90, in smart_unicode_decode
    if decoded_string[0] in bom_headers:
IndexError: string index out of range

russkel commented on July 29, 2024

Okay, this error occurs because:

ipdb> encoded_string
''
ipdb> chardet.detect(encoded_string)
{'confidence': 0.0, 'encoding': None}

I believe this means we just need to check for strings of len == 0.

I changed line 90 to:

if len(decoded_string) > 0 and decoded_string[0] in bom_headers:

And everything appears to be working fine.
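For reference, the same guard as a standalone Python 3 sketch. The contents of `BOM_HEADERS` and the BOM-stripping behaviour are assumptions made for illustration, since the real `bom_headers` in pdfquery is not shown in this thread.

```python
# Hypothetical stand-in for pdfquery's bom_headers; actual contents may differ.
BOM_HEADERS = ("\ufeff", "\ufffe")


def strip_leading_bom(decoded_string):
    """Drop a leading byte order mark, tolerating empty input."""
    # The truthiness test skips empty strings, so [0] can never raise IndexError.
    if decoded_string and decoded_string[0] in BOM_HEADERS:
        return decoded_string[1:]
    return decoded_string
```

An equivalent option is `decoded_string[:1] in BOM_HEADERS`, since slicing an empty string returns an empty string instead of raising. That relies on `BOM_HEADERS` being a tuple or list: with a plain string, `"" in "..."` would be True.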

russkel commented on July 29, 2024

I feel like an idiot for not thinking of this before: I made a PDF with Acrobat that has unicode in the title.
https://www.dropbox.com/s/sb0s4dbcigqvq2r/unicode_docinfo.pdf
This should be suitable for your tests.

jcushman commented on July 29, 2024

You're absolutely right -- we just needed to check for empty strings. I pushed a change to do that, and also added a test to the test suite with your unicode-title PDF. Seems to work?

russkel commented on July 29, 2024

Looking all good this side.

Thanks for sorting that out and for the library.

jcushman commented on July 29, 2024

OK, I just pushed v. 0.2.4 to PyPI with better unicode handling for doc.info.

@xuewei4d, if you have a test case for the problem you had with branch.text = node.get_text() (either a full PDF to test, or the value of some intermediate variable that can be used to reproduce the error), maybe open a new bug?

Thanks,
Jack
