Comments (17)
Another problem is line 358 in pdfquery.py:
branch.text = node.get_text()
I suggest removing illegal XML characters here.
from pdfquery.
What's the go with this issue? I just ran into it as well.
Changing the line to unicode-escape leads to issues such as:
Traceback (most recent call last):
  File "/Users/russ/PycharmProjects/solar_inspection_report/si_rename.py", line 11, in <module>
    pdf.load(2) # load only the 2nd page to save CPU time
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 230, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 307, in get_tree
    root.set(k, v.decode('unicode-escape'))
  File "lxml.etree.pyx", line 746, in lxml.etree._Element.set (src/lxml/lxml.etree.c:42970)
  File "apihelpers.pxi", line 547, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:19025)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
from pdfquery.
@russkel I suggest debugging around line 305, for example by printing something out. Maybe the key k in your PDF file also contains illegal characters.
from pdfquery.
Unicode issues are tricky! If you can point me to a PDF that causes the issue I may be able to debug. Even better if you can give me a patch/pull request that fixes the issue with the PDF you point me to ...
Thanks,
Jack
from pdfquery.
Sorry for being so unhelpful but the PDFs I have aren't really suitable for
disclosure. I'll have a quick look and see what's up with the docinfo.
On 27 May 2014 22:42, jcushman wrote:
> Unicode issues are tricky! If you can point me to a PDF that causes the
> issue I may be able to debug. Even better if you can give me a patch/pull
> request that fixes the issue with the PDF you point me to ...
> Thanks,
> Jack
from pdfquery.
ipdb> self.doc.info
[{'Producer': 'Microsoft Reporting Services PDF Rendering Extension 10.0.0.0', 'Creator': 'Microsoft Reporting Services 10.0.0.0', 'Author': '', 'Title': '\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n\x00 \x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002', 'CreationDate': "D:20140527120206+10'00'", 'Subject': ''}]
It appears there are some nasty characters in this PDF's docinfo.
Opening the file in Acrobat and saving it seemed to clean it up. Corrupted PDF?
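For readers following along, here is a small sketch of why this Title is a problem for lxml. It is written for Python 3 (the thread itself runs Python 2.7), the byte literal is copied from the docinfo dump above, and the variable names are illustrative only. Treating each byte as one character leaves a NUL character before every letter:

```python
# Title value copied from the docinfo dump above, as a Python 3 bytes literal.
title = (b"\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n"
         b"\x00 \x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002")

# latin-1 maps every byte to the code point with the same value,
# so nothing is lost and nothing is interpreted.
as_chars = title.decode("latin-1")

# Count the NUL characters that survive this byte-per-character reading.
nul_count = as_chars.count("\x00")

print(repr(as_chars[:6]))
print("NUL characters:", nul_count)
```

The lxml error message quoted earlier names NUL bytes explicitly, so any decoding that keeps them would be expected to fail in the same way.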
from pdfquery.
@russkel
I think root.set(k, v.decode('unicode-escape')) works fine for the nasty Title, which is þÿInspection Report v2.2.
Random guess: it might be due to the empty Subject, as suggested by the lxml error "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters".
@jcushman
Thanks for your response. Here are my thoughts.
When text extracted from PDFs is stored in XML, illegal characters in the text should be converted to unicode or discarded. I found two places that might need checking.
I made some modifications, but they are really brute-force, not friendly, and not guaranteed to be 100% correct. I will create a branch in my repository. Hope that helps.
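As an illustration of the "discard illegal characters" idea, here is a minimal sketch in Python 3. It is not the code from the branch mentioned above. The function name strip_illegal_xml_chars is hypothetical, and the character ranges follow the Char production of the XML 1.0 specification:

```python
import re

# Anything outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 does not allow removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# NUL and vertical tab are dropped; the ordinary tab is kept.
cleaned = strip_illegal_xml_chars("a\x00b\x0bc\td")
print(repr(cleaned))
```

Dropping characters silently loses information, so replacing them with a visible placeholder may be preferable in some cases; that choice is left open here.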
from pdfquery.
OK, I just pushed a change to master that seems to fix russkel's problem. Russkel's Title string turns out to be encoded in UTF-16. (Presumably re-saving it 'fixed' the problem by causing it to be encoded in ASCII, since it doesn't actually have any non-ASCII characters.) I added a smart_unicode_decode function that inspects the string with chardet and decodes with the detected encoding, so russkel's Title is correctly decoded.
xuewei4d, can you see if my change works with your files as well? It should yield better results than .decode('unicode-escape') in general. I assume we'll need to use smart_unicode_decode at Line 358 as well, and maybe elsewhere, but I'm hesitant to mess with it too much without having a test file.
(Probably the PDF has the correct encoding stored somewhere in it as well, so we don't have to try to guess the encoding.)
Thanks,
Jack
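The following is a rough sketch, in Python 3, of the alternative hinted at in the last paragraph: reading the encoding from the string itself instead of guessing. It is not pdfquery's smart_unicode_decode, which uses chardet. The function name decode_docinfo_value is hypothetical, and the UTF-8 fallback for strings without a byte order mark is an assumption made for this example:

```python
import codecs


def decode_docinfo_value(raw: bytes) -> str:
    """Decode a docinfo byte string, trusting a leading byte order mark."""
    # (BOM bytes, codec to use for the bytes that follow the BOM)
    known_boms = (
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF8, "utf-8"),
    )
    for bom, codec in known_boms:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec, errors="replace")
    # Assumption: with no BOM, fall back to UTF-8 and replace bad bytes.
    return raw.decode("utf-8", errors="replace")


# Title bytes copied from the docinfo dump earlier in the thread.
title = (b"\xfe\xff\x00I\x00n\x00s\x00p\x00e\x00c\x00t\x00i\x00o\x00n"
         b"\x00 \x00R\x00e\x00p\x00o\x00r\x00t\x00 \x00v\x002\x00.\x002")
print(decode_docinfo_value(title))
```

The leading \xfe\xff in the Title is the UTF-16 big-endian byte order mark, which is why no guessing is needed for this particular value.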
from pdfquery.
Just tested master and am getting this error processing the PDF:
Traceback (most recent call last):
  File "testpdfquery.py", line 3, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 254, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 331, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 87, in smart_unicode_decode
    decoded_string = unicode(encoded_string, encoding=detected_encoding['encoding'], errors='replace')
TypeError: unicode() argument 2 must be string, not None
from pdfquery.
Hmm, OK. I pushed something to default to 'utf8' if no encoding is detected. Try now?
from pdfquery.
I probably misunderstood what my snippet does. The way @jcushman handles the unicode problem is right, despite the minor problem pointed out by @russkel.
from pdfquery.
Different error now:
Traceback (most recent call last):
  File "testpdfquery.py", line 3, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 254, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 331, in get_tree
    root.set(k, smart_unicode_decode(v))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 90, in smart_unicode_decode
    if decoded_string[0] in bom_headers:
IndexError: string index out of range
from pdfquery.
Okay, this error occurs because:
ipdb> encoded_string
''
ipdb> chardet.detect(encoded_string)
{'confidence': 0.0, 'encoding': None}
I believe this means we just need to check for strings of len == 0.
I changed line 90 to:
if len(decoded_string) > 0 and decoded_string[0] in bom_headers:
And everything appears to be working fine.
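Putting the two fixes from this part of the thread side by side, here is a hedged Python 3 sketch of the overall shape: fall back to 'utf8' when no encoding is detected, and check for an empty string before indexing. It is not the actual pdfquery source. The names smart_decode_sketch, BOM_CHARS, and no_guess are hypothetical, the contents of the real bom_headers are assumed, and the detector is passed in as a parameter so the example runs without chardet installed:

```python
# Assumption: the real bom_headers holds decoded BOM characters like this one.
BOM_CHARS = {"\ufeff"}


def smart_decode_sketch(raw: bytes, detect) -> str:
    """Decode raw using detect(raw)['encoding'], defaulting to utf8."""
    # The detector may report None (as in the ipdb session above).
    encoding = detect(raw).get("encoding") or "utf8"
    decoded = raw.decode(encoding, errors="replace")
    # Guard against empty strings before looking at the first character.
    if len(decoded) > 0 and decoded[0] in BOM_CHARS:
        decoded = decoded[1:]
    return decoded


def no_guess(raw: bytes) -> dict:
    """Stand-in detector mimicking the 'no encoding found' result above."""
    return {"confidence": 0.0, "encoding": None}


print(repr(smart_decode_sketch(b"", no_guess)))
print(repr(smart_decode_sketch(b"\xef\xbb\xbfabc", no_guess)))
```

With chardet installed, chardet.detect could be passed as the detect argument in place of the stand-in.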
from pdfquery.
I feel like an idiot for not thinking of this before: I made a PDF with Acrobat that has unicode in the title.
https://www.dropbox.com/s/sb0s4dbcigqvq2r/unicode_docinfo.pdf
This should be suitable for your tests.
from pdfquery.
You're absolutely right -- we just needed to check for empty strings. I pushed a change to do that, and also added a test to the test suite with your unicode-title PDF. Seems to work?
from pdfquery.
Looking all good this side.
Thanks for sorting that out and for the library.
from pdfquery.
OK, I just pushed v. 0.2.4 to PyPI with better unicode handling for doc.info.
@xuewei4d, if you have a test case for the problem you had with branch.text = node.get_text() (either a full PDF to test, or the value of some intermediate variable that can be used to reproduce the error), maybe open a new bug?
Thanks,
Jack
from pdfquery.