pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
Home Page: https://pdfminersix.readthedocs.io
License: MIT License
I tried to install pdfminer with pip on Windows, but got an error about the wheel bundle for pyCrypto. The problem was solved by changing one line in the setup file, from requires = ['six', 'pycrypto'] to requires = ['six', 'pycryptodome'].
I think it would be nice to detect whether the install is running on Windows and set requires accordingly.
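A minimal sketch of the platform check suggested above, assuming the dependency list is built in the setup file. The helper name crypto_requirement is hypothetical and not part of pdfminer.six; only the two package names come from the report.

```python
import sys


def crypto_requirement(platform=sys.platform):
    """Pick the crypto dependency for the given platform string.

    Hypothetical helper: 'pycryptodome' on Windows, where building
    pycrypto failed for the reporter, and 'pycrypto' elsewhere.
    """
    return "pycryptodome" if platform.startswith("win") else "pycrypto"


# The list that the setup file would hand to setup().
requires = ["six", crypto_requirement()]
```

Passing the platform as a parameter keeps the choice easy to check without actually running on Windows.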
https://pypi.python.org/pypi/pdfminer2 has v20151206, while latest seems to be v 20160614.
Thanks!
I updated Python from 2.7 to 3.4 on my laptop (Windows).
As a test, I ran the command-line command pdf2txt.py simple1.pdf
It returns ImportError: no module named 'six'
I guess there's something wrong with the third line: import six
Does anyone know how to fix it?
Would it be possible to do this?
I'm maintaining pdfminer in Fedora (and for F24+ I've switched the package over to pdfminer.six; nice work, by the way!), and the use of /usr/bin/env python at the top of the library files is... confusing rpm's automatic dependency detection. As a result, python3-pdfminer thinks it needs a python2 install on the system...
To fix this, I wrote a patch yesterday to remove these lines and rebuilt the package, and things appear to still work just fine. I'd be happy to open a pull request for it assuming that this is a desirable change.
Looking at the code, it seems that:
When I install pdfminer from PyPi the source is different than downloaded from github for the same tag.
One example is logging. More especially this change 1d54ecd which should be present from version 20160614 (one year ago). After that version there are two new versions.
When I download the package from PyPi for version 20170419 (https://pypi.python.org/packages/43/71/b592b9b384c9bc4429e9a35cc9d61a5eb7fabef2208140c30550a474defe/pdfminer.six-20170419.tar.gz#md5=c43b443ad759441adb07fde5f1ca3435) this change is not there. But when I download the archive from Github for that tag (https://github.com/pdfminer/pdfminer.six/archive/20170419.zip) everything is there.
Now I'm forced to work around the installation in requirements.txt by adding:
https://github.com/pdfminer/pdfminer.six/archive/20170419.zip#egg=pdfminer.six==20170419
Which is not ideal.
I'm wondering what may be the issue with the package in PyPi?
@goulu: Thanks for this awesome package. It works like a charm. It actually resolves this issue which I was facing while using pdfminer3k.
I have run into an issue with this pdf file. I am trying to get an xml output from it by running pdf2txt.py -A -o output.xml -t xml 2b.pdf. But the output xml just contains the following and misses all the text information:
Interestingly, when I convert this file to xml using pdfminer3k it gives a "list index out of range" error at this line. And if I change the code at that line to the following then it works.
if x:
    try:
        objid1 = x[-2]
        genno = x[-1]
    except IndexError:
        # x holds fewer than two entries, so there is no object reference
        return None
Can you please help?
Hi there,
I'm currently trying to use pdfminer within a Jupyter notebook to convert PDF files to text, but I fail miserably :/ I know that you provide the command line tool pdf2txt.py, but isn't this also possible in another way? Let's say I use the example code you provided up to the following point:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# Open a PDF document.
fp = open('sample.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
How could I create a text file out of this then? Would it be possible for you to provide a function with the pdf2txt.py functionality?
Thanks anyway for the package :)
When running the command below on pdfminer.six version 20160202 in Python 2.7.10, the error message NameError: global name 'ImageWriter' is not defined occurred.
$ pdf2txt.py -O myoutput -o myoutput/myfile.html -t html -p 1,3 myfile.pdf
The problem arises when you try to run pdf2txt. The error trace states that it cannot import maxint. I'm running Python 3.5.1. After researching the error, I've come to the conclusion that Python 3 removed the constant sys.maxint; hence the inability to import it.
Um...
Please send aid.
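A minimal sketch of one way to get such a constant on both Python 2 and Python 3, assuming the code only needs "a very large integer". The name MAXINT is hypothetical and is not taken from the pdfminer.six source.

```python
import sys

# sys.maxint exists only on Python 2; sys.maxsize is available on
# Python 2.6+ and on every Python 3 release.
MAXINT = getattr(sys, "maxint", sys.maxsize)
```

Using getattr with a fallback avoids the failing import while keeping the Python 2 value unchanged.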
Errors
pip install pdfminer.six
... 100% ... but finished with a permission error:
Successfully built pdfminer.six
Installing collected packages: pdfminer.six
Exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python2.7/site-packages/pip/basecommand.py", line 215, in main
status = self.run(options, args)
.....
OSError: [Errno 13] Permission ... '/usr/local/lib/python2.7/dist-packages/pdfminer'
sudo pip install pdfminer.six
resulted in:
The directory '/home/user/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/user/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting pdfminer.six
Requirement already satisfied: six in /usr/lib/python2.7/dist-packages (from pdfminer.six)
Requirement already satisfied: pycrypto in /usr/lib/python2.7/dist-packages (from pdfminer.six)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20170419
but there is no way to call it:
pdf2txt.py myFile.pdf
produced the error "/usr/bin/env: “python\r”: not found"
Hi,
I am using Python 3.6 and I cannot set up pdfminer.six.
While running pdf2txt.py samples/simple1.pdf, an error appears:
ModuleNotFoundError: No module named 'pdfminer.settings'
Has anyone run into the same problem?
Thank you very much in advance for your help!
Today, the parser ignores the painting information it extracts (stroke, colors, fill, etc.), saving only the line width. I created a patch to add more information, which helps in some cases.
Is it possible to just retrieve all the text on the page with each fragment returned with its bounding box, i.e., (x1, y1, x2, y2, text) -- with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis with minimal overheads.
I am trying to convert a PDF to a tag file. It worked perfectly fine in Python 2. When I tried the same thing in Python 3, I got this error. Is there any workaround?
/home/ubuntu/anaconda3/envs/py35/lib/python3.5/site-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/utils.py in make_compat_str(in_str)
24 def make_compat_str(in_str):
25 "In Py2, does nothing. In Py3, converts to string, guessing encoding."
---> 26 assert isinstance(in_str, (bytes, str, unicode))
27 if six.PY3 and isinstance(in_str, bytes):
28 enc = chardet.detect(in_str)
AssertionError:
I'm occasionally getting an error:
File "c:\Anaconda3\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\psparser.py", line 233, in fillbuf
if self.charpos < len(self.buf):
TypeError: '<' not supported between instances of 'tuple' and 'int'
As far as I can tell the only place this might be caused by is line 350 in psparser.py
return (self._parse_comment, len(s))
where a tuple is in fact returned from function _parse_comment(self, s, i)
I hope this is enough info.
Patrick
When I run this command with this file:
dumppdf -a invalid.pdf
I receive this error message:
$ dumppdf -a invalid.pdf
<pdf>Traceback (most recent call last):
File "/usr/bin/dumppdf", line 268, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/usr/bin/dumppdf", line 265, in main
dumpall=dumpall, codec=codec, extractdir=extractdir)
File "/usr/bin/dumppdf", line 216, in dumppdf
dumpallobjs(outfp, doc, codec=codec)
File "/usr/bin/dumppdf", line 102, in dumpallobjs
obj = doc.getobj(objid)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 658, in getobj
assert objid != 0
AssertionError
I can't
pdf2txt.py -t xml something.pdf
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,595.320,841.920" rotate="0">
<textbox id="0" bbox="262.250,792.168,339.072,802.128">
<textline bbox="262.250,792.168,339.072,802.128">
Traceback (most recent call last):
File "/path/.venv/bin/pdf2txt.py", line 126, in <module>
if __name__ == '__main__': sys.exit(main())
File "/path/.venv/bin/pdf2txt.py", line 121, in main
outfp = extract_text(**vars(A))
File "/path/.venv/bin/pdf2txt.py", line 61, in extract_text
pdfminer.high_level.extract_text_to_fp(fp, **locals())
File "/path/.venv/lib/python3.5/site-packages/pdfminer/high_level.py", line 83, in extract_text_to_fp
interpreter.process_page(page)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/pdfinterp.py", line 837, in process_page
self.device.end_page(page)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 56, in end_page
self.receive_layout(self.cur_item)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 537, in receive_layout
render(ltpage)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 483, in render
render(child)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 517, in render
render(child)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 508, in render
render(child)
File "/path/.venv/lib/python3.5/site-packages/pdfminer/converter.py", line 521, in render
(enc(item.fontname, None), bbox2str(item.bbox), item.size))
File "/path/.venv/lib/python3.5/site-packages/pdfminer/utils.py", line 277, in enc
x = x.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;').replace('"', '&quot;')
TypeError: a bytes-like object is required, not 'str'
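The traceback suggests that the escaping helper received bytes (here a font name) while the replacement arguments are str. A minimal sketch of a tolerant escape helper, under the assumption that decoding such values as UTF-8 with replacement characters is acceptable. escape_xml is a hypothetical name, not the library's function.

```python
def escape_xml(value, codec="utf-8"):
    """Escape XML special characters, accepting either str or bytes."""
    if isinstance(value, bytes):
        # Assumption: undecodable bytes may be replaced instead of raising.
        value = value.decode(codec, "replace")
    return (value.replace("&", "&amp;")
                 .replace(">", "&gt;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;"))
```

Decoding once at the top keeps every later replace call on a single string type.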
I'm trying to extract all the text from a PDF with this version of PDFMiner, but it splits the text into single letters even when I change the -M, -L or -W options.
I need to extract it in XML format. Is there any option to extract word by word or line by line?
Thanks
In the feedbytes methods, character conversion to byte happens via ord. On Py3 this is not needed, since we're dealing with bytestrings directly.
This was also mentioned in #24 and subsequently solved, but a couple of cases were still missing.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdevice import PDFDevice
rsrcmgr = PDFResourceManager()
laparams = LAParams()
laparams.char_margin = 0.5
laparams.word_margin = 0.5
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(fp):
    interpreter.process_page(page)
    layout = device.get_result()
For some PDF files, device.get_result() appears to return an object whose _objs has length 0. The PDF files contain a table whose cells hold text or numbers. (I used to use pdfminer3k; under that package, these PDF files raise a zlib error exception.) (For some reason I can't attach the offending PDF here.)
File "mycode.py", line 123, in foo
for (level, title, destname, actionref, _) in doc.get_outlines():
File "pdfminer/pdfdocument.py", line 703, in search
for x in search(entry['First'], level+1):
File "pdfminer/pdfdocument.py", line 697, in search
title = decode_text(str_value(entry['Title']))
File "pdfminer/utils.py", line 271, in decode_text
return ''.join(PDFDocEncoding[ord(c)] for c in s)
File "pdfminer/utils.py", line 271, in <genexpr>
return ''.join(PDFDocEncoding[ord(c)] for c in s)
TypeError: ord() expected string of length 1, but int found
I believe the fix for this in Python3 is pretty simple; we shouldn't use ord():
--- a/pdfminer/utils.py
+++ b/pdfminer/utils.py
@@ -268,7 +268,7 @@ def decode_text(s):
if s.startswith(b'\xfe\xff'):
return six.text_type(s[2:], 'utf-16be', 'ignore')
else:
- return ''.join(PDFDocEncoding[ord(c)] for c in s)
+ return ''.join(PDFDocEncoding[c] for c in s)
# enc
... However, the reason this is a bug report and not a pull request is that I doubt it's correct for Py2, and don't really know what the correct portable thing to do is.
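One portable option for the question above is to iterate over a bytearray, which yields integers on both Python 2 and Python 3. The sketch below is an illustration under that assumption, not a tested patch for pdfminer: the table argument stands in for PDFDocEncoding, and LATIN1_TABLE is a made-up stand-in.

```python
def decode_text(s, table):
    """Decode a PDF text string given as bytes.

    `table` maps a byte value (0-255) to a character; in pdfminer this
    role is played by PDFDocEncoding.
    """
    if s.startswith(b"\xfe\xff"):
        # A UTF-16BE byte order mark introduces a Unicode text string.
        return s[2:].decode("utf-16be", "ignore")
    # bytearray yields integers on both Python 2 and Python 3,
    # so no call to ord() is needed.
    return "".join(table[c] for c in bytearray(s))


# Stand-in table for illustration only (Latin-1, not the real PDFDocEncoding).
LATIN1_TABLE = [chr(i) for i in range(256)]
```

The six package also offers six.iterbytes for the same purpose, if an extra call into six is preferred over bytearray.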
This is not a technical issue. It's more about increasing public trust in this repo and its organization.
I've seen that no members are listed in the pdfminer organization. The discussion that caused the creation of this organization suggests that there are quite a few developers involved. Members, would you mind making your organization status publicly visible? This should be possible at https://github.com/orgs/pdfminer/people
There is no organization icon. That looks a bit sad. Can we come up with one and upload it to GitHub? e.g. a modified version of what you find in a web search, if the image license allows derivatives, or an actual free PDF icon
The description of the repository has French spelling (a space before the colon). Can this be fixed? e.g. replace "PDF Parser : fork with Python 2+3 support using six " by "Python PDF Parser -- fork with Python 2+3 support using six" on the repo home.
I'm having trouble converting pdf's into html. Everything seems to work fine except the text is not positioned correctly on the page. It seems like all the text is being bunched together into a few span tags.
I've tried the following to no avail:
I also received some encoding errors, which I was able to get past by switching from "from io import StringIO" to "from six import BytesIO".
Has anyone had any success in converting PDFs to HTML? If so, would you mind sharing your configuration? I've attached a sample config code and html output file for reference:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from six import BytesIO

def convert_pdf_to_html(pdf_path, html_path):
    """Converts PDF to HTML file
    ARGS:
        pdf_path: full path of pdf file to convert to html
        html_path: full path of html file containing extracted pdf data
    """
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'UTF-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec)
    fp = open(pdf_path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    fstr = ''
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fstr = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    fstr = fstr.replace(b'\n', b"")
    html_file = open(html_path, 'wb')
    html_file.write(fstr)
The following simple python code illustrates a bug with parsing the attached PDF file. Specifically, it incorrectly determines the height of text. Namely it thinks the small text is much larger than the big text.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTChar
def parse_pages():
    fp = open('WrongFontSizes3.pdf', 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(char_margin=3.5, all_texts=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        yield layout

if __name__ == '__main__':
    for page in parse_pages():
        for tbox in page:
            if not isinstance(tbox, LTTextBox):
                continue
            for line in tbox:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    print char.get_text().encode('UTF-8'), char.size
Output:
B 29.4555
i 29.4555
g 29.4555
T 29.4555
e 29.4555
x 29.4555
t 29.4555
S 66.96
m 66.96
a 66.96
l 66.96
l 66.96
66.96
66.96
T 66.96
e 66.96
x 66.96
t 66.96
Process finished with exit code 0
>>> pdfminer.__version__
'1.3.0'
>>> from pdfminer.pdfpage import PDFPage
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'pdfminer.pdfpage'
It works OK in py2. Why?
#py2 demo
from bs4 import BeautifulSoup
import requests
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
from io import open
from pdfminer.pdfpage import PDFPage
def pdf_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    f = requests.get(url).content
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

print pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
I ran into the issue of pdfminer.six replacing strings from the text of my PDF file like 'fi', 'ff' etc. with a char which is displayed in console as a question mark (?). I guess it is some non-ASCII char since I can not replace it with searching for the actual char '?'. I found out that these strings ('fi', 'ff' and so on) are found in the file pdffont.py in a list called STANDARD_STRINGS. I tried commenting them out, to see if it would fix my problem, but it did not.
The PKG_INFO file of pdfminer.six says:
Metadata-Version: 1.1
Name: pdfminer.six
Version: 20160614
Summary: PDF parser and analyzer
If more info is needed to fix the issue, let me know. I can also provide the PDF file that produces the issue. Other than that keep the good work up, I really enjoy pdfminer.six!
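If the unprintable character turns out to be a Unicode ligature such as U+FB01 (an assumption, since the report does not say which code point it is), one possible post-processing step is compatibility normalization. The sketch below uses only the standard library; expand_ligatures is a hypothetical helper name.

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    # NFKC maps the 'fi' ligature to the two letters 'f' and 'i'.
    return unicodedata.normalize("NFKC", text)
```

Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms and so on), so it may change more than the ligatures.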
Extracting text with images raises the error "TypeError: object of type 'zip' has no len()":
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pdfminer/image.py", line 74, in export_image
if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
TypeError: object of type 'zip' has no len()
I also converted the PDF file "i1040nr.pdf" from your test set and got the same error.
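On Python 3, zip returns an iterator, which supports neither len() nor indexing. A minimal sketch of the usual remedy, wrapping the result in list(). The function and sample values are hypothetical and only mimic the shape of the failing expression, not the code in image.py.

```python
def pair_filters(names, params):
    """Pair filter names with their parameters as a real list."""
    # list(...) makes len() and indexing work on Python 3 as on Python 2.
    return list(zip(names, params))


filters = pair_filters(["DCTDecode"], [{}])
```

With a list, the check len(filters) == 1 and filters[0][0] behaves the same on both Python versions.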
The URL should be https://github.com/pdfminer/pdfminer.six/ but is https://github.com/pdfminer/pdfminer
Then, it's also wrong in https://pypi.python.org/pypi/pdfminer.six/20170720
I get a UnicodeEncodeError when using pdfminer (version d79612c from git).
Download https://www.dropbox.com/s/khjfr63o82fa5yn/numbers-test-document.pdf?dl=0 and execute the following script:
#!/usr/bin/env python
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt("numbers-test-document.pdf"))
Traceback (most recent call last):
File "pdfminer_sample3.py", line 32, in <module>
print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
File "pdfminer_sample3.py", line 14, in convert_pdf_to_txt
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/converter.py", line 186, in __init__
PDFConverter.__init__(self, rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/converter.py", line 173, in __init__
self.outfp.write(u"é")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
The attached PDF produces this error:
Traceback (most recent call last):
File "/usr/local/bin/pdf2txt.py", line 4, in <module>
__import__('pkg_resources').run_script('pdfminer.six==20170119', 'pdf2txt.py')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 127, in <module>
if __name__ == '__main__': sys.exit(main())
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 122, in main
outfp = extract_text(**vars(A))
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/EGG-INFO/scripts/pdf2txt.py", line 62, in extract_text
pdfminer.high_level.extract_text_to_fp(fp, **locals())
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/high_level.py", line 83, in extract_text_to_fp
interpreter.process_page(page)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 852, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 862, in render_contents
self.init_resources(resources)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 362, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 212, in get_font
font = self.get_font(None, subspec)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdfinterp.py", line 203, in get_font
font = PDFCIDFont(self, spec)
File "/usr/local/lib/python3.5/dist-packages/pdfminer.six-20170119-py3.5.egg/pdfminer/pdffont.py", line 672, in __init__
CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
TypeError: a bytes-like object is required, not 'str'
Downloading pdfminer.six-20151013.zip (4.2MB)
100% |################################| 4.2MB 141kB/s
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-QgQ8RT/pdfminer.six/setup.py", line 12, in <module>
install_requires=['six', 'chardet'] if sys.version_info.major>2 else ['six'],
AttributeError: 'tuple' object has no attribute 'major'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-QgQ8RT/pdfminer.six/setup.py", line 12, in <module>
install_requires=['six', 'chardet'] if sys.version_info.major>2 else ['six'],
AttributeError: 'tuple' object has no attribute 'major'
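The named attributes of sys.version_info (such as .major) were added in Python 2.7, so on older interpreters it is a plain tuple. A minimal sketch of a check that works either way by indexing; PY_MAJOR is a hypothetical name, and the two requirement lists are copied from the traceback.

```python
import sys

# Indexing works on every Python version; .major needs 2.7 or newer.
PY_MAJOR = sys.version_info[0]
install_requires = ["six", "chardet"] if PY_MAJOR > 2 else ["six"]
```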
Can you please review and merge PR euske/pdfminer#107
pdf2txt.py fails to run with:
/usr/bin/env: ‘python\r’: No such file or directory
This appears to be due to a DOS carriage return in the shebang line. Running dos2unix pdf2txt.py appears to fix the issue.
Test environment:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
pdfminer.six (20160614) - from PyPi via pip
Running in a virtualenv
$ virtualenv --version
15.0.1
$ python --version
Python 3.5.1+
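Where dos2unix is not available, the same line-ending conversion can be sketched in a few lines of Python. This is an illustration of the idea only, with a hypothetical helper name; it rewrites the file in place, so it is best tried on a copy first.

```python
def to_unix_newlines(path):
    """Rewrite the file at `path` in place with LF line endings.

    Returns True when the file was changed.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        with open(path, "wb") as fh:
            fh.write(fixed)
    return fixed != data
```

Working in binary mode avoids any newline translation by Python itself, so only the CRLF pairs are touched.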
Hey man! Thanks for doing this. I downloaded the package and followed your instructions. Unfortunately, when I try to scrape a PDF, I get an ImportError which says: "no module named 'pdfminer.settings'". I checked whether the settings file was missing from the pdfminer folder, but it isn't. Any idea what the problem might be?
cheers, ed
I am using the excellent pdfminer.six package for analysis of text in PDFs that my clients receive from their clients.
I hit an assert failure while using the PDFPageAggregator converter. Here are the code, PDF, and stack trace:
https://github.com/hughsw/pdfminer.six/blob/master/tools/pdfstats.py
arm_ed_t_board_elektor_magazine_article.pdf
Traceback (most recent call last):
File "./pdfstats.py", line 81, in <module>
sys.exit(main(sys.argv[1:]))
File "./pdfstats.py", line 71, in main
interpreter.process_page(page)
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 851, in process_page
self.device.end_page(page)
File "/usr/local/lib/python3.6/site-packages/pdfminer/converter.py", line 51, in end_page
self.cur_item.analyze(self.laparams)
File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 677, in analyze
obj.analyze(laparams)
File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 724, in analyze
LTLayoutContainer.analyze(self, laparams)
File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 684, in analyze
textboxes = list(self.group_textlines(laparams, textlines))
File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 579, in group_textlines
neighbors = line.find_neighbors(plane, laparams.line_margin)
File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 387, in find_neighbors
return [obj for obj in objs
File "/usr/local/lib/python3.6/site-packages/pdfminer/layout.py", line 387, in <listcomp>
return [obj for obj in objs
File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 373, in find
for k in self._getrange(bbox):
File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 335, in _getrange
for y in drange(y0, y1, self.gridsize):
File "/usr/local/lib/python3.6/site-packages/pdfminer/utils.py", line 173, in drange
assert v0 < v1
AssertionError: (807.874, 807.874, 50)
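The assertion message shows v0 and v1 both equal to 807.874, that is, a bounding box of zero height. A minimal sketch of a variant that tolerates this case follows. The function body is an assumption modelled on the traceback, not a copy of utils.py, and whether a zero-size box should be accepted at all is a design decision for the maintainers.

```python
def drange(v0, v1, d):
    """Return the grid cell indices covering the interval [v0, v1]."""
    # Assumption: a zero-size interval is valid and maps to a single cell.
    assert v0 <= v1, (v0, v1, d)
    return range(int(v0) // d, int(v1 + d) // d)
```

With the values from the report, drange(807.874, 807.874, 50) would yield the single cell index 16 instead of raising.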
I'm trying to migrate some pdfminer code to python3 (which was working with the upstream pdfminer on python2.7) using this version of pdfminer. It fails on:
File "mycode.py", line 123, in main
interpreter.process_page(page)
File "pdfminer/pdfinterp.py", line 834, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "pdfminer/pdfinterp.py", line 846, in render_contents
self.execute(list_value(streams))
File "pdfminer/pdfinterp.py", line 870, in execute
func(*args)
File "pdfminer/pdfinterp.py", line 811, in do_Do
interpreter.render_contents(resources, [xobj], ctm=mult_matrix(matrix, self.ctm))
File "pdfminer/pdfinterp.py", line 846, in render_contents
self.execute(list_value(streams))
File "pdfminer/pdfinterp.py", line 862, in execute
method = 'do_%s' % name.replace('*', '_a').replace('"', '_w').replace("'", '_q')
TypeError: 'str' does not support the buffer interface
This seems to be a string conversion issue due to the following new code in psparser.py:
def keyword_name(x):
    if not isinstance(x, PSKeyword):
        # (snip)
    else:
        name=x.name
        if six.PY3:
            try:
                name = str(name,'utf-8')
            except:
                pass
    return name
Sticking a 'raise' in there (rather than pass) shows that the utf-8 decoding is failing ("invalid start byte"), and indeed the name looks like binary junk. There are a lot of these bad keyword names in this particular PDF, and they're all on the same page, so it may well be a malformed PDF or a parser bug elsewhere. (Sorry, I can't share the PDF.) Nevertheless, pdfminer should probably be able to handle this more robustly, because these bad names would have been ignored by execute() if STRICT was off, which it is by default in the original pdfminer.
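One way to make the conversion above always produce a str is to fall back to Latin-1 when UTF-8 decoding fails, since Latin-1 assigns a code point to every byte. The sketch below illustrates that idea under the assumption that an unrecognisable keyword name should simply be ignored later on; keyword_text is a hypothetical helper, not the library's function.

```python
def keyword_text(name):
    """Return a str for a keyword name that may be arbitrary bytes."""
    if isinstance(name, bytes):
        try:
            return name.decode("utf-8")
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this cannot fail.
            return name.decode("latin-1")
    return name
```

Because the result is always a str, the later name.replace('*', '_a') call no longer raises, and an unknown operator name can be skipped instead of crashing.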
So, I have two sub-buglets here (sorry for lumping them together):
Hi,
I'm trying to convert a simple PDF to HTML using:
pdf2txt.py test.pdf -t html -o test.html
Here is the test PDF file:
test.pdf
html source:
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:118px; width:448px; height:45px;"><span style="font-family: ; font-size:16px">The Portable Document Format (PDF) is the world’s leading language for describing <br>the printed page</span><span style="font-family: ; font-size:15px"> <br></span><span style="font-family: ; font-size:15px"> <br></span></div><span style="position:absolute; border: black 1px solid; left:72px; top:121px; width:445px; height:13px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:135px; width:86px; height:13px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>
Now, the problem is that the width of the line is incorrectly computed, making it wrap differently than in the original doc. This can lead to something like this:
Is there a fix for this issue? If not can you guide me where to look so that I can make a PR with the fix?
I think this tool may be helpful for what we need and in this case we can contribute to it.
Thx a lot!
C:\Python36\Scripts>.\pip3 install pdfminer.six
Collecting pdfminer.six
Using cached pdfminer.six-20170419.tar.gz
Requirement already satisfied: six in c:\python36\lib\site-packages (from pdfminer.six)
Collecting pycrypto (from pdfminer.six)
Using cached pycrypto-2.6.1.tar.gz
Collecting chardet (from pdfminer.six)
Using cached chardet-3.0.3-py2.py3-none-any.whl
Installing collected packages: pycrypto, chardet, pdfminer.six
Running setup.py install for pycrypto ... error
Complete output from command c:\python36\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\john\AppData\Local\Temp\pip-d96gbthz-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\Crypto
copying lib\Crypto\pct_warnings.py -> build\lib.win-amd64-3.6\Crypto
copying lib\Crypto\__init__.py -> build\lib.win-amd64-3.6\Crypto
creating build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\hashalgo.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\HMAC.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD2.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD4.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\MD5.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\RIPEMD.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA224.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA256.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA384.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\SHA512.py -> build\lib.win-amd64-3.6\Crypto\Hash
copying lib\Crypto\Hash\__init__.py -> build\lib.win-amd64-3.6\Crypto\Hash
creating build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\AES.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\ARC2.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\ARC4.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\blockalgo.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\Blowfish.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\CAST.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\DES.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\DES3.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\PKCS1_OAEP.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\PKCS1_v1_5.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\XOR.py -> build\lib.win-amd64-3.6\Crypto\Cipher
copying lib\Crypto\Cipher\__init__.py -> build\lib.win-amd64-3.6\Crypto\Cipher
creating build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\asn1.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\Counter.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\number.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\py3compat.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\randpool.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\RFC1751.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\winrandom.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\_number_new.py -> build\lib.win-amd64-3.6\Crypto\Util
copying lib\Crypto\Util\__init__.py -> build\lib.win-amd64-3.6\Crypto\Util
creating build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\random.py -> build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\_UserFriendlyRNG.py -> build\lib.win-amd64-3.6\Crypto\Random
copying lib\Crypto\Random\__init__.py -> build\lib.win-amd64-3.6\Crypto\Random
creating build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\FortunaAccumulator.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\FortunaGenerator.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\SHAd256.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
copying lib\Crypto\Random\Fortuna\__init__.py -> build\lib.win-amd64-3.6\Crypto\Random\Fortuna
creating build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\fallback.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\nt.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\posix.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\rng_base.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
copying lib\Crypto\Random\OSRNG\__init__.py -> build\lib.win-amd64-3.6\Crypto\Random\OSRNG
creating build\lib.win-amd64-3.6\Crypto\SelfTest
copying lib\Crypto\SelfTest\st_common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest
copying lib\Crypto\SelfTest\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_AES.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_ARC2.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_ARC4.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_Blowfish.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_CAST.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_DES.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_DES3.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_pkcs1_15.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_pkcs1_oaep.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\test_XOR.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
copying lib\Crypto\SelfTest\Cipher\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Cipher
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\common.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_HMAC.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD2.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD4.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_MD5.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_RIPEMD.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA224.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA256.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA384.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\test_SHA512.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
copying lib\Crypto\SelfTest\Hash\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Hash
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_AllOrNothing.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_chaffing.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_KDF.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\test_rfc1751.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
copying lib\Crypto\SelfTest\Protocol\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Protocol
creating build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_DSA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_ElGamal.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_importKey.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\test_RSA.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
copying lib\Crypto\SelfTest\PublicKey\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\PublicKey
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test_random.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test_rpoolcompat.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\test__UserFriendlyRNG.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
copying lib\Crypto\SelfTest\Random\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_FortunaAccumulator.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_FortunaGenerator.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\test_SHAd256.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
copying lib\Crypto\SelfTest\Random\Fortuna\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\Fortuna
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_fallback.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_generic.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_nt.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_posix.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\test_winrandom.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
copying lib\Crypto\SelfTest\Random\OSRNG\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Random\OSRNG
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_asn1.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_Counter.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_number.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\test_winrandom.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
copying lib\Crypto\SelfTest\Util\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Util
creating build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\test_pkcs1_15.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\test_pkcs1_pss.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
copying lib\Crypto\SelfTest\Signature\__init__.py -> build\lib.win-amd64-3.6\Crypto\SelfTest\Signature
creating build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\AllOrNothing.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\Chaffing.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\KDF.py -> build\lib.win-amd64-3.6\Crypto\Protocol
copying lib\Crypto\Protocol\__init__.py -> build\lib.win-amd64-3.6\Crypto\Protocol
creating build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\DSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\ElGamal.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\pubkey.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\RSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\_DSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\_RSA.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\_slowmath.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
copying lib\Crypto\PublicKey\__init__.py -> build\lib.win-amd64-3.6\Crypto\PublicKey
creating build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\PKCS1_PSS.py -> build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\PKCS1_v1_5.py -> build\lib.win-amd64-3.6\Crypto\Signature
copying lib\Crypto\Signature\__init__.py -> build\lib.win-amd64-3.6\Crypto\Signature
Skipping optional fixer: buffer
Skipping optional fixer: idioms
Skipping optional fixer: set_literal
Skipping optional fixer: ws_comma
running build_ext
warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
building 'Crypto.Random.OSRNG.winrandom' extension
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
creating build\temp.win-amd64-3.6\Release\src
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Isrc/ -Isrc/inc-msvc/ -Ic:\python36\include -Ic:\python36\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /Tcsrc/winrand.c /Fobuild\temp.win-amd64-3.6\Release\src/winrand.obj
winrand.c
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(26): error C2061: syntax error: identifier 'intmax_t'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2061: syntax error: identifier 'rem'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(27): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(28): error C2059: syntax error: '}'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2061: syntax error: identifier 'imaxdiv_t'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(30): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(40): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2146: syntax error: missing ')' before identifier '_Number'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2061: syntax error: identifier '_Number'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(41): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(42): error C2059: syntax error: ')'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(45): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2146: syntax error: missing ')' before identifier '_Numerator'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2061: syntax error: identifier '_Numerator'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ';'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(46): error C2059: syntax error: ','
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(48): error C2059: syntax error: ')'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(50): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(56): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(63): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(69): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(76): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(82): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(89): error C2143: syntax error: missing '{' before '__cdecl'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\inttypes.h(95): error C2143: syntax error: missing '{' before '__cdecl'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
----------------------------------------
Command "c:\python36\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\john\AppData\Local\Temp\pip-d96gbthz-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\john\AppData\Local\Temp\pip-build-thbvu0am\pycrypto\
Hey there,
In the GitHub project description you're linking to https://goulu.github.io/pdfminer/, but that page doesn't exist.
I looked at the psparser.py file and saw the bytesindex function, whose purpose is to replace byteobj[from:to], and more precisely byteobj[idx], so that they return the same result on Python 2 and 3.
From my understanding and experiments, the only difference between Python 2 and 3 when getting an element or a slice from a bytes object is that in Python 3 a single element is returned as an integer instead of a byte string. Slicing behaves the same in both Python 2 and 3.
The implementation of bytesindex differs from how slices work if to is a negative value. In the current implementation, if -1 (or any other negative value) is passed as to, the result runs to the end of the byte string instead of to the end minus one byte (or minus the given number of bytes). Because of that implementation detail, every use of bytesindex that needs all bytes up to the end of the byte string is misleading, because it passes -1 as the argument.
Using a proper slice instead of the function could improve performance by reducing function calls. It would also make it more obvious to the reader exactly what data is taken from the byte string.
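The indexing and slicing behaviour described above can be checked with plain Python 3 bytes objects. This is only an illustration of the language semantics; it does not call bytesindex, and the variable names are made up for the example.

```python
data = b"abcdef"

# Indexing a bytes object in Python 3 yields an int, not a one-byte string.
single = data[0]

# A one-element slice yields bytes on both Python 2 and Python 3.
one_byte = data[0:1]

# Slicing to the end of the byte string: omit the upper bound.
tail = data[2:]

# A negative upper bound in a real slice stops before the end.
tail_minus_one = data[2:-1]
```

With ordinary slices, "everything to the end" is written data[2:], so a -1 upper bound keeps its usual meaning of "stop one byte early".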
Running pdf2txt.py on the attached PDF crashes with an attribute error in recently added code, commit 82af7f0 (see #56).
bash-3.2$ python3 /usr/local/bin/pdf2txt.py 175.pdf
INFO:pdfminer.pdfdocument:xref found: pos=b'774066'
INFO:pdfminer.pdfdocument:read_xref_from: start=774066, token=/b'xref'
INFO:pdfminer.pdfdocument:xref objects: {2: (None, 9, 0), 3: (None, 400798, 0), 4: (None, 400895, 0), 5: (None, 773855, 0), 6: (None, 401082, 0), 7: (None, 773571, 0), 8: (None, 773668, 0), 9: (None, 773919, 0), 10: (None, 773970, 0)}
INFO:pdfminer.pdfdocument:trailer: {'Size': 10, 'Root': <PDFObjRef:8>, 'Info': <PDFObjRef:9>}
INFO:pdfminer.pdfdocument:trailer: {'Size': 10, 'Root': <PDFObjRef:8>, 'Info': <PDFObjRef:9>}
Traceback (most recent call last):
File "/usr/local/bin/pdf2txt.py", line 129, in <module>
if __name__ == '__main__': sys.exit(main())
File "/usr/local/bin/pdf2txt.py", line 124, in main
outfp = extract_text(**vars(A))
File "/usr/local/bin/pdf2txt.py", line 64, in extract_text
pdfminer.high_level.extract_text_to_fp(fp, **locals())
File "/usr/local/lib/python3.6/site-packages/pdfminer/high_level.py", line 81, in extract_text_to_fp
check_extractable=True):
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfpage.py", line 121, in get_pages
doc = PDFDocument(parser, password=password, caching=caching)
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 579, in __init__
self.info.append(dict_value(trailer['Info']))
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 164, in dict_value
x = resolve1(x)
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 84, in resolve1
x = x.resolve(default=default)
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 71, in resolve
return self.doc.getobj(self.objid)
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 689, in getobj
obj = self._getobj_parse(index, objid)
File "/usr/local/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 655, in _getobj_parse
while kwd is not self.KEYWORD_OBJ:
AttributeError: 'PDFDocument' object has no attribute 'KEYWORD_OBJ'
I'm running into an issue with pdfminer trying to import settings from Django. I have my virtualenv configured to use the system site-packages so I don't have to constantly recompile packages like numpy. I also happen to have Django installed in my system site-packages directory. Even though my project doesn't need/use Django, pdfminer still tries to access django.conf.settings.PDF_MINER_IS_STRICT.
It would be nice to have a way to ignore Django if it's installed but not actually used.
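One way the request above could be approached is to treat Django as optional and fall back to a default when it is missing or not configured. The following is a minimal sketch under that assumption: the helper name load_strict_flag and its default parameter are hypothetical and are not how pdfminer currently reads the setting.

```python
def load_strict_flag(default=True):
    """Return PDF_MINER_IS_STRICT from Django settings, or a default.

    Hypothetical helper: falls back to `default` when Django is absent,
    installed but not configured, or does not define the setting.
    """
    try:
        from django.conf import settings
        # Accessing an attribute on unconfigured settings raises an
        # exception, which is caught below together with ImportError.
        return getattr(settings, "PDF_MINER_IS_STRICT", default)
    except Exception:
        return default


STRICT = load_strict_flag(default=False)
```

With this shape, a project that merely has Django installed in site-packages never needs a configured settings module.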
The .six fork adds extra quotes to PSLiteral.__repr__:
ipdb> from pdfminer.psparser import PSLiteral
ipdb> PSLiteral("Name")
/'Name'
... where regular pdfminer would just print /Name.
This seems to be because of this line, which switched from using '/%s' to '/%r'.
Should be a one-character fix, unless there's some reason using %r is important?
(Seems like a minor issue, I know, but pdfquery uses __repr__ for serializing PDFs, so this becomes a blocker for py3 support.)
Thanks!
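The difference between the two format specifiers can be reproduced without pdfminer at all. The snippet below only demonstrates standard Python string formatting; the variable names are illustrative.

```python
name = "Name"

# %s inserts str(name), so no quotes appear.
with_s = '/%s' % name

# %r inserts repr(name), which wraps the string in quotes.
with_r = '/%r' % name
```

Here with_s is the bare /Name form and with_r is the quoted form shown in the ipdb session above.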
Hey-lo,
I'm building a version of PDFMiner.six using conda for conda-forge. When possible, we try to include a link to the license file in the meta.yaml specification; doing so requires both:
MANIFEST.in file.
Would you consider adding a copy of the license to the bundle and updating MANIFEST.in to include it?
I am confused about how dumppdf handles the outline. I have read the notes and guidelines, but it doesn't work for me: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167. It might have a similar cause to #74. Can you explain a bit more? @goulu, assuming the miner has found a block of text, how can we use the bookmark command to locate which character in the block the bookmark points to?
I'm receiving this error when working with certain PDFs. Because of the nature of the data I'm working with, I'm not at liberty to post a sample file but I've had the same issue with several files in the data set I'm working with.
File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 852, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 864, in render_contents
self.execute(list_value(streams))
File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 875, in execute
(_, obj) = parser.nextobject()
File "/usr/local/lib/python3.5/dist-packages/pdfminer/psparser.py", line 583, in nextobject
(pos, token) = self.nexttoken()
File "/usr/local/lib/python3.5/dist-packages/pdfminer/psparser.py", line 509, in nexttoken
self.fillbuf()
File "/usr/local/lib/python3.5/dist-packages/pdfminer/pdfinterp.py", line 248, in fillbuf
if self.charpos < len(self.buf):
TypeError: unorderable types: tuple() < int()
Printing the self.charpos variable immediately before that comparison line shows a bunch of integer output as expected and then this right before the error:
(<bound method PSBaseParser._parse_comment of <PDFContentParser: <_io.BytesIO object at 0x7f67996495c8>, bufpos=8192>>, 4096)
Hi,
We're currently in the process of upgrading a codebase from Python 2 to 3, and running into the bug in #15.
This has been fixed in master, and an RC with a release date of Jan. 19th was created. When can the release be expected? For the time being, I'll run off a known-good commit, but I'd prefer to install from PyPI at all times.
I've got a PDF (can't share it because of sensitive information) that fails during page creation via pdfminer.pdfpage.PDFPage.create_pages because it returns an empty iterable. Relevant code with my patch:
@classmethod
def create_pages(klass, document, debug=0):
    def search(obj, parent):
        if isinstance(obj, int):
            objid = obj
            tree = dict_value(document.getobj(objid)).copy()
        else:
            objid = obj.objid
            tree = dict_value(obj).copy()
        for (k, v) in parent.iteritems():
            if k in klass.INHERITABLE_ATTRS and k not in tree:
                tree[k] = v
        # FIXME: wrong case?
        tree_type = tree.get('Type', tree.get('type'))
        if tree_type is LITERAL_PAGES and 'Kids' in tree:
            if 1 <= debug:
                print >>sys.stderr, 'Pages: Kids=%r' % tree['Kids']
            for c in list_value(tree['Kids']):
                for x in search(c, tree):
                    yield x
        elif tree_type is LITERAL_PAGE:
            if 1 <= debug:
                print >>sys.stderr, 'Page: %r' % tree
            yield (objid, tree)
    pages = False
    if 'Pages' in document.catalog:
        for (objid, tree) in search(document.catalog['Pages'], document.catalog):
            yield klass(document, objid, tree)
            pages = True
    if not pages:
        # fallback when /Pages is missing.
        for xref in document.xrefs:
            for objid in xref.get_objids():
                try:
                    obj = document.getobj(objid)
                    if isinstance(obj, dict) and obj.get('Type') is LITERAL_PAGE:
                        yield klass(document, objid, obj)
                except PDFObjectNotFound:
                    pass
    return
The relevant bit is tree_type = tree.get('Type', tree.get('type')): the actual PDF stream has a lowercase /type instead of the expected /Type, causing the generator to never yield, which in turn causes StopIteration in pdfquery.
According to the spec (1.7, page 57-58), this is valid and /Type is a different name object than /type. However, in this case, the meaning is the same, and probably the PDF generator is the 'offending' root cause here.
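The fallback lookup in the patch can be tried in isolation on plain dictionaries. This is a small sketch of the same dict.get pattern; the function name page_tree_type and the sample values are made up for the example, and real page trees hold PSLiteral objects instead of strings.

```python
def page_tree_type(tree):
    """Return the value stored under 'Type', or under 'type' as a fallback."""
    # The inner get() is evaluated first; it only takes effect when
    # the properly cased 'Type' key is missing.
    return tree.get('Type', tree.get('type'))


well_formed = {'Type': 'Page'}
lowercase_only = {'type': 'Page'}
neither = {'Kids': []}

results = [page_tree_type(t) for t in (well_formed, lowercase_only, neither)]
```

The properly cased key always wins when both are present, so well-formed documents are read exactly as before.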
I am using Python 2.7.10. When running the following code I get a UnicodeDecodeError.
This is the code:
`
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from StringIO import StringIO
rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
fp = file('c:\users\public\data\pdfs\policy.pdf', 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 1
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
`
And this is the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 790, in runfile
execfile(filename, namespace)
File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 77, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Public/Public Software/WinPython32/python-2.7.10/Scripts/pdfextract.py", line 28, in <module>
text = retstr.getvalue()
File "C:\Users\Public\Public Software\WinPython32\python-2.7.10\lib\StringIO.py", line 272, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
After some digging, turns out the problem is that in PDFConverter the following thing happens:
`class PDFConverter(PDFLayoutAnalyzer):
    def __init__(self, rsrcmgr, outfp, codec='utf-8', pageno=1, laparams=None):
        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
        self.outfp = outfp
        self.codec = codec
        if hasattr(self.outfp, 'mode'):
            if 'b' in self.outfp.mode:
                self.outfp_binary = True
            else:
                self.outfp_binary = False
        else:
            import io
            if isinstance(self.outfp, io.BytesIO):
                self.outfp_binary = True
            elif isinstance(self.outfp, io.StringIO):
                self.outfp_binary = False
            else:
                try:
                    self.outfp.write(u"é")
                    self.outfp_binary = False
                except TypeError:
                    self.outfp_binary = True
        return`
As I am using StringIO from the StringIO module, the buflist in my StringIO object ends up with the u'é' entry, which is of unicode type. Later, when the code writes from the PDF to this buffer, it writes str values. This mixing causes StringIO to throw a UnicodeDecodeError when it tries to join them all (in the getvalue() call).
I'm not that experienced with Python, but I managed to get it working by replacing that particular line with:
self.outfp.write(u"é".encode(codec,'ignore'))
But maybe this defeats the purpose of the line (?).
I found a post on StackOverflow with some information that I thought was relevant:
http://stackoverflow.com/questions/5701372/what-caused-this-traceback
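The detection logic quoted above can be sketched as a standalone function. This is a hedged Python 3 illustration, not pdfminer's code: the names is_binary_stream and BytesOnlySink are hypothetical, and probing with an empty string instead of u"é" is an assumption made here so that the probe leaves nothing behind in the output stream.

```python
import io


def is_binary_stream(outfp):
    """Guess whether a file-like object expects bytes (True) or text (False)."""
    mode = getattr(outfp, "mode", None)
    if isinstance(mode, str):
        return "b" in mode
    if isinstance(outfp, io.BytesIO):
        return True
    if isinstance(outfp, io.StringIO):
        return False
    try:
        # Probe with an empty text string so nothing is added to the output.
        outfp.write(u"")
        return False
    except TypeError:
        return True


class BytesOnlySink:
    """Hypothetical file-like object that rejects anything but bytes."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        if not isinstance(data, bytes):
            raise TypeError("bytes required")
        self.chunks.append(data)


sink = BytesOnlySink()
sink_is_binary = is_binary_stream(sink)
```

After the probe, sink.chunks is still empty, which is the property the original u"é" probe lacks for streams that accept text.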
Hi,
Is there a planned date for publishing a new release on pip?