jaepil / pdfminer3k
This project was forked from euske/pdfminer.
Python 3 port of pdfminer
See docs/index.html. pytest is needed to run the tests in the 'tests' folder.
I am using Anaconda and installed pdfminer3k from conda-forge.
Error:
runfile('C:/Phoenix/Python/listpdfsandcountwords.py', wdir='C:/Phoenix/Python')
Traceback (most recent call last):
File "", line 1, in
runfile('C:/Phoenix/Python/listpdfsandcountwords.py', wdir='C:/Phoenix/Python')
File "C:\Work\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "C:\Work\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Phoenix/Python/listpdfsandcountwords.py", line 14, in <module>
from pdfminer.pdfpage import PDFPage
ModuleNotFoundError: No module named 'pdfminer.pdfpage'
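Before looking at the environment, it may help to confirm which pdfminer distribution the interpreter actually sees: pdfminer3k does not ship a pdfminer.pdfpage module (that layout belongs to the newer pdfminer / pdfminer.six), which would explain this exact error. A minimal stdlib-only probe (the helper name has_module is illustrative):

```python
import importlib.util

def has_module(name):
    """True if `name` is importable, checked without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # a parent package in the dotted name is missing entirely
        return False

# pdfminer.six provides pdfminer.pdfpage; pdfminer3k does not,
# so this check helps tell the two installs apart
print(has_module("pdfminer.pdfpage"))
```

If this prints False while `import pdfminer` works, the installed package is likely pdfminer3k and the script's imports target the wrong distribution.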
Conda Environment:
(C:\Work) C:\Users\dparamanand>conda info
Current conda install:
platform : win-64
conda version : 4.3.29
conda is private : False
conda-env version : 4.3.29
conda-build version : 3.0.27
python version : 3.6.3.final.0
requests version : 2.18.4
root environment : C:\Work (writable)
default environment : C:\Work
envs directories : C:\Work\envs
C:\Users\dparamanand\AppData\Local\conda\conda\envs
C:\Users\dparamanand\.conda\envs
package cache : C:\Work\pkgs
C:\Users\dparamanand\AppData\Local\conda\conda\pkgs
channel URLs : https://repo.continuum.io/pkgs/main/win-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/win-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/win-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/win-64
https://repo.continuum.io/pkgs/pro/noarch
https://repo.continuum.io/pkgs/msys2/win-64
https://repo.continuum.io/pkgs/msys2/noarch
config file : C:\Users\dparamanand\.condarc
netrc file : None
offline mode : False
user-agent : conda/4.3.29 requests/2.18.4 CPython/3.6.3 Windows/10 Windows/10.0.16299
administrator : False
Code:
"""
Created on Fri Sep 29 10:43:29 2017
@author: dpar0004
"""
import os
# for reading the pdf
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from nltk.corpus import stopwords
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import QuadgramCollocationFinder
# for counting the sentences and words
import nltk
from collections import Counter
# for counting the most frequent words
import re

def convert(filename, pages=None):
    pagenums = set(pages) if pages else set()
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(filename, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text

pdfFiles = []
dir_name = r'C:\Phoenix\Documents from Bryan'
for filename in os.listdir(dir_name):
    # case-insensitive check instead of a long chain of endswith() calls
    if filename.lower().endswith('.pdf'):
        pdfFiles.append(filename)
        text = convert(os.path.join(dir_name, filename))
        sentence_count = len(nltk.tokenize.sent_tokenize(text))
        word_count = len(nltk.tokenize.word_tokenize(text))
        print('\nThe file ', filename, ' has ', word_count, 'words and ', sentence_count, ' sentences in it.\n')
        # use findall for counting the most common words, quadgrams, trigrams
        all_text = re.findall(r'\w+', text)
        all_text = map(lambda x: x.lower(), all_text)
        filtered_words = list(filter(lambda word: word not in stopwords.words('english') and word.isalpha(), all_text))
        word_counts = Counter(filtered_words).most_common(20)
        print('The 20 most commonly occurring words in this file are: \n\n', word_counts)
        print('\nThe 10 most common 3-word combinations appearing in this file are: \n')
        trigram = TrigramCollocationFinder.from_words(filtered_words)
        print(sorted(trigram.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10])
        fourgrams = QuadgramCollocationFinder.from_words(filtered_words)
        print('\nThe 10 most common 4-word combinations appearing in this file are: \n')
        print(sorted(fourgrams.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10])
        print('----------------------------------------------------------------------------------------------------')
A separate failure mode when opening some PDFs is a KeyError on the trailer's 'ID' entry:
pdfFile.set_parser(parser)
File "C:\Users\Administrator\Envs\artcle\lib\site-packages\pdfminer\pdfparser.py", line 431, in set_parser
self.encryption = (list_value(trailer['ID']), dict_value(trailer['Encrypt']))
KeyError: 'ID'
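The failing line indexes trailer['ID'] unconditionally, but real-world PDFs sometimes omit the /ID entry even though the spec expects it alongside /Encrypt. A hedged sketch of the defensive pattern (read_encryption and the fallback ID are illustrative, not pdfminer API):

```python
def read_encryption(trailer):
    """Return (doc_id, encrypt_dict), or None when the file is unencrypted."""
    if 'Encrypt' not in trailer:
        return None
    # .get() with a fallback instead of trailer['ID'], since /ID may be absent
    doc_id = trailer.get('ID', [b'', b''])
    return (doc_id, trailer['Encrypt'])

print(read_encryption({'Encrypt': {'V': 1}}))
```

With this pattern a trailer that carries /Encrypt but no /ID no longer raises KeyError; whether decryption then succeeds depends on the file.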
When extracting text from a PDF (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I got a lot of warnings and the extraction failed.
My code is:
import os
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

def parse(path, target):
    if os.path.exists(target):
        os.remove(target)
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize()
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(all_texts=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in doc.get_pages():  # doc.get_pages() yields the page list
        interpreter.process_page(page)
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                with open(target, 'a', encoding='utf-8') as f:
                    f.write(x.get_text() + '\n')
    fp.close()

if __name__ == '__main__':
    path = r'./pdf/aa3061-05.pdf'
    parse(path, path.replace('.pdf', '.txt'))
the warnings:
......
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
......
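Those "undefined" warnings mean the converter could not map certain glyphs of the embedded Type 1 fonts (txsy, txex) to Unicode; extraction of the remaining text can still proceed. If the noise itself is the problem, the logger can be quieted (assuming pdfminer uses module-name loggers, as the WARNING:pdfminer.converter: prefix suggests):

```python
import logging

# the "WARNING:pdfminer.converter:undefined: ..." prefix shows the logger
# name, so raising its threshold silences only those per-glyph messages
logging.getLogger("pdfminer.converter").setLevel(logging.ERROR)
```

This hides the symptom only; unmapped glyphs will still come out as missing or placeholder characters in the extracted text.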
On converting PDF to XML using pdf2txt.py, if the script encounters a \ (backslash) in the PDF document, then it doesn't output anything after it in the converted XML file.
Test case: this is the PDF document, and below is its output on running pdf2txt.py -A -o output.xml -t xml backslash.pdf.
Root-level logging is still present in pdfminer.psparser.nextobject():
logging.debug('do_keyword: pos=%r, token=%r, stack=%r', pos, token, self.curstack)
I guess it was not intended :)
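The usual fix is to route such a call through a module-scoped logger rather than the root logger, so applications can control pdfminer's verbosity per module. A minimal sketch of the pattern (the wrapper function and sample arguments are illustrative):

```python
import logging

# module-level logger, conventionally named after the module itself
log = logging.getLogger("pdfminer.psparser")

def do_keyword_debug(pos, token, curstack):
    # same lazy %-style message as the original call, but on the module logger
    log.debug('do_keyword: pos=%r, token=%r, stack=%r', pos, token, curstack)
```

Callers can then silence just this module with logging.getLogger("pdfminer.psparser").setLevel(logging.WARNING) without touching the root logger's configuration.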
maximum recursion depth exceeded error when using pdfminer3k
Here is my code:
# imports needed by the snippet (io_open aliases io.open, as used below)
from io import StringIO, open as io_open
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def readPDF(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    with io_open(path, 'rb') as pdfFile:
        process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    filename = path.replace('pdf', 'txt')
    with open(filename, 'w') as f:
        f.write(content)
This is the error received:
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 992, in emit
msg = self.format(record)
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 838, in format
return fmt.format(record)
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 575, in format
record.message = record.getMessage()
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 338, in getMessage
msg = msg % self.args
File "/usr/local/python3/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 132, in __repr__
return '<PDFStream(%r): raw=%d, %r>' % (self.objid, len(self.rawdata), self.attrs)
RecursionError: maximum recursion depth exceeded while calling a Python object
Call stack:
File "run_history.py", line 11, in <module>
cmdline.execute("scrapy crawl sse_listedinfo_announcement -a begin_date={0} -a end_date={1} -a path={2}".format(begin_date, end_date, path).split())
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
Fatal Python error: Cannot recover from stack overflow:
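The failing frame is PDFStream's __repr__ being formatted for a log record, and the runaway recursion suggests a stream whose attributes ultimately reference the stream itself. Since %-style logging arguments are only repr()'d when a record actually passes the level check, raising pdfminer's log level above DEBUG can sidestep the crash (a workaround sketch for the symptom, assuming the recursion is triggered from a debug log call as the traceback shows, not a fix for the cyclic reference):

```python
import logging

# lazy %-formatting means log arguments are only repr()'d if the record
# passes the level check, so raising the threshold avoids the recursive
# __repr__ ever being invoked
logging.getLogger("pdfminer").setLevel(logging.WARNING)
```

If warnings must stay visible, an alternative is attaching a handler whose level filters out DEBUG records before formatting; either way the underlying self-referential object graph in the PDF remains.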