jaepil / pdfminer3k
This project was forked from euske/pdfminer.
Python 3 port of pdfminer
See docs/index.html. pytest is needed to run the tests in the 'tests' folder.
I am using Anaconda and installed pdfminer3k from conda-forge.
Error:
runfile('C:/Phoenix/Python/listpdfsandcountwords.py', wdir='C:/Phoenix/Python')
Traceback (most recent call last):
File "", line 1, in
runfile('C:/Phoenix/Python/listpdfsandcountwords.py', wdir='C:/Phoenix/Python')
File "C:\Work\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "C:\Work\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Phoenix/Python/listpdfsandcountwords.py", line 14, in <module>
from pdfminer.pdfpage import PDFPage
ModuleNotFoundError: No module named 'pdfminer.pdfpage'
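Before looking at the environment, it may help to confirm which pdfminer distribution the interpreter actually sees: pdfminer3k does not ship a pdfminer.pdfpage module (that layout belongs to the newer pdfminer / pdfminer.six), which would explain this exact error. A minimal stdlib-only probe (the helper name has_module is illustrative):

```python
import importlib.util

def has_module(name):
    """True if `name` is importable, checked without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # a parent package in the dotted name is missing entirely
        return False

# pdfminer.six provides pdfminer.pdfpage; pdfminer3k does not,
# so this check helps tell the two installs apart
print(has_module("pdfminer.pdfpage"))
```

If this prints False while `import pdfminer` works, the installed package is likely pdfminer3k and the script's imports target the wrong distribution.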
Conda Environment:
(C:\Work) C:\Users\dparamanand>conda info
Current conda install:
platform : win-64
conda version : 4.3.29
conda is private : False
conda-env version : 4.3.29
conda-build version : 3.0.27
python version : 3.6.3.final.0
requests version : 2.18.4
root environment : C:\Work (writable)
default environment : C:\Work
envs directories : C:\Work\envs
C:\Users\dparamanand\AppData\Local\conda\conda\envs
C:\Users\dparamanand\.conda\envs
package cache : C:\Work\pkgs
C:\Users\dparamanand\AppData\Local\conda\conda\pkgs
channel URLs : https://repo.continuum.io/pkgs/main/win-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/win-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/win-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/win-64
https://repo.continuum.io/pkgs/pro/noarch
https://repo.continuum.io/pkgs/msys2/win-64
https://repo.continuum.io/pkgs/msys2/noarch
config file : C:\Users\dparamanand\.condarc
netrc file : None
offline mode : False
user-agent : conda/4.3.29 requests/2.18.4 CPython/3.6.3 Windows/10 Windows/10.0.16299
administrator : False
Code:
"""
Created on Fri Sep 29 10:43:29 2017
@author: dpar0004
"""
import os
# for reading the pdf
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from nltk.corpus import stopwords
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import QuadgramCollocationFinder
# for counting the sentences and words
import nltk
from collections import Counter
# for counting the most frequent words
import re

def convert(filename, pages=None):
    pagenums = set(pages) if pages else set()
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(filename, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text

pdfFiles = []
dir_name = r'C:\Phoenix\Documents from Bryan'
for filename in os.listdir(dir_name):
    # case-insensitive check instead of a long chain of endswith() calls
    if filename.lower().endswith('.pdf'):
        pdfFiles.append(filename)
        text = convert(os.path.join(dir_name, filename))
        sentence_count = len(nltk.tokenize.sent_tokenize(text))
        word_count = len(nltk.tokenize.word_tokenize(text))
        print('\nThe file ', filename, ' has ', word_count, 'words and ', sentence_count, ' sentences in it.\n')
        # use findall for counting the most common words, quadgrams, trigrams
        all_text = re.findall(r'\w+', text)
        all_text = map(lambda x: x.lower(), all_text)
        filtered_words = list(filter(lambda word: word not in stopwords.words('english') and word.isalpha(), all_text))
        word_counts = Counter(filtered_words).most_common(20)
        print('The 20 most commonly occurring words in this file are: \n\n', word_counts)
        print('\nThe 10 most common 3-word combinations appearing in this file are: \n')
        trigram = TrigramCollocationFinder.from_words(filtered_words)
        print(sorted(trigram.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10])
        fourgrams = QuadgramCollocationFinder.from_words(filtered_words)
        print('\nThe 10 most common 4-word combinations appearing in this file are: \n')
        print(sorted(fourgrams.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10])
        print('----------------------------------------------------------------------------------------------------')
A separate failure mode when opening some PDFs is a KeyError on the trailer's 'ID' entry:
pdfFile.set_parser(parser)
File "C:\Users\Administrator\Envs\artcle\lib\site-packages\pdfminer\pdfparser.py", line 431, in set_parser
self.encryption = (list_value(trailer['ID']), dict_value(trailer['Encrypt']))
KeyError: 'ID'
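The failing line indexes trailer['ID'] unconditionally, but real-world PDFs sometimes omit the /ID entry even though the spec expects it alongside /Encrypt. A hedged sketch of the defensive pattern (read_encryption and the fallback ID are illustrative, not pdfminer API):

```python
def read_encryption(trailer):
    """Return (doc_id, encrypt_dict), or None when the file is unencrypted."""
    if 'Encrypt' not in trailer:
        return None
    # .get() with a fallback instead of trailer['ID'], since /ID may be absent
    doc_id = trailer.get('ID', [b'', b''])
    return (doc_id, trailer['Encrypt'])

print(read_encryption({'Encrypt': {'V': 1}}))
```

With this pattern a trailer that carries /Encrypt but no /ID no longer raises KeyError; whether decryption then succeeds depends on the file.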
When extracting text from a PDF (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I got a lot of warnings and the extraction failed.
My code is:
import os
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

def parse(path, target):
    if os.path.exists(target):
        os.remove(target)
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize()
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(all_texts=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in doc.get_pages():  # doc.get_pages() yields the page list
        interpreter.process_page(page)
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                with open(target, 'a', encoding='utf-8') as f:
                    f.write(x.get_text() + '\n')
    fp.close()

if __name__ == '__main__':
    path = r'./pdf/aa3061-05.pdf'
    parse(path, path.replace('.pdf', '.txt'))
the warnings:
......
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
......
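Those "undefined" warnings mean the converter could not map certain glyphs of the embedded Type 1 fonts (txsy, txex) to Unicode; extraction of the remaining text can still proceed. If the noise itself is the problem, the logger can be quieted (assuming pdfminer uses module-name loggers, as the WARNING:pdfminer.converter: prefix suggests):

```python
import logging

# the "WARNING:pdfminer.converter:undefined: ..." prefix shows the logger
# name, so raising its threshold silences only those per-glyph messages
logging.getLogger("pdfminer.converter").setLevel(logging.ERROR)
```

This hides the symptom only; unmapped glyphs will still come out as missing or placeholder characters in the extracted text.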
On converting PDF to XML using pdf2txt.py, if the script encounters a \ (backslash) in the PDF document, then it doesn't output anything after it in the converted XML file.
Test case: this is the PDF document, and below is its output on running pdf2txt.py -A -o output.xml -t xml backslash.pdf.
Root-level logging is still present in pdfminer.psparser.nextobject():
logging.debug('do_keyword: pos=%r, token=%r, stack=%r', pos, token, self.curstack)
I guess it was not intended :)
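The usual fix is to route such a call through a module-scoped logger rather than the root logger, so applications can control pdfminer's verbosity per module. A minimal sketch of the pattern (the wrapper function and sample arguments are illustrative):

```python
import logging

# module-level logger, conventionally named after the module itself
log = logging.getLogger("pdfminer.psparser")

def do_keyword_debug(pos, token, curstack):
    # same lazy %-style message as the original call, but on the module logger
    log.debug('do_keyword: pos=%r, token=%r, stack=%r', pos, token, curstack)
```

Callers can then silence just this module with logging.getLogger("pdfminer.psparser").setLevel(logging.WARNING) without touching the root logger's configuration.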
maximum recursion depth exceeded error when using pdfminer3k
Here is my code:
# imports needed by the snippet (io_open aliases io.open, as used below)
from io import StringIO, open as io_open
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def readPDF(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    with io_open(path, 'rb') as pdfFile:
        process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    filename = path.replace('pdf', 'txt')
    with open(filename, 'w') as f:
        f.write(content)
This is the error received:
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 992, in emit
msg = self.format(record)
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 838, in format
return fmt.format(record)
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 575, in format
record.message = record.getMessage()
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 338, in getMessage
msg = msg % self.args
File "/usr/local/python3/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 132, in __repr__
return '<PDFStream(%r): raw=%d, %r>' % (self.objid, len(self.rawdata), self.attrs)
RecursionError: maximum recursion depth exceeded while calling a Python object
Call stack:
File "run_history.py", line 11, in <module>
cmdline.execute("scrapy crawl sse_listedinfo_announcement -a begin_date={0} -a end_date={1} -a path={2}".format(begin_date, end_date, path).split())
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
Fatal Python error: Cannot recover from stack overflow:
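The failing frame is PDFStream's __repr__ being formatted for a log record, and the runaway recursion suggests a stream whose attributes ultimately reference the stream itself. Since %-style logging arguments are only repr()'d when a record actually passes the level check, raising pdfminer's log level above DEBUG can sidestep the crash (a workaround sketch for the symptom, assuming the recursion is triggered from a debug log call as the traceback shows, not a fix for the cyclic reference):

```python
import logging

# lazy %-formatting means log arguments are only repr()'d if the record
# passes the level check, so raising the threshold avoids the recursive
# __repr__ ever being invoked
logging.getLogger("pdfminer").setLevel(logging.WARNING)
```

If warnings must stay visible, an alternative is attaching a handler whose level filters out DEBUG records before formatting; either way the underlying self-referential object graph in the PDF remains.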