I am using Python 2.7.10. When running the following code I get a UnicodeDecodeError.


UnicodeDecodeError about pdfminer.six (closed)

pdfminer commented on May 19, 2024

UnicodeDecodeError

from pdfminer.six.

Comments (3)

vstoykov commented on May 19, 2024

Hello @AlexandreGalois, may I suggest that you edit your issue description so the code is more readable? You can use a fenced code block:

```
Some code here
```

Then, to fix your code, try importing StringIO from the io module:

from io import StringIO

Do not use the cStringIO or StringIO modules; they are for Python 2.5 compatibility. From Python 2.6 onward the io module is preferable, because it is the only one of these modules that exists in Python 3.
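As a minimal sketch of the distinction (not taken from the original report): io.StringIO is a text buffer and rejects byte strings, while io.BytesIO holds bytes that have to be decoded explicitly. The sample string is made up for illustration.

```python
import io

# io.StringIO is a text buffer: it accepts unicode text only.
text_buf = io.StringIO()
text_buf.write(u"caf\u00e9")
text_value = text_buf.getvalue()

# Writing a byte string to io.StringIO raises TypeError.
try:
    io.StringIO().write(u"caf\u00e9".encode("utf-8"))
    rejected_bytes = False
except TypeError:
    rejected_bytes = True

# io.BytesIO is the binary counterpart; decode explicitly to get text back.
byte_buf = io.BytesIO()
byte_buf.write(u"caf\u00e9".encode("utf-8"))
decoded_value = byte_buf.getvalue().decode("utf-8")
```

So the choice of buffer has to match the type of data the library actually writes.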

goulu commented on May 19, 2024

@AlexandreGalois is it OK? Please reopen if not.

michi88 commented on May 19, 2024

Ran into this issue as well.

When passing io.StringIO in Python 2, self.outfp_binary is False, but text in write_text is six.binary_type, so the write fails.

class TextConverter(PDFConverter):

    def __init__(self, rsrcmgr, outfp, codec='utf-8', pageno=1, laparams=None,
                 showpageno=False, imagewriter=None):
        PDFConverter.__init__(self, rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)
        self.showpageno = showpageno
        self.imagewriter = imagewriter
        return

    def write_text(self, text):
        text = utils.compatible_encode_method(text, self.codec, 'ignore')
        if six.PY3 and self.outfp_binary:
            text = text.encode()
        self.outfp.write(text)
        return

...

It only works with BytesIO.

The text should probably be decoded when the output is not binary (that is, when self.outfp_binary is False).
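One way to read that suggestion, as a hedged sketch rather than pdfminer.six's actual code: convert the text to whatever type the stream expects before writing. The helper `write_text_compat` and its parameters are hypothetical names introduced here for illustration only.

```python
import io


def write_text_compat(outfp, text, codec="utf-8", outfp_binary=False):
    """Hypothetical helper: write text to outfp in the type the stream expects."""
    if outfp_binary:
        # Binary stream: make sure we hand over bytes.
        if not isinstance(text, bytes):
            text = text.encode(codec, "ignore")
    else:
        # Text stream: decode byte strings before writing.
        if isinstance(text, bytes):
            text = text.decode(codec, "ignore")
    outfp.write(text)


# Bytes written to a text stream are decoded first.
text_out = io.StringIO()
write_text_compat(text_out, u"caf\u00e9".encode("utf-8"))

# Text written to a binary stream is encoded first.
binary_out = io.BytesIO()
write_text_compat(binary_out, u"caf\u00e9", outfp_binary=True)
```

With a check like this, both io.StringIO and io.BytesIO targets would receive data of the matching type.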

A hack to get it working:

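import io
import pdfminer.high_level

def extract_text_from_pdf(pdf_file):
    # pdf_file is a file object opened in binary mode.
    # Write into a bytes buffer, then decode the result explicitly.
    out = io.BytesIO()
    pdfminer.high_level.extract_text_to_fp(pdf_file, out, codec='utf-8')
    out.seek(0)
    return out.read().decode('utf-8')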
