When I tried: python3.2 html2text.py "http://salonkritik.net/" ...

Unicode Decode Error: about html2text HOT 3 OPEN

aaronsw commented on July 23, 2024

Unicode Decode Error:

from html2text.

Comments (3)

stefanor commented on July 23, 2024

$ curl -s http://salonkritik.net/ | isutf8
stdin: line 85, char 1, byte offset 30: invalid UTF-8 code
$ curl -s -I http://salonkritik.net/ | grep Content-Type
Content-Type: text/html
$ curl -s http://salonkritik.net/ | grep Content-Type
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

So the site doesn't claim to be UTF-8; we are just assuming it, because feedparser thinks it's us-ascii. The bug here is in feedparser (or in html2text trusting feedparser to guess the content type). Passing "iso-8859-1" as the optional encoding parameter will easily work around this for this particular site.
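To illustrate why the wrong assumption fails, here is a minimal sketch. The sample string is made up for illustration and is not taken from the site. It encodes an accented character as iso-8859-1, then tries both decodings:

```python
# Hypothetical sample text; any non-ASCII latin-1 character shows the effect.
raw = 'caf\u00e9'.encode('iso-8859-1')   # the accented letter becomes the single byte 0xE9

error = None
try:
    # 0xE9 starts a multi-byte sequence in UTF-8, so a lone 0xE9 is invalid.
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the charset the page actually declares succeeds.
text = raw.decode('iso-8859-1')

print(type(error).__name__)
print(text == 'caf\u00e9')
```

This is the same kind of failure that isutf8 reports above: bytes that are valid iso-8859-1 but not valid UTF-8.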

leesei commented on July 23, 2024

It is a little bit more complicated:

encoding is set to 'utf-8' by default, so charset detection is disabled (see the "# process input" code block).
I've installed feedparser (5.1.3) and it has no attribute named _getCharacterEncoding, so the import raises an error and the code always falls back to 'utf-8' (the lambda function).

In response to GgVvTt's question: as stefanor mentioned, you HAVE TO specify an encoding to get rid of the error:
python html2text.py "http://salonkritik.net/" "iso-8859-1"
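The fallback described above can be sketched roughly as follows. This is an illustration of the pattern, not a copy of the html2text source: the name guess_encoding and the lambda's signature and return value are assumptions.

```python
try:
    # Private helper that feedparser 5.1.3 does not provide (per the comment
    # above); importing a missing name from a module raises ImportError,
    # and so does a missing feedparser package.
    from feedparser import _getCharacterEncoding as guess_encoding
except ImportError:
    # Assumed fallback: ignore the inputs and always answer 'utf-8'.
    guess_encoding = lambda http_headers, data: ('utf-8', None)

print(callable(guess_encoding))
```

Because the except branch swallows the failed import, nothing signals that detection is no longer happening; every page is simply treated as UTF-8 unless an encoding is passed explicitly.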

mcepl commented on July 23, 2024

There are some things I would note about this issue:

feedparser._getCharacterEncoding is gone and we should get rid of it (and we will have one less dependency, yay!). Using a private method of any package was never a bright idea anyway.
The default character encoding for HTTP is latin-1, not us-ascii. In this Aaron presented himself as a typical American, I am afraid.
I don't know whether the stdlib has something equivalent to the function get_char_encoding shown below. If it does, we should certainly use it. Otherwise, I believe my function could be a pretty reasonable resolution of the situation (BTW, if you run the script, you will find that utf-8 has mostly won already, so this issue is going to become less and less important).
If checking the charset parameter is not enough, we can go all the way to http://www.w3.org/International/questions/qa-html-encoding-declarations

#!/usr/bin/python3

import urllib.request
import cgi
import logging
logging.basicConfig(format='%(levelname)s:%(funcName)s:%(message)s',
                    level=logging.INFO)


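def get_char_encoding(in_url):
    """Return the charset declared in the Content-Type header of in_url."""
    # A HEAD request is enough: only the response headers are needed.
    req = urllib.request.Request(url=in_url, method='HEAD')
    with urllib.request.urlopen(req) as f:
        if f.status == 200:
            ct_header = f.getheader('Content-Type')
            logging.debug('raw Content-Type header: {}'.format(ct_header))
            if ct_header is not None:
                # parse_header returns (main value, dict of parameters).
                # Note: the cgi module was removed in Python 3.13.
                _, params = cgi.parse_header(ct_header)
                logging.debug('params = {}'.format(params))

                if 'charset' in params:
                    return params['charset'].lower()

    # No charset declared: fall back to latin-1 as the HTTP default.
    return 'iso-8859-1'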

for url in ['http://salonkritik.net/',
            'http://www.ihned.cz',
            'http://www.w3.org',
            'http://www.yandex.ru/',
            'https://www.microsoft.co.jp',
            'http://hk.qq.com/',
            'http://www.haaretz.co.il/',
            'https://th.wikipedia.org/',
            'http://www.maxboard.co.kr/']:
    print('{} has encoding {}'.format(url, get_char_encoding(url)))
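
On the stdlib question: one possible approach, sketched here under assumptions, is to let email.message.Message parse the Content-Type value (its get_content_charset() method returns the lowercased charset parameter, or None) and, when the header carries no charset, look for a meta declaration near the start of the body. The function name declared_charset, the 1024-byte window, and the regular expression are illustrative choices, not part of html2text.

```python
import re
from email.message import Message

# Loose pattern for <meta charset="..."> and <meta http-equiv ... charset=...>.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)',
                           re.IGNORECASE)


def declared_charset(content_type, body, default='iso-8859-1'):
    """Return the charset from the header, else from a meta tag, else default."""
    if content_type:
        msg = Message()
        msg['Content-Type'] = content_type
        charset = msg.get_content_charset()   # lowercased, or None if absent
        if charset:
            return charset

    # Assumption: only the first 1024 bytes are scanned for a declaration.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode('ascii').lower()

    return default


# Header without charset plus the meta tag quoted earlier in this thread.
page = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
print(declared_charset('text/html', page))
```

The same Message object is available as the headers attribute of the response returned by urllib.request.urlopen, so the header part can also be read directly from a live response.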
