
Comments (3)

stefanor commented on July 23, 2024
$ curl -s http://salonkritik.net/ | isutf8
stdin: line 85, char 1, byte offset 30: invalid UTF-8 code
$ curl -s -I http://salonkritik.net/ | grep Content-Type
Content-Type: text/html
$ curl -s http://salonkritik.net/ | grep Content-Type
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

So the site doesn't claim to be UTF-8: the HTTP header gives no charset at all, and the meta tag declares iso-8859-1. We are just assuming UTF-8, because feedparser thinks it's us-ascii. The bug here is in feedparser (or in html2text, for trusting feedparser to guess the content type). Passing "iso-8859-1" as the optional encoding parameter works around this for this particular site.
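
To illustrate why the assumed encoding matters, here is a minimal sketch. The byte string is made up for the example (it is not taken from the site); it only shows that ISO-8859-1 bytes containing an accented character are rejected by a UTF-8 decoder but decode cleanly once the right encoding is named.

```python
# Illustrative bytes only: "Crítica" encoded as ISO-8859-1 (an assumed sample,
# not content fetched from salonkritik.net).
raw = "Crítica".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xED starts a multi-byte UTF-8 sequence, but the next byte is plain ASCII.
    utf8_ok = False

# Decoding with the encoding the page actually declares succeeds.
text = raw.decode("iso-8859-1")
print(utf8_ok, text)
```

This is the same failure isutf8 reports above, just reproduced on a single word.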

from html2text.

leesei commented on July 23, 2024

It is a little bit more complicated:

  • encoding is set to 'utf-8' by default, so charset detection is disabled (see the # process input code block)
  • I've installed feedparser (5.1.3) and it has no attribute named _getCharacterEncoding, so the import raises an error and the code always falls back to 'utf-8' (the lambda function)
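
The second point can be sketched roughly as follows. This is not the html2text source; `legacy_feed_helpers` is a hypothetical module name standing in for a package whose private helper no longer exists, and the fallback signature is an assumption made for the illustration.

```python
try:
    # Hypothetical module: it does not exist, so this import fails the same
    # way an import of a removed private attribute would.
    from legacy_feed_helpers import _getCharacterEncoding as detect_encoding
except ImportError:
    # Fallback: ignore the arguments and always report UTF-8.
    def detect_encoding(http_headers, data):
        return "utf-8"

# The bytes below are ISO-8859-1, yet the fallback still answers "utf-8".
guessed = detect_encoding({"content-type": "text/html"}, b"caf\xe9")
print(guessed)
```

Because the ImportError is swallowed, nothing tells the user that detection never ran; the wrong guess only surfaces later as a decode error.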

In response to GgVvTt's question: as stefanor mentioned, you HAVE TO specify an encoding to get rid of the error:
python html2text.py "http://salonkritik.net/" "iso-8859-1"


mcepl commented on July 23, 2024

There are some things I would note about this issue:

  1. feedparser._getCharacterEncoding is gone, so we should stop using it (and we will have one less dependency, yay!). Using a private method of any package was never a bright idea anyway.
  2. The default character encoding for HTTP is latin-1, not us-ascii. In this Aaron presented himself as a typical American, I am afraid.
  3. I don't know whether the stdlib has something equivalent to the get_char_encoding function shown below. If it does, we should certainly use it. Otherwise, I believe my function could be a pretty reasonable resolution of the situation (BTW, if you run the script, you will find that utf-8 has mostly won already, so this issue is going to be less and less important).
  4. If checking the charset parameter is not enough, we can go all the way to http://www.w3.org/International/questions/qa-html-encoding-declarations
#!/usr/bin/python3

# NOTE: cgi is deprecated since Python 3.11 and removed in 3.13.
import cgi
import logging
import urllib.request

logging.basicConfig(format='%(levelname)s:%(funcName)s:%(message)s',
                    level=logging.INFO)


def get_char_encoding(in_url):
    """Return the charset declared in the Content-Type header of in_url."""
    # HEAD is enough: only the response headers are needed.
    req = urllib.request.Request(url=in_url, method='HEAD')
    with urllib.request.urlopen(req) as f:
        if f.status == 200:
            ct_header = f.getheader('Content-Type')
            logging.debug('raw Content-Type header: {}'.format(ct_header))
            if ct_header is not None:
                # parse_header returns (media type, dict of parameters).
                _, params = cgi.parse_header(ct_header)
                logging.debug('params = {}'.format(params))

                if 'charset' in params:
                    return params['charset'].lower()

    # No charset declared: fall back to the historical HTTP default.
    return 'iso-8859-1'


for url in ['http://salonkritik.net/',
            'http://www.ihned.cz',
            'http://www.w3.org',
            'http://www.yandex.ru/',
            'https://www.microsoft.co.jp',
            'http://hk.qq.com/',
            'http://www.haaretz.co.il/',
            'https://th.wikipedia.org/',
            'http://www.maxboard.co.kr/']:
    print('{} has encoding {}'.format(url, get_char_encoding(url)))
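
On the stdlib question in point 3 and the fallback in point 4, one possible sketch follows. It parses the header with email.message.Message.get_content_charset instead of cgi, and if the header names no charset it looks for a declaration in the first bytes of the HTML. The function names, the 1024-byte window, and the regular expression are assumptions for illustration, not part of html2text, and the regex is far simpler than the full algorithm browsers use.

```python
import re
from email.message import Message

# Deliberately loose pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.I)


def charset_from_header(ct_header):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not ct_header:
        return None
    msg = Message()
    msg["Content-Type"] = ct_header
    return msg.get_content_charset()


def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def guess_charset(ct_header, html_bytes, default="iso-8859-1"):
    """Prefer the HTTP header, then the <meta> declaration, then the default."""
    return (charset_from_header(ct_header)
            or charset_from_meta(html_bytes)
            or default)


# The header and meta tag shown in the first comment of this thread.
page = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
print(guess_charset("text/html", page))
```

The header is checked first because an HTTP-level charset takes precedence over an in-document declaration; the meta lookup only matters for servers like this one that send a bare text/html.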
