Comments (3)
$ curl -s http://salonkritik.net/ | isutf8
stdin: line 85, char 1, byte offset 30: invalid UTF-8 code
$ curl -s -I http://salonkritik.net/ | grep Content-Type
Content-Type: text/html
$ curl -s http://salonkritik.net/ | grep Content-Type
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
So the site doesn't claim to be UTF-8; we are just assuming it, because feedparser thinks it's us-ascii. The bug here is in feedparser (or in html2text trusting feedparser to guess the content type). Passing "iso-8859-1" as the optional encoding parameter will easily work around this for this particular site.
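To illustrate why the explicit encoding helps, here is a minimal sketch. The sample word and the helper `decode_page` are made up for illustration and are not part of html2text; the point is only that bytes produced as ISO-8859-1 are generally not valid UTF-8.

```python
# Hypothetical sample text: a word with an accented character,
# encoded the way an ISO-8859-1 page would send it.
raw = "crítica".encode("iso-8859-1")


def decode_page(data, encoding="utf-8"):
    """Decode raw page bytes; utf-8 is assumed when nothing else is given."""
    return data.decode(encoding)


try:
    decode_page(raw)  # wrong assumption: these bytes are not valid UTF-8
except UnicodeDecodeError as exc:
    print("utf-8 decoding failed:", exc.reason)

# Naming the real encoding decodes the same bytes cleanly.
print(decode_page(raw, "iso-8859-1"))
```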
from html2text.
It is a little bit more complicated:
- encoding is set to 'utf-8' as default, so charset detection is disabled (see the `# process input` code block)
- I've installed feedparser (5.1.3) and there is no attribute named `_getCharacterEncoding`, so the import raises an error and it always falls back to 'utf-8' (the lambda function)
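The fallback described above follows a common "optional import" pattern. The sketch below is not the actual html2text source: `fallback_detector`, `load_detector` and the shape of the fallback's return value are assumptions chosen for illustration. It shows how a missing private attribute turns detection into a constant answer without any visible error.

```python
def fallback_detector(headers, data):
    """Hypothetical stand-in: ignores its input and always reports utf-8."""
    return ("utf-8", 1)


def load_detector():
    """Return feedparser's private detector if it exists, else the stand-in."""
    try:
        # Raises ImportError when feedparser is not installed or when the
        # private name is no longer there (as reported for 5.1.3 above).
        from feedparser import _getCharacterEncoding
    except ImportError:
        return fallback_detector
    return _getCharacterEncoding


detect = load_detector()
print("using detector:", detect.__name__)
```

With this pattern the caller never notices that real detection is gone, which would explain why the page is always treated as utf-8 unless an encoding is passed explicitly.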
In response to GgVvTt's question: as stefanor mentioned, you HAVE TO specify an encoding to get rid of the error:
python html2text.py "http://salonkritik.net/" "iso-8859-1"
There are some things I would note about this issue:
- `feedparser._getCharacterEncoding` is gone and we should get rid of it (and we will have one less dependency, yay!). Using a private method of any package was never a bright idea anyway.
- The default character encoding for HTTP is `latin-1`, not `us-ascii`. In this Aaron presented himself as a typical American, I am afraid.
- I don't know if there is something in the stdlib equivalent to the function `get_char_encoding` shown below. If there is, we should certainly use it. Otherwise, I believe my function could be a pretty reasonable resolution of the situation (BTW, if you run the script, you find out that `utf-8` has mostly already won, so this issue is going to be less and less important).
- If checking the `charset` parameter is not enough, we can go all the way to http://www.w3.org/International/questions/qa-html-encoding-declarations
#!/usr/bin/python3
import urllib.request
import cgi  # note: deprecated in Python 3.11 and removed in 3.13
import logging

logging.basicConfig(format='%(levelname)s:%(funcName)s:%(message)s',
                    level=logging.INFO)


def get_char_encoding(in_url):
    # Only the headers are needed, so ask with HEAD instead of GET.
    req = urllib.request.Request(url=in_url, method='HEAD')
    f = urllib.request.urlopen(req)
    if f.status == 200:
        ct_header = f.getheader('Content-Type')
        logging.debug('raw Content-Type header: {}'.format(ct_header))
        if ct_header is not None:
            # parse_header returns (main value, dict of parameters)
            _, encoding = cgi.parse_header(ct_header)
            logging.debug('encoding = {}'.format(encoding))
            if 'charset' in encoding:
                return encoding['charset'].lower()
    # No charset declared in the header: fall back to the HTTP default.
    return 'iso-8859-1'


for url in ['http://salonkritik.net/',
            'http://www.ihned.cz',
            'http://www.w3.org',
            'http://www.yandex.ru/',
            'https://www.microsoft.co.jp',
            'http://hk.qq.com/',
            'http://www.haaretz.co.il/',
            'https://th.wikipedia.org/',
            'http://www.maxboard.co.kr/']:
    print('{} has encoding {}'.format(url, get_char_encoding(url)))
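On the stdlib question: the `headers` attribute of a `urllib.request.urlopen` response is an `email.message.Message` subclass, so its `get_content_charset()` method already returns the lowercased `charset` parameter (or `None`). The sketch below uses that on a bare header string so it runs without network access, and adds a very rough look at the `<meta>` declaration for the case where the header says nothing, as on salonkritik.net. The function names, the regular expression and the 1024-byte limit are my assumptions, not anything html2text does today, and the regex is far simpler than the procedure in the W3C article.

```python
import re
from email.message import Message

# Rough pattern for <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="...; charset=..."> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if content_type is None:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(body, limit=1024):
    """Look for a charset declaration near the start of the raw HTML bytes."""
    match = META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def guess_encoding(content_type, body, default="iso-8859-1"):
    """Prefer the HTTP header, then the <meta> declaration, then the default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default


# Values modelled on the curl output at the top of this thread.
page = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
print(guess_encoding("text/html", page))
```

For a live response `f` from `urllib.request.urlopen`, `f.headers.get_content_charset()` gives the header part directly, which would replace the `cgi.parse_header` call in the script above.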