html2text's Introduction

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Usage: html2text.py [(filename|url) [encoding]]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevant when -g is
                        specified as well

Or you can use it from within Python:

import html2text
print(html2text.html2text("<p>Hello, world.</p>"))

Or with some configuration options:

import html2text
h = html2text.HTML2Text()
h.ignore_links = True
print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))

Originally written by Aaron Swartz. This code is distributed under the GPLv3.

How to do a release

  1. Update the version in html2text.py
  2. Update the version in setup.py
  3. Run python setup.py sdist upload

How to run unit tests

cd test/
python run_tests.py

html2text's Issues

More Licensing Options?

Would it be possible to provide an additional (non-viral) licensing option? As is, it seems this code can't be used on any commercial projects.

Use from python

I'm not sure if this is really an issue, but when I try to use html2text from a Python program it fails with: Caught NameError while rendering: global name 'options' is not defined.

What I'm trying to do is:

import html2text

def html2markdown(html):
    return html2text.html2text(html)

Is this even possible?

Adding user agent for input url

I'm new to Python and glad to have found this module, which lets me parse web pages.
I would like to suggest adding support for spoofing the user agent for HTTP sources.
Some web pages return 401 when fetched with urlopen(), e.g. http://www.google.com/patents/US5255452.
Currently I'm using another Python (2.7) script to fetch the page with a spoofed user agent and dump the output for html2text:

    import urllib2
    request = urllib2.Request(url="http://www.google.com/patents/US5255452")
    # spoof user agent
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)")
    result = urllib2.urlopen(request)
    # write result .read() to file
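
For readers on Python 3, where urllib2 was folded into urllib.request, a minimal sketch of the same workaround might look like the following. The helper names build_request and fetch_html are illustrative assumptions and not part of html2text; the returned HTML string would then be passed to html2text.html2text().

```python
import urllib.request

# Illustrative User-Agent value (an assumption); any browser-like string may do.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)"


def build_request(url, user_agent=USER_AGENT):
    """Return a urllib Request that carries a custom User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def fetch_html(url, encoding="utf-8"):
    """Download a page with the custom header and decode it to text."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode(encoding, errors="replace")
```

Keeping the request construction separate from the download makes the header logic easy to check without touching the network.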

Parse error when a strong tag is inside an a tag

The demo is

source:
<a href="1"><strong>Meteor</strong></a>
<strong><a href="1">TypeScript</a></strong> 

result:
**[Meteor**](1)
**[TypeScript](1)**

The result for the first link is wrong: the closing ** ends up inside the link's square brackets.

Line breaks in bold renders incorrect markdown

Bold spans generally should not contain line breaks or have whitespace just inside the asterisks; otherwise the asterisks are treated as literal characters. So for example:

<b>Our bold text<br /></b>

Results in:

**Our bold text\n  **

Which will not render in bold with most markdown interpreters.

Instead we should move trailing line breaks outside of the asterisks:

**Our bold text**\n  

and end and restart bold when the line break is not trailing, e.g.:

<b>Our multiline<br />bold text</b>

to

**Our multiline**\n  **bold text**
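
The proposed behaviour can be sketched as a small helper that wraps each non-empty line in its own pair of markers. This is only an illustration under stated assumptions: the function name bold_with_breaks is hypothetical, the input is assumed to be the text inside a bold element with each line break already turned into a newline, and the line-break string is written the way this issue writes it.

```python
def bold_with_breaks(inner, marker="**", line_break="\n  "):
    """Wrap each non-empty line of `inner` in its own pair of bold markers.

    `inner` is the text found inside a bold element, with every line break
    already converted to a newline character.
    """
    pieces = []
    for part in inner.split("\n"):
        text = part.strip()
        # Empty segments (e.g. after a trailing line break) get no markers.
        pieces.append(marker + text + marker if text else "")
    return line_break.join(pieces)


trailing = bold_with_breaks("Our bold text\n")
multiline = bold_with_breaks("Our multiline\nbold text")
```

Stripping each segment also covers the "whitespace on the inside" case, since the markers always touch non-space characters.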

empty href with no quotes leads to exception

With bad html of the form {a href=}foo{/a} html2text will bomb out. An empty string will work fine (e.g. {a href=''}foo{/a}). I don't know how tolerant of bad HTML you want html2text to be.

Ian

Remove backslash character before some list marker characters

This lib is awesome, but I have a small problem. I have some HTML code like this:

import html2text
s="<div>- this is text </div>"
print html2text.html2text(s)
output:
\- this is text

What is the purpose of the leading backslash in the output above? How do I remove it?

Thank you!

Paragraphs in blockquotes split text

If you have this construction

<blockquote>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</blockquote>

html2text generates this output

> Paragraph 1

>

> Paragraph 2

This should not be, since the empty line between the paragraphs breaks the blockquote...
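
One way to picture the expected output is a helper that joins the paragraphs with a line containing only ">", so the quote is never interrupted by an empty line. This is a sketch, not html2text's actual code; render_blockquote is a hypothetical name.

```python
def render_blockquote(paragraphs):
    """Join paragraphs into a single Markdown blockquote.

    Paragraphs are separated by a line holding only ">" so that no empty
    line splits the quote in two.
    """
    quoted = []
    for paragraph in paragraphs:
        lines = paragraph.strip().splitlines()
        quoted.append("\n".join("> " + line for line in lines))
    return "\n>\n".join(quoted)


quote = render_blockquote(["Paragraph 1", "Paragraph 2"])
```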

ValueError: need more than 1 value to unpack

> python html2text.py http://en.wikipedia.org/wiki/Python
Traceback (most recent call last):
  File "C:\SharedPrograms\html2text\html2text.py", line 758, in <module>
    wrapwrite(html2text(data, baseurl))
  File "C:\SharedPrograms\html2text\html2text.py", line 691, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "C:\SharedPrograms\html2text\html2text.py", line 686, in html2text_file
    h.feed(html)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 142, in goahead
    if i < j: self.handle_data(rawdata[i:j])
  File "C:\SharedPrograms\html2text\html2text.py", line 671, in handle_data
    self.style_def.update(dumb_css_parser(data))
  File "C:\SharedPrograms\html2text\html2text.py", line 177, in dumb_css_parser
    elements = dict([(a.strip(), dumb_property_dict(b)) for a, b in elements])
  File "C:\SharedPrograms\html2text\html2text.py", line 165, in dumb_property_dict
    return dict([(x.strip(), y.strip()) for x, y in [z.split(':') for z in style.split(';')]]);
ValueError: need more than 1 value to unpack
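
The last frame suggests that some piece of the style string contains no colon (for example the empty string left behind by a trailing semicolon), so splitting it yields a single value that cannot be unpacked into a pair. A more tolerant parser could skip such pieces. The sketch below is an assumption about a possible fix, not the project's actual patch, and tolerant_property_dict is a hypothetical name.

```python
def tolerant_property_dict(style):
    """Parse 'name: value; name: value' CSS declarations into a dict."""
    properties = {}
    for declaration in style.split(";"):
        # Skip empty pieces and fragments with no colon instead of raising.
        if ":" not in declaration:
            continue
        # Split on the first colon only, so values such as URLs stay whole.
        name, value = declaration.split(":", 1)
        properties[name.strip()] = value.strip()
    return properties


parsed = tolerant_property_dict("color: red; ; background: url(http://x/y.png); bogus")
```

Splitting on the first colon only also avoids the opposite failure, where a value containing a colon would produce too many values to unpack.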

Memory leaks

Hi guys,

I couldn't figure out the exact cause of the problem, but I'm using your code to process around 5,000 HTML documents and my RAM fills up quickly. I'm 100% sure that it's your code, because when I replaced it with a simple HTML tag stripper the leak was gone.

Sorry for not being more informative, but I guess it's pretty easy to set up an experiment yourselves.

Please enable cookies

When I run html2text on https://davidwalsh.name/2016s-most-important-web-apps-tools, the following error shows up:

Please enable cookies.

Error 1010 Ray ID: 272e8783e7f122e2 • 2016-02-11 08:02:03 UTC

Access denied

What happened?

The owner of this website (davidwalsh.name) has banned your access based on
your browser's signature (272e8783e7f122e2-ua48).

CloudFlare Ray ID: 272e8783e7f122e2 • Your IP: xxxx •
Performance & security by CloudFlare

processing of <pre> element results in double-spaced text

When running this example script:

#!/usr/bin/python

import html2text

inStr = """
<pre class="wiki">"addnoresponse": {
    "name": "NoRespns",
    "position": "topRow",
    "commentIdx": "noResponseString",
    "status": "CLOSED",
    "resolution": "INSUFFICIENT_DATA"
},
</pre>
"""
print html2text.html2text(inStr)

I get this:

bradford:~ $ python test-PRE-bug.py 
"addnoresponse": {

        "name": "NoRespns",

        "position": "topRow",

        "commentIdx": "noResponseString",

        "status": "CLOSED",

        "resolution": "INSUFFICIENT_DATA"

    },



bradford:~ $ 

I mean this is pretty awful. I understand that you want to make this into Markdown, but shouldn’t html2text produce something at least a bit readable? Or could we get some parameter to html2text (prettyParse=true), which would avoid this?

Code blocks in lists

I found an issue with code blocks within lists. When html2text stumbles on this HTML:

<ul>

  <li>
    <p>Split a one-file album flac file into tracks according its cue list:</p>
    <pre><code>shntool split -f album.cue -o flac album.flac</code></pre>
  </li>

  <li>
    <p>Merge several .wav file to one file named <code>merged.wav</code>:</p>
    <pre><code>sox part1.wav part2.wav part3.wav merged.wav</code></pre>
  </li>

  (...)

</ul>

it produces the following Markdown:

* Split a one-file album flac file into tracks according its cue list:

    shntool split -f album.cue -o flac album.flac

* Merge several .wav file to one file named `merged.wav`:

    sox part1.wav part2.wav part3.wav merged.wav

  (...)

While it should produce:

* Split a one-file album flac file into tracks according its cue list:

        shntool split -f album.cue -o flac album.flac

* Merge several .wav file to one file named `merged.wav`:

        sox part1.wav part2.wav part3.wav merged.wav

  (...)

I.e. html2text must add 8 spaces before a code block within a list instead of 4, else the code block will be rendered as a paragraph.
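
The indentation rule described here can be sketched with textwrap.indent from the standard library. The helper name indent_code_block and the list_depth parameter are assumptions made for illustration; they are not html2text options.

```python
import textwrap


def indent_code_block(code, list_depth=0):
    """Indent `code` as a Markdown code block at the given list nesting depth.

    Depth 0 is a top-level block (4 spaces); each list level adds 4 more.
    """
    prefix = " " * (4 + 4 * list_depth)
    return textwrap.indent(code, prefix)


in_list = indent_code_block("shntool split -f album.cue -o flac album.flac", list_depth=1)
```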

Paragraphs in lists

For this HTML:

<ul>
    <li>
        <p>Test</p>
        <p>Test</p>
    </li>
</ul>

should generate:

- Test

    Test

instead of:

- Test

Test

This is probably related to issue #17.

License

The html2text page at http://www.aaronsw.com/2002/html2text states the license as GNU GPL 3.0, but there is no COPYING file in the sources.
Can you please add the GPL 3 COPYING file to avoid legal problems when packaging html2text?

Thanks.

Long list lines do not wrap

I noticed that an

<li>with 200 characters</li> 

outputs a 200-character-long line.
I found this irritating, so I added some code to the optwrap(text) function in v3.02.

Just a fragment

WAS:

for para in text.split("\n"):
    if len(para) > 0:
        if para[0] != ' ' and para[0] != '-' and para[0] != '*':
            for line in wrap(para, BODY_WIDTH):
                result += line + "\n"
            result += "\n"
            newlines = 2
        else:
            if not onlywhite(para):
                result += para + "\n"
                newlines = 1

IS:

reList = re.compile(r'(^[ ]+[0-9]+\. )|(^[ ]+\* )')
for para in text.split("\n"):
    if len(para) > 0:
        if para[0] != ' ' and para[0] != '-' and para[0] != '*':
            for line in wrap(para, BODY_WIDTH):
                result += line + "\n"
            result += "\n"
            newlines = 2
        else:
            # Handle list item - split lines with indent under. 
            if reList.match( para ):
                indent = False
                indent_spaces = ''
                for line in wrap(para, BODY_WIDTH - 6): # -allowance for indentation pad
                    if False == indent:
                        indent = True
                        result += line + "\n"
                        # Find length to start of text for indent spacing
                        lst = reList.search(line).group()
                        indent_spaces =  ' ' * len(lst)
                    else:
                        result += indent_spaces + line + "\n"
                result += "\n"
                newlines = 1
            elif not onlywhite(para):
                result += para + "\n"
                newlines = 1
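
The same hanging-indent idea can also be expressed with the indent options of the standard textwrap module, which avoids tracking the first line by hand. The sketch below is an alternative illustration rather than the patch above; wrap_list_item, LIST_MARKER and the default width of 78 are assumptions.

```python
import re
import textwrap

# Leading indent plus an ordered ("1. ") or unordered ("* ", "- ", "+ ") marker.
LIST_MARKER = re.compile(r"^\s*(?:[0-9]+\.|[*+-])\s+")


def wrap_list_item(para, width=78):
    """Wrap one list-item line, aligning continuation lines under its text."""
    match = LIST_MARKER.match(para)
    if match is None:
        return textwrap.fill(para, width)
    marker = match.group()
    return textwrap.fill(
        para[match.end():],
        width,
        initial_indent=marker,
        subsequent_indent=" " * len(marker),
    )


wrapped = wrap_list_item("  * " + "word " * 40, width=40)
```

Because the marker is passed as initial_indent, its length counts toward the width, so no manual allowance for the indentation pad is needed.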

Not catching HTMLParser.HTMLParseError from unclosed tag

>>> import html2text
>>> html2text.__version__
'3.02'
>>> html2text.html2text('<p')
Traceback (most recent call last):
  File "", line 1, in <module>
  File "html2text.py", line 450, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "html2text.py", line 447, in html2text_file
    return h.close()
  File "html2text.py", line 185, in close
    HTMLParser.HTMLParser.close(self)
  File "C:\dev\python27\lib\HTMLParser.py", line 112, in close
    self.goahead(1)
  File "C:\dev\python27\lib\HTMLParser.py", line 164, in goahead
    self.error("EOF in middle of construct")
  File "C:\dev\python27\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 1, column 1

Is it possible to append a '>' or drop the whole tag and retry without passing the exception up?

FWIW, first parsing the HTML with BeautifulSoup eliminates the unclosed tag and html2text succeeds.

Add an option to allow pure text to be returned (ignore page breaks, etc.)

I am using html2text to store LaTeX syntax in a Google Doc and later retrieve it for processing. For this to work there cannot be any special characters in the returned text; the text needs to be returned exactly as it appears in the source text. It appears that html2text inserts "* * *" for page breaks.

Example:

html2text --google-doc --ignore-emphasis https://docs.google.com/document/d/.../pub?embedded=true

Actual Output:

% start of document

\begin{document}

* * *

% abstract

\newpage

\begin{abstract}

Abstract goes here...

\end{abstract}

* * *

% end of document

\end{document} 

Desired Output:

% start of document

\begin{document}

% abstract

\newpage

\begin{abstract}

Abstract goes here...

\end{abstract}

% end of document

\end{document} 
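
Until such an option exists, one possible workaround is to post-process the returned text and drop the "* * *" lines. The sketch below is only that, a workaround under assumptions: drop_horizontal_rules is a hypothetical helper, not an html2text feature, and it would also remove rules that come from a genuine horizontal-rule element.

```python
def drop_horizontal_rules(markdown):
    """Remove '* * *' rule lines and the single blank line that follows each."""
    kept = []
    skip_blank = False
    for line in markdown.split("\n"):
        if line.strip() == "* * *":
            skip_blank = True
            continue
        if skip_blank and not line.strip():
            # Swallow the blank line directly after a removed rule.
            skip_blank = False
            continue
        skip_blank = False
        kept.append(line)
    return "\n".join(kept)


cleaned = drop_horizontal_rules("\\begin{document}\n\n* * *\n\n% abstract")
```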

Any chance of a new release on PyPI?

All the recent changes look really good. Any chance of a new release on PyPI, or at least a version bump so it is easier to manage with pip?

setup.py file

The classifiers only go up to Python 3.2.
What does setup.py do anyway?

option to force encoding

This is necessary if input is read from stdin.
Or, to keep current syntax, use "-" for stdin.
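
A rough sketch of the "-" convention with a forced encoding, written for Python 3, could look like this. The function read_source and its stdin parameter are hypothetical and exist only to keep the example testable; they are not part of html2text's command line.

```python
import io
import sys


def read_source(name, encoding="utf-8", stdin=None):
    """Return decoded text from a file, or from stdin when `name` is "-"."""
    if name == "-":
        # sys.stdin.buffer yields raw bytes, so the caller's encoding is used.
        stream = stdin if stdin is not None else sys.stdin.buffer
        return stream.read().decode(encoding)
    with open(name, "rb") as handle:
        return handle.read().decode(encoding)


# Stand-in for piped input: Latin-1 bytes decoded with a forced encoding.
fake_stdin = io.BytesIO(b"<p>caf\xe9</p>")
html = read_source("-", encoding="iso-8859-1", stdin=fake_stdin)
```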

ignore_emphasis header

Hi,
is there a reason not to use "ignore_emphasis" for header tags (h1, h2, h3, ...), too?
Or is this just not implemented, yet?

h = html2text.HTML2Text()
h.ignore_emphasis=True
h.handle("<h1>title</h1>")
u'# title\n\n'

Arndt

Backslash getting inserted before multiple dashes

Anyone have an explanation or fix for this?

In [11]: print html2text.html2text('-').strip()
-

In [12]: print html2text.html2text('--').strip()
\--

In [13]: print html2text.html2text('------').strip()
\------

UnicodeDecode Error

r = requests.get('http://en.wikipedia.org/wiki/Monty_Python')
print html2text.html2text(r.content)
Traceback (most recent call last):
  File "", line 1, in <module>
  File "html2text.py", line 812, in html2text
    return h.handle(html)
  File "html2text.py", line 254, in handle
    return self.optwrap(self.close())
  File "html2text.py", line 266, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)

Problem with <font> tags --> not displaying markdown syntax

Hi, first time poster here. I apologize in advance for not following any issue-submission protocol that may exist.

I am working on converting corporate annual reports (default format html, yet no standardized form of html) to text with markdown syntax. HTML2Text works perfectly for and tags, but not for <font...FontWeight: Bold> type tags. In these instances, the text is displayed with no markdown tags. I am a novice Python programmer and I cannot overcome this issue on my own.

This research is very important as it will expose certain companies that were either negligent or incompetent in the years before and surrounding the recent financial meltdown. Any help will be greatly appreciated.

Here is some sample html that exhibits the problem I described above...

https://docs.google.com/document/d/1PUSJWCfnddFCMzb_qiIg7dQYxwyBJpsh-T_cR55oa-A/edit?usp=sharing

Bug - Converting an <img> tag with a hyphen in src and a src greater than 74 characters adds a newline after the hyphen in the output

I noticed an odd bug when converting an <img> tag containing:

  • A hyphen in the src
  • A src longer than 74 characters

Converting an <img> tag with a src of 74 characters or less works fine

>>> # Note the missing "y" in the last word, "supply"
>>> img = '<img src="http://matthewmoisen.com/blog/wp-content/matthew_moisen_tractor_suppl.jpg">'
>>> html2text.html2text(img)
u'![](http://matthewmoisen.com/blog/wp-content/matthew_moisen_tractor_suppl.jpg)\n\n'

>>> # Note the addition of the "y" in the last word, "supply"
>>> img = '<img src="http://matthewmoisen.com/blog/wp-content/matthew_moisen_tractor_supply.jpg">'
>>> html2text.html2text(img)
u'![](http://matthewmoisen.com/blog/wp-\ncontent/matthew_moisen_tractor_supply.jpg)\n\n'

See how a \n character has been added after wp- ?
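
The standard textwrap module treats the position after a hyphen in a compound word as a legal break point by default, which would explain a break right after "wp-". Whether html2text's wrapping goes through exactly this path is an assumption here; the sketch below only shows the general behaviour and one way to switch it off, using an assumed wrap width of 78.

```python
import textwrap

BODY_WIDTH = 78  # assumed wrap width for this illustration

image = "![](http://matthewmoisen.com/blog/wp-content/matthew_moisen_tractor_supply.jpg)"

# With default settings the over-long token is split across lines.
default_lines = textwrap.wrap(image, BODY_WIDTH)

# Disabling both hyphen and long-word breaking keeps the URL on one line,
# at the cost of exceeding BODY_WIDTH.
safe_lines = textwrap.wrap(
    image, BODY_WIDTH, break_on_hyphens=False, break_long_words=False
)
```

Both flags are needed: with only break_on_hyphens=False, the default break_long_words=True would still cut the URL at the width limit.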

Use sortable versioning scheme

Please use easily sortable versions (e.g. "3.1.0", "3.1.0.1", "3.1.1", "3.1.2"). Many tools treat "3.101" as newer than "3.11". (The master branch now has version "3.11 dev".)

$ python -c 'from distutils.version import LooseVersion; print(LooseVersion("3.101") > LooseVersion("3.11"))'
True
$ python -c 'from distutils.version import StrictVersion; print(StrictVersion("3.101") > StrictVersion("3.11"))'
True

Examples from a function used by Portage (main package manager for Gentoo Linux):

$ python -c 'from portage import vercmp; print(vercmp("1", "2"))'
-1
$ python -c 'from portage import vercmp; print(vercmp("1", "1"))'
0
$ python -c 'from portage import vercmp; print(vercmp("2", "1"))'
1
$ python -c 'from portage import vercmp; print(vercmp("3.101", "3.11"))'
1
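
The underlying reason is that these tools compare version components as integers, so 101 is greater than 11. A minimal sketch of that comparison, assuming purely numeric dotted versions (version_key is a hypothetical helper), shows why the dotted scheme suggested above sorts predictably:

```python
def version_key(version):
    """Turn a dotted version string such as "3.1.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


releases = ["3.1.2", "3.1.0", "3.1.0.1", "3.1.1"]
ordered = sorted(releases, key=version_key)

# Components compare numerically, so "3.101" ranks above "3.11".
newer = version_key("3.101") > version_key("3.11")
```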

xrange is not supported in Python3

This module gives the impression that it supports Python 3; however, xrange does not exist in Python 3.

I suggest changing it to range, or importing xrange from six.moves when running on Python 3! :)
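
If adding a dependency on six is not desirable, a small compatibility shim at the top of the module is another option. This is a common idiom shown as a sketch, not the change the project actually made:

```python
try:
    xrange  # Python 2: the lazy range type already exists
except NameError:
    xrange = range  # Python 3: range is already lazy

total = sum(i for i in xrange(5))
```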

DEVELOPMENT HAS BEEN MOVED | New maintainer of the html2text project

Hey,

tl;dr: https://github.com/Alir3z4/html2text

It's been almost 3 years since the last release, version 3.02. I use this package myself on some projects and it works perfectly.

Recently I found out that the author is sadly no longer alive, and there are a couple of really helpful pull requests waiting to be reviewed and merged into master. Although we can fork the repo and use it on our own, the original package on PyPI would remain untouched and would eventually become outdated.

Currently I'm the maintainer of html2text on PyPI:
https://pypi.python.org/pypi/html2text

I bumped the version number to 2014.4.5 and did a release. No changes have been made to the original code (mainly to avoid introducing conflicts with current pull requests and unmerged patches); I only did the following:

  • Add ChangeLog.rst file.
  • Add AUTHORS.rst file.
  • Update README.md.

I also merged some changes from @mcepl's fork because they looked good. It would be great to see others send their patches to my repo, where I'll keep the latest changes:

https://github.com/Alir3z4/html2text

I'll try to do code review as much as I can, but I hope to get some help from @mcepl and others who are already familiar with the code base and its concept.

Thanks,
Alireza Savand

Trailing line break in list element should be ignored

Given the following HTML:

<ul>
    <li>Item 1 <br /></li>
    <li>Item 2</li>
</ul>

the following Markdown will be generated:

* Item 1 \n\n* Item 2

Which will actually be interpreted as two distinct unordered lists instead of one list with two entries. It seems that removing trailing line breaks from list items would make for a better semantic translation.

Unicode Decode Error:

When I tried:
python3.2 html2text.py "http://salonkritik.net/"

I got an error:
Traceback (most recent call last):
  File "html2text.py", line 478, in <module>
    data = text.decode(encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xed in position 2624: invalid continuation byte

The website is in Spanish, encoding supposedly iso-8859-1, which may be the cause of the issue.

I added 'ignore' to line 478 to ignore errors, and it seems to work. But I'm still a newbie at Python and I'm not sure what chaos (if any) might be wrought by ignoring encoding errors.

if encoding == 'us-ascii':
    encoding = 'utf-8'
data = text.decode(encoding, 'ignore')
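
One thing to be aware of is that 'ignore' silently drops every byte that cannot be decoded, so accented characters simply disappear from the output. A gentler alternative is to try a short list of candidate encodings before giving up. The sketch below is written for Python 3; decode_html and its default encoding list are assumptions, not html2text behaviour.

```python
def decode_html(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode bytes with the first encoding that succeeds.

    Falls back to UTF-8 with replacement characters if every attempt fails.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


latin1_text = decode_html(b"sal\xf3n")
utf8_text = decode_html("sal\u00f3n".encode("utf-8"))
```

Since iso-8859-1 maps every byte to a character, it works as a catch-all but may produce wrong characters for pages in other legacy encodings.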

bad end tag error

> python html2text.py http://www.hardcoregaming101.net/
Traceback (most recent call last):
  File "C:\SharedPrograms\html2text\html2text.py", line 491, in <module>
    wrapwrite(html2text(data, baseurl))
  File "C:\SharedPrograms\html2text\html2text.py", line 450, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "C:\SharedPrograms\html2text\html2text.py", line 445, in html2text_file
    h.feed(html)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u'</a</p>', at line 416, column 113

Support for plain text without Markdown syntax

I'm working on a fork to add an option (--no-markdown) that will allow the conversion of HTML to pure plain text. For example, this will add quotation marks around blockquotes, remove any markdown syntax for headers (and several other places), and basically present things nicely when markdown will not be used to render it.

I believe this is a valid use case; this is the best project that I've found that converts HTML into plain text, but I think it would be nice to have an option to output things straight to plain text without any Markdown syntax.

Here's an example output:

Output (Markdown):
# Title of my document:

**Lorem** Ipsum is simply dummy text of the printing and typesetting 
industry. Lorem Ipsum has been the industry's standard dummy 
text ever since the 1500s, when an unknown printer took a galley 
of type and scrambled it to make a type specimen book.

Check out an awesome project here: [https://github.com/aaronsw/html2text](https://github.com/aaronsw/html2text)

> It was popularised in the 1960s with the release of Letraset sheets 
containing Lorem Ipsum passages, and more recently with desktop 
publishing software like Aldus PageMaker including versions 
of Lorem Ipsum.

  * bit
  * bold italic
    * orange
    * apple
  * final
Output (No Markdown):
Title of my document:

Lorem Ipsum is simply dummy text of the printing and typesetting
industry. Lorem Ipsum has been the industry's standard dummy
text ever since the 1500s, when an unknown printer took a galley
of type and scrambled it to make a type specimen book.

Check out an awesome project here: https://github.com/aaronsw/html2text

“It was popularised in the 1960s with the release of Letraset sheets
containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions
of Lorem Ipsum.”

  – bit
  – bold italic
    – orange
    – apple
  – final

I've got things rolling here: mwaterfall/html2text@6e288c3

I'd love to hear views on this. I'm happy to put some more work into it so it's ready to eventually merge into the main project.

wrapwrite doesn't encode output

$ python html2text.py http://google.com/
Traceback (most recent call last):
  File "html2text.py", line 473, in <module>
    wrapwrite(html2text(data, baseurl))
  File "html2text.py", line 436, in wrapwrite
    def wrapwrite(text): sys.stdout.write(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 85: ordinal not in range(128)
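
A common way around this class of error is to encode the text explicitly and write bytes, instead of relying on the console's default codec. The following is a sketch under assumptions (write_encoded is a hypothetical helper and UTF-8 is an assumed default), not the fix adopted by the project:

```python
import io


def write_encoded(text, stream, encoding="utf-8"):
    """Encode text explicitly and write the bytes to a binary stream."""
    stream.write(text.encode(encoding, errors="replace"))


# In a real script on Python 3 the stream could be sys.stdout.buffer.
buffer = io.BytesIO()
write_encoded(u"caf\xe9 \xbb", buffer)
```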

This is not an issue with Python3.

href instead of content

It would be nice to have a configuration option that applies when you set

ignore_links = True

to choose whether the output keeps the link's href or the content of the tag.
