html-to-text / node-html-to-text


1.6K 28.0 224.0 1.21 MB

Advanced html to text converter

License: Other

JavaScript 95.78% HTML 4.22%

html text plain-text node email pretty-print converter

node-html-to-text's People


abhishekbhalani starterstep benjaminma hipmob mikerobe aisaacs thanh-nguyen danielshort25 zosiakropka stepupsoftware caffeinewriter paulmalenke meedamian danmactough kritivasas tanongkit2 amkohn pipoprods halfdan rameshv bjdmeest born2net panta82 royaltm nvdnkpr seakkus pgrm tomyam1 strml skhvan1111 luizfilipe nathan929 ethanresnick webjay anubhavmishra huancz volgunin pdehaan iamstarkov frankie4fingers ruid0 kameamea pathumego matthiasdg ironprogrammer pmug toddtarsi ecaroth simllll urish standuprey andrewfinlay flyeven aef- dahui1024 pklada duanefields xdevelsistemas martinnuc lwalace dremora oinkagency answerdash jamie-robertson pahalovo nemo johnhawley tawk greg-js jaanek jackhou imbox tobby2002 xiaohanzhang rissem joinergreat sumeet-chaayos mattbailey haxpor ideawake ropbla9 plygrynd-jynkyrd tony824 javascript-forks adam-lynch codifyclp dmytrovakoliuk bestmusman stepboom ilyalesik niksmith samsonhabte2006 i-sight oldlam x31forest appropos wong2 winfieldco isaacwhite fde31

node-html-to-text's Issues

Add class/id table selector support

First element of sublist not indented

Nested lists are not formatted correctly: the first element of the sublist appears on the same line as the parent list. This snippet:

<ul>
  <li>1
    <ul>
      <li>2
        <ul>
          <li>3.1</li>
          <li>3.2</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

should output as:

* 1
  * 2
    * 3.1
    * 3.2

but instead outputs as

* 1 * 2 * 3.1
       * 3.2
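
One way to get the expected output is to render every nesting level on its own lines with an indentation prefix. The sketch below is not the library's code; it assumes a hypothetical, already-parsed tree of `{ text, children }` items, and `formatList` is an illustrative name.

```javascript
// Hypothetical input shape: [{ text: '1', children: [...] }], not the library's DOM.
function formatList(items, depth = 0) {
  const indent = '  '.repeat(depth);
  const lines = [];
  for (const item of items) {
    // Each item starts on its own line, prefixed by the indentation of its level.
    lines.push(`${indent}* ${item.text}`);
    if (item.children && item.children.length > 0) {
      lines.push(formatList(item.children, depth + 1));
    }
  }
  return lines.join('\n');
}

const tree = [
  { text: '1', children: [{ text: '2', children: [{ text: '3.1' }, { text: '3.2' }] }] },
];
console.log(formatList(tree));
```

Because a sublist is always pushed as a separate entry, its first element can never end up on the parent's line.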

Does not process tables with center tag

Although center tag is deprecated in HTML5, some vendors still use it in their emails.

For example:

<html>
<body>
    <table>
    <center>
        <tbody>
            <tr>
                <td>Issue Date:</td>
                <td>Wed, 01 Oct 2014</td>
            </tr>
        </tbody>
    </center>
    </table>
</body>
</html>

becomes:

I use:

var text = HtmlToText.fromString(html, {tables: true})

Support Node 4 engine

This currently throws the following warning trying to upgrade an app to Node 4:

npm WARN engine [email protected]: wanted: {"node":">= 0.8.0 <=0.12 || <3"} (current: {"node":"4.1.0","npm":"2.14.4"})

Google drive html-document parsed incorrectly

If I create a new Google Drive document containing just the single word BUG, format it as Heading 2, save it as html/zip, and run the converter on it, the result is:
A NAME="H.SO6TKWWQXWA"BUG

According to the W3C Validator, it seems Google Docs uses the A tag in a way that is not fully standards compliant. It says the following:

Below is a list of the warning message(s) produced when checking your document.
Warning Line 1, Column 1950: The name attribute is obsolete. 
Consider putting an id attribute on the nearest container instead.
…><h2 class="c1"><a name="h.so6tkwwqxwa"></a><span>BUG</span></h2></body></html>

The html code is available at https://gist.github.com/SmallRoomLabs/506eeb0871c5272119e5

using --tables=true doesn't produce the expected results.

Running example/test.html as described on the main page produces a correctly formatted table. However, changing "--tables=#invoice,.address" to "--tables=true" (which should output all tables, not just the ones specified via class or ID) doesn't produce nicely formatted output.

According to the docs, this should be the case, but something is not getting set correctly. I have a set of HTML I'm working on that has no class or ID information on the tables so the "--tables=true" is the only way I can get the output I need.

Does not process tables with uppercase tags

When a table has one of its tags in uppercase the output is incorrect.

For example:

<html>
<body>
    <table>
        <TR>
            <td>Issue Date:</td>
            <td>Wed, 01 Oct 2014</td>
        </TR>
    </table>
    <p>Passengers</p>
</body>
</html>

becomes:

Passengers

I use:

var text = HtmlToText.fromString(html, {tables: true})
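
A possible fix is to normalize tag names before looking up how to format them. This is only a sketch under assumptions: the `formatters` table, the node shape (`name`, `children`), and `formatElement` are hypothetical and not taken from the library.

```javascript
// Hypothetical formatter table keyed by lowercase tag name.
const formatters = {
  tr: (elem) => `row(${elem.children.length} cells)`,
};

function formatElement(elem) {
  // 'TR' and 'tr' resolve to the same entry after lowercasing.
  const name = String(elem.name || '').toLowerCase();
  const format = formatters[name];
  return format ? format(elem) : '';
}

console.log(formatElement({ name: 'TR', children: ['Issue Date:', 'Wed, 01 Oct 2014'] }));
```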

Style tags not parsed - show up in content

Style tag contents are inserted as text. This can be easily fixed by adding a case in html_to_text.js ~line 109 to skip style tags.

Thanks for the very helpful module!

Erroneously removes >

html-to-text removes >, even if the < is escaped.

$ echo "&lt;b>Hello" | ./bin/cli.js 
<b Hello

No newline after tables

When a table is followed by either h1, h2, p, or div there is no newline between them.

For example:

<html>
<body>
    <table>
        <tr>
            <td>Issue Date:</td>
            <td>Wed, 01 Oct 2014</td>
        </tr>
    </table>
    <p>Passengers</p>
</body>
</html>

becomes:

Issue Date: Wed, 01 Oct 2014Passengers

instead of:

Issue Date: Wed, 01 Oct 2014
Passengers

I use this call:

var text = HtmlToText.fromString(html)

Add version tags

They're simple yet effective. :-)

Here; https://github.com/werk85/node-html-to-text/releases

Update npm

Hello there,

Please update the npm package. The published version doesn't seem to include the new link options, which are really useful.

Thanks for this great project,
Afshin

Disable "Transform headlines to uppercase text"

Is there an option available to disable the "Transform headlines to uppercase text" functionality?

throws error on <img /> without attributes

require('html-to-text').fromString('<img />')
require('html-to-text').fromString('<img>')

TypeError: Cannot read property 'alt' of undefined
    at Object.formatImage [as image] (node_modules/html-to-text/lib/formatter.js:14:18)
    at node_modules/html-to-text/lib/html-to-text.js:82:24
    at Function._.each._.forEach (node_modules/html-to-text/node_modules/underscore/underscore.js:103:9)
    at walk (node_modules/html-to-text/lib/html-to-text.js:77:4)
    at htmlToText (node_modules/html-to-text/lib/html-to-text.js:32:15)
    at Object.exports.fromString (node_modules/html-to-text/lib/html-to-text.js:149:9)
    at repl:1:26
    at REPLServer.self.eval (repl.js:110:21)
    at Interface.<anonymous> (repl.js:239:12)
    at Interface.emit (events.js:95:17)
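
The stack trace suggests that the formatter reads `alt` from an attribute object that does not exist. A defensive version could fall back to an empty object. The sketch below assumes a node shape with an optional `attribs` property, and its output format (`alt [src]`) is an assumption for illustration, not the library's.

```javascript
// Sketch of a defensive image formatter; `elem.attribs` may be undefined for <img />.
function formatImage(elem) {
  const attribs = elem.attribs || {};
  const alt = attribs.alt || '';
  const src = attribs.src || '';
  if (!alt && !src) return ''; // nothing useful to print
  if (alt && src) return `${alt} [${src}]`;
  return alt || `[${src}]`;
}

console.log(JSON.stringify(formatImage({}))); // empty string instead of a TypeError
```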

Allow to disable wordwrap

I use node-html-to-text to convert html content before writing it to csv/xlsx file.
I don't want word wrapping when doing so.

I currently pass { wordwrap: NaN }, which disables word wrapping because any comparison against NaN is false, so the length check in wordwrap() in helper.js never triggers.

I suggest that { wordwrap: null } would be used to disable word wrapping.
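
The suggestion could look roughly like this. It is a minimal greedy wrapper written for illustration, not the code in helper.js; treating `false` and `NaN` the same way as `null` is an additional assumption.

```javascript
// Minimal greedy word wrap; a width of null, false, or NaN means "do not wrap".
function wordwrap(text, width) {
  if (width === null || width === false || Number.isNaN(width)) return text;
  const lines = [];
  let line = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    if (line && line.length + 1 + word.length > width) {
      lines.push(line); // current line is full, start a new one
      line = word;
    } else {
      line = line ? `${line} ${word}` : word;
    }
  }
  if (line) lines.push(line);
  return lines.join('\n');
}

console.log(wordwrap('aaa bbb ccc', 7));
console.log(wordwrap('aaa bbb ccc', null));
```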

Change from underscore to lodash

Bug in Table formatting since 2.1.3 ?

It looks like table formatting is broken after installing 2.1.3 instead of 2.1.0.

htmltotext.fromString('<table><tr><th>Key</th><th>Value</th></tr><tr><td>status</td><td>edit</td></tr></table>');

results in

KeyValuestatusedit

Convert '<a>' tags that contains '&'

There is a problem when converting an <a> tag that contains & in its href value and in its text.

var htmlToText = require('html-to-text');
var text = htmlToText.fromString('<a href="https://www.mywebsite.com?param1=value1&amp;param2=value2">Click &amp; Go</a>');
console.log(text); // Outputs: 'Click �param2=value2]'
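
For comparison, here is a sketch of the result one would expect if entities were decoded in both the link text and the href. The tiny entity table and the `formatAnchor` helper are assumptions for illustration only; the `text [href]` layout follows the link output shown elsewhere in these issues.

```javascript
// Tiny entity decoder covering only a few named entities (not a full table).
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"' };

function decodeEntities(s) {
  return s.replace(/&(amp|lt|gt|quot);/g, (match, name) => ENTITIES[name]);
}

// Hypothetical anchor formatter: decode both parts, then print "text [href]".
function formatAnchor(text, href) {
  return `${decodeEntities(text)} [${decodeEntities(href)}]`;
}

console.log(
  formatAnchor('Click &amp; Go', 'https://www.mywebsite.com?param1=value1&amp;param2=value2')
);
```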

Why use blankspace instead of \n

htmlToText.fromString('<div>one</div><div>two</div>')

returns one two. Why not one\ntwo?

Add unittests

<li>these words each end on a new line</li>

When I set wordwrap: false the string <li>these words each end on a new line</li> becomes these\nwords\neach\nend\non\na\nnew\nline.

If I do not set wordwrap: false, they don't.
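
One plausible explanation, assuming the wrap check compares a length against the option with `>`, is JavaScript's type coercion: `false` becomes 0 in a numeric comparison. The function below is a hypothetical stand-in for such a check, not the library's code.

```javascript
// In a numeric comparison `false` is coerced to 0, so any non-empty line "exceeds" the width.
function exceedsWidth(lineLength, wordLength, width) {
  return lineLength + wordLength > width;
}

console.log(exceedsWidth(5, 5, 80));    // false: 10 > 80
console.log(exceedsWidth(5, 5, false)); // true:  10 > 0, so every word starts a new line
console.log(exceedsWidth(5, 5, NaN));   // false: comparisons with NaN are always false
```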

Unicode

scenario:

var htmlToText = require('html-to-text');
htmlToText.fromString('&#128514;');

The result is incorrect:

htmlToText.fromString('&#128514;')
''

Quick fix: replace String.fromCharCode => String.fromCodePoint
(polyfill https://github.com/mathiasbynens/String.fromCodePoint)
in https://github.com/werk85/node-html-to-text/blob/master/lib/helper.js#L48

require('string.fromcodepoint');
String.fromCodePoint(128514)
'😂'
String.fromCharCode(128514)
''

Test not passed

-We appreciate your business. And we hope you'll check out our new products [http://example.com/] !
+We appreciate your business. And we hope you'll check out our new products [http://example.com/]!

Setting attribute tables: true (to select all tables) doesn't work

Remove this - my fault...

80 chars per line formatting can be broken by formatting tags

I've found that the 80 char per line limit can be broken by inline formatting. Is there a recommended way to get this to behave itself or is this a bug?

<p>
  <strong>This text isn't counted</strong> when calculating where to break a string for 80 character line lengths.
</p>

package.json: Accept engine node >= 0.9.0

npm install ghost (currently at version 0.5) gives the warning:

npm WARN engine [email protected]: wanted: {"node":"~0.8.0"} (current: {"node":"0.10.30","npm":"1.4.21"})

Is it possible to increase the engine version in package.json, or is there some strict dependency on node 0.8.x?

linkHrefBaseUrl not set for html

I can't figure out why linkHrefBaseUrl is not applied in the HTML version but is applied correctly in the converted text version. I have set linkHrefBaseUrl in the options; the href is built correctly for the text part but is missing the base URL in the HTML part, resulting in an incomplete href: href=3D'/auth/verify'

I have a very basic use case so far:

import nodemailer from 'nodemailer'
import htmlToText from 'html-to-text'

const Mailer = nodemailer.createTransport(someTransporter(options))

const htmlToTextOptions = {
  linkHrefBaseUrl: 'http://domain.com'
}
Mailer.use('compile', async (mail, next) => {
  if (!mail.data.text && mail.data.html) {
    mail.data.text = await htmlToText.fromString(mail.data.html, htmlToTextOptions)
  }
  next()
})

const verificationUrl = '/auth/verify'
let body = `<b>Awesome sauce</b><br>
  <a href='${verificationUrl}'>Verify Account</a><br>
  Thank you!`

let mail = {
  to: 'some@email.com',
  from: '[email protected]',
  subject: 'Registration',
  html: body
}
Mailer.sendMail(mail, (err, res) => {...})

Environment:
node 4.2
nodemailer: 2.0.0
node-html-to-text: 1.5.1

These are the resulting plain and html types as received by any mail client:

Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Mime-Version: 1.0

Awesome sauce
Verify Account [http://domain.com/auth/verify]
Thank you!=

Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Mime-Version: 1.0

Awesome sauce

<a href=3D'/auth/verify'>Verify Account

Thank you!=
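
The output above is consistent with the option only influencing the generated text part, while the HTML body is sent unchanged. If absolute links are wanted in the HTML part too, one workaround is to resolve the hrefs before handing the HTML to the mailer. The sketch below assumes a Node version with the global `URL` class; `absolutizeHrefs` is a hypothetical helper, not part of nodemailer or html-to-text, and the regex is a simplification that only handles quoted href attributes.

```javascript
// Sketch: resolve every quoted href in an HTML string against a base URL.
function absolutizeHrefs(html, baseUrl) {
  return html.replace(/href=(['"])(.*?)\1/g, (match, quote, href) => {
    try {
      return `href=${quote}${new URL(href, baseUrl).href}${quote}`;
    } catch (err) {
      return match; // leave unparsable hrefs untouched
    }
  });
}

const body = "<a href='/auth/verify'>Verify Account</a>";
console.log(absolutizeHrefs(body, 'http://domain.com'));
```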

Changelog should be updated with NPM repository version

'RangeError: Maximum call stack size exceeded' When processing html with large number of <wbr> tags

Steps to reproduce:

Install html-to-text version 1.2.1
Create html file with 2000 <wbr> tags. See the sample python script
Run the command cat test-bad.html | html-to-text > out.txt
Observe 'RangeError: Maximum call stack size exceeded'

# Python script to generate HTML
f = open('test-bad.html','w')
f.write('<!DOCTYPE html><html><head></head><body>\n')
for n in range(2000):
    f.write('<wbr>n')
f.write('</body></html>')
f.close()

This is also reproducible using invalid HTML tags instead of <wbr> tags e.g. <foo>
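
If the parser nests each unclosed tag inside the previous one (an assumption about the cause, not something confirmed here), a recursive tree walk will overflow the call stack on deep input. An explicit stack avoids that. The node shape `{ text, children }` and `collectText` are hypothetical.

```javascript
// Iterative depth-first walk; depth is limited by memory, not by the call stack.
function collectText(root) {
  const out = [];
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.text) out.push(node.text);
    const children = node.children || [];
    // Push in reverse so children are visited in document order.
    for (let i = children.length - 1; i >= 0; i--) stack.push(children[i]);
  }
  return out.join('');
}

// Build a chain 100000 levels deep without recursion, then walk it.
let deep = { text: 'end' };
for (let i = 0; i < 100000; i++) deep = { children: [deep] };
console.log(collectText(deep));
```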

Should not remove server and/or templating tags

Given the following,

<p>Hello, <?php echo $name  ?></p>

It makes no sense to strip out the <?php ... ?> part.

This applies to other template styles, such as <% ... %> or shorthand styles <?= ... ?>.

The simplest solution would probably be to have exclusion rules made of tag prefix/suffix pairs,
e.g. [ ['<%', '%>'], ['<?', '?>'], ... ]
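
A sketch of how such pair rules could work: hide each matching span behind a placeholder, run the conversion, then restore the spans. `protectSpans` and `restoreSpans` are hypothetical helpers, the NUL-delimited placeholder is an assumption (it must survive the real converter unchanged), and a simple tag-stripping regex stands in for the conversion.

```javascript
// Replace every open...close span with a numbered placeholder and remember the original.
function protectSpans(input, pairs) {
  const saved = [];
  let text = input;
  for (const [open, close] of pairs) {
    let from = 0;
    for (;;) {
      const start = text.indexOf(open, from);
      if (start === -1) break;
      const end = text.indexOf(close, start + open.length);
      if (end === -1) break;
      const token = `\u0000${saved.length}\u0000`;
      saved.push(text.slice(start, end + close.length));
      text = text.slice(0, start) + token + text.slice(end + close.length);
      from = start + token.length;
    }
  }
  return { text, saved };
}

// Put the remembered spans back in place of their placeholders.
function restoreSpans(text, saved) {
  return text.replace(/\u0000(\d+)\u0000/g, (match, i) => saved[Number(i)]);
}

const pairs = [['<%', '%>'], ['<?', '?>']];
const protectedHtml = protectSpans('<p>Hello, <?php echo $name  ?></p>', pairs);
const plain = protectedHtml.text.replace(/<[^>]+>/g, ''); // stand-in for the real conversion
console.log(restoreSpans(plain, protectedHtml.saved));
```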

selector for baseElement

Hello.
Any idea how I would go about selecting all divs whose id starts with pf, like pf1, pf2, etc.?
baseElement: ['div#pf * ???'] Is there a wildcard match for this?

Some HTML attributes coming through

This code:

    var request = require('request');
    var htt = require('html-to-text');

    request('http://bashmodernquantity.com/bash-modern-quantity/2014/1/10/wool-and-copper',function(err,res,body){
      var text = htt.fromString(body);
      console.log(text);
    });

This logs several URLs, like https://twitter.com/thomasqbrady, which only occur in the HTML document as attribute values (of HREF) on anchor tags. A few anchor IDs show up too, recognizable because they start with #.

Support video elements

Looks like there isn't an option to keep video elements in the text.

Love the support for <img/> elements, would be nice to have similar support for <video>.

Test case failing due to syntax error

When I run the test I got the following error

Uncaught TypeError: expect(...).to.be.null is not a function

It comes from the following lines:
https://github.com/werk85/node-html-to-text/blob/master/test/html-to-text.js#L48
https://github.com/werk85/node-html-to-text/blob/master/test/html-to-text.js#L51

The correct syntax should be

expect(err).to.be.null;

An anchor (<a/>) without a href attribute causes a crash

Hi,

I have started developing with html-to-text and discovered that it expects every anchor to have a href and crashes when it doesn't find one.

Changing line 25 in formatter.js from

return elem.attribs.href.replace(/^mailto\:/, '');

to

if (elem.attribs && elem.attribs.href) {
    return elem.attribs.href.replace(/^mailto\:/, '');
}
else {
    return helper.wordwrap(helper.decodeHTMLEntities(_s.strip(elem.raw)), options.wordwrap);
}

fixes this.

Thank you!

Erroneously removes whitespace.

Normal <strong>Bold</strong> results in NormalBold

Add ul/ol support

Add command line tool

ANSI colors and styles

Is it possible to add ANSI styles to the output?
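
As an illustration of what is being asked for, the converted text could be wrapped in standard SGR escape sequences. This sketch is not an existing option of the library; `ansiStyle` and the `ANSI` table are hypothetical.

```javascript
// SGR escape codes: 1 = bold on, 22 = normal intensity, 4 = underline on, 24 = underline off.
const ANSI = {
  bold: ['\u001b[1m', '\u001b[22m'],
  underline: ['\u001b[4m', '\u001b[24m'],
};

// Wrap already-converted text in a named style; unknown names leave the text unchanged.
function ansiStyle(text, name) {
  const codes = ANSI[name];
  return codes ? codes[0] + text + codes[1] : text;
}

console.log(ansiStyle('HEADLINE', 'bold'));
```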

use tabs for formatting HTML tables

Is there a way to set HtmlToText to output tab characters to separate html table cells instead of making fixed width columns?
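
For reference, the requested output would amount to something like the following, assuming the table cells have already been extracted into arrays of strings. `formatTableTsv` is a hypothetical helper, not an existing option.

```javascript
// Rows as arrays of cell strings; cells are joined by tabs, rows by newlines.
function formatTableTsv(rows) {
  return rows
    // Collapse tabs and line breaks inside a cell so they cannot break the layout.
    .map((cells) => cells.map((c) => c.replace(/[\t\r\n]+/g, ' ').trim()).join('\t'))
    .join('\n');
}

console.log(formatTableTsv([['Key', 'Value'], ['status', 'edit']]));
```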

Relax node "engine" requirements.

npm WARN engine [email protected]: wanted: {"node":"~0.8.0"} (current: {"node":"0.10.36","npm":"2.3.0"})

Everything seems to work fine outside of these boundaries. Maybe you can relax the engine requirements to include Node 0.10, 0.12, and IO.js 1.x.x?

Complete package.json information

Add CHANGELOG.md

Please, now it's really hard to realise what has changed between versions.

Thanks for the package btw! Using it at Trustroots.

The example with tables is not formatted correctly

The example test.html is not being formatted correctly. The table format is not being preserved in the plain text.

Is there a reason <h5> and <h6> aren't handled like the other headings?

Noticed that h5 isn't converted to uppercase, but more importantly that there is no whitespace between the heading and what follows next.

Text formatting causes space loss between words

While integrating your html-to-text npm module, I ran into an issue where words that are formatted (bold, italic, etc.) would be converted without their surrounding spaces. As a result, the output has the words jammed together. My fix makes sure that any surrounding spaces are converted to a single space. That allows the output to be readable regardless of the formatting or word wrapping.

Proposed solution in formatter.js:

function formatText(elem, options) {
        var lws = elem.raw.match(/^\s+/g);
        var rws = elem.raw.match(/\s+$/g);
        var text = helper.decodeHTMLEntities(elem.raw);
        return (lws != null ? ' ' : '') + helper.wordwrap(text, options.wordwrap) +  (rws != null ? ' ' : '');
};

To test, create a sentence with a bold word in it.
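
A standalone sketch of the same idea, without the helper module, shows the effect on a sentence with a bold word. `normalizeEdges` is a hypothetical name, and the `nodes` array only imitates the text nodes a parser might produce around <strong>Bold</strong>.

```javascript
// Collapse leading/trailing whitespace of a text node to at most one space on each side.
function normalizeEdges(raw) {
  const core = raw.trim();
  if (core === '') return raw.length > 0 ? ' ' : ''; // whitespace-only node
  const lead = /^\s/.test(raw) ? ' ' : '';
  const trail = /\s$/.test(raw) ? ' ' : '';
  return lead + core + trail;
}

const nodes = ['Normal ', 'Bold', '\n  text'];
console.log(nodes.map(normalizeEdges).join(''));
```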

Add better HTML entities support

Extra spaces being added

var htmlToText = require('html-to-text');
var text = htmlToText.fromString("This is a <a href='url'>link</a>.");
console.log(text);

will print This is a link [url] . (note the space before the period). Makes it problematic to use, for example when a url comes at the end of a sentence.

It also seems to be adding spaces before the link.

Question: does this work in the browser

I appreciate this is a node library. Is there a version of this or anything similar that works in the browser?

remove console.log call in cli.js

This line pollutes the cli output.

Instead, visionmedia/debug may be used for this purpose.
