node-html-to-text's People

Contributors

aautio, baptistejamin, danmactough, dremora, glkz, greenkeeperio-bot, iamstarkov, jimmywarting, killymxi, luddd3, madflow, nicktouchette, pdehaan, pgrm, pklada, real-artswan, realityking, rissem, royaltm, secretfader, simllll, synzen, thanh-nguyen, thefive, tomyam1-personal, urish, valeriansaliou, volgunin, webjay, wong2

node-html-to-text's Issues

First element of sublist not indented

Nested lists are not formatted correctly: the first element of the sublist appears on the same line as the parent list. This snippet:

<ul>
  <li>1
    <ul>
      <li>2
        <ul>
          <li>3.1</li>
          <li>3.2</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

should output as:

* 1
  * 2
    * 3.1
    * 3.2

but instead outputs as:

* 1 * 2 * 3.1
       * 3.2
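
For reference, a minimal sketch of the expected nesting behaviour. It does not use html-to-text; the node shape ({ text, children }) and the renderList name are assumptions made for illustration only.

```javascript
// Hypothetical node shape: { text: string, children?: node[] }.
function renderList(items, depth = 0) {
  const lines = [];
  for (const item of items) {
    // Each item starts on its own line, indented two spaces per nesting level.
    lines.push(' '.repeat(depth * 2) + '* ' + item.text);
    // Sub-items follow on their own lines, one level deeper.
    lines.push(...renderList(item.children || [], depth + 1));
  }
  return lines;
}

const tree = [
  { text: '1', children: [
    { text: '2', children: [{ text: '3.1' }, { text: '3.2' }] },
  ] },
];

console.log(renderList(tree).join('\n'));
```

The point is only that a sublist begins on a new line rather than continuing the parent item's line.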

Does not process tables with center tag

Although the center tag is deprecated in HTML5, some vendors still use it in their emails.

For example:

<html>
<body>
    <table>
    <center>
        <tbody>
            <tr>
                <td>Issue Date:</td>
                <td>Wed, 01 Oct 2014</td>
            </tr>
        </tbody>
    </center>
    </table>
</body>
</html>

becomes:

I use:

var text = HtmlToText.fromString(html, {tables: true})

Support Node 4 engine

Upgrading an app to Node 4 currently produces the following warning:

npm WARN engine [email protected]: wanted: {"node":">= 0.8.0 <=0.12 || <3"} (current: {"node":"4.1.0","npm":"2.14.4"})

Google drive html-document parsed incorrectly

If I create a new Google Drive document containing just the single word BUG, format it as Heading 2, save it as html/zip, and run the converter on it, the result is:
A NAME="H.SO6TKWWQXWA"BUG

According to the W3C Validator, it seems that Google Docs uses the A tag in a way that is not fully standards compliant. It says the following:

Below is a list of the warning message(s) produced when checking your document.
Warning Line 1, Column 1950: The name attribute is obsolete. 
Consider putting an id attribute on the nearest container instead.
…><h2 class="c1"><a name="h.so6tkwwqxwa"></a><span>BUG</span></h2></body></html>

The html code is available at https://gist.github.com/SmallRoomLabs/506eeb0871c5272119e5

using --tables=true doesn't produce the expected results.

Running the example/test.html as described on the main page results in a correctly formatted table. However, changing "--tables=#invoice,.address" to "--tables=true" (which should output all tables, not just the ones specified via class or ID) doesn't produce nicely formatted output.

According to the docs, this should work, but something is not getting set correctly. The HTML I'm working on has no class or ID information on its tables, so "--tables=true" is the only way I can get the output I need.

Does not process tables with uppercase tags

When a table has one of its tags in uppercase the output is incorrect.

For example:

<html>
<body>
    <table>
        <TR>
            <td>Issue Date:</td>
            <td>Wed, 01 Oct 2014</td>
        </TR>
    </table>
    <p>Passengers</p>
</body>
</html>

becomes:

Passengers

I use:

var text = HtmlToText.fromString(html, {tables: true})
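
A small sketch of one way to make tag matching case-insensitive. The element shape ({ name, children }) and the isTag/findRows helpers are hypothetical and not taken from the library's source.

```javascript
// Hypothetical element shape: { name: string, children?: element[] }.
function isTag(elem, tagName) {
  // HTML tag names are case-insensitive, so compare in lower case.
  return typeof elem.name === 'string' && elem.name.toLowerCase() === tagName;
}

function findRows(table) {
  return (table.children || []).filter((child) => isTag(child, 'tr'));
}

const table = {
  name: 'table',
  children: [{ name: 'TR', children: [] }, { name: 'tr', children: [] }],
};

console.log(findRows(table).length); // 2
```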

Style tags not parsed - show up in content

Style tag contents are inserted as text. This can be easily fixed by adding a case in html_to_text.js ~line 109 to skip style tags.

Thanks for the very helpful module!

Erroneously removes >

html-to-text removes >, even if the < is escaped.

$ echo "&lt;b>Hello" | ./bin/cli.js 
<b Hello

No newline after tables

When a table is followed by either h1, h2, p, or div there is no newline between them.

For example:

<html>
<body>
    <table>
        <tr>
            <td>Issue Date:</td>
            <td>Wed, 01 Oct 2014</td>
        </tr>
    </table>
    <p>Passengers</p>
</body>
</html>

becomes:

Issue Date: Wed, 01 Oct 2014Passengers

instead of:

Issue Date: Wed, 01 Oct 2014
Passengers

I use this call:

var text = HtmlToText.fromString(html)

Update npm

Hello there,

Please update the npm package; it seems the published package doesn't have the new link options, which are really useful.

Thanks for this great project,
Afshin

throws error on <img /> without attributes

require('html-to-text').fromString('<img />')
require('html-to-text').fromString('<img>')

TypeError: Cannot read property 'alt' of undefined
    at Object.formatImage [as image] (node_modules/html-to-text/lib/formatter.js:14:18)
    at node_modules/html-to-text/lib/html-to-text.js:82:24
    at Function._.each._.forEach (node_modules/html-to-text/node_modules/underscore/underscore.js:103:9)
    at walk (node_modules/html-to-text/lib/html-to-text.js:77:4)
    at htmlToText (node_modules/html-to-text/lib/html-to-text.js:32:15)
    at Object.exports.fromString (node_modules/html-to-text/lib/html-to-text.js:149:9)
    at repl:1:26
    at REPLServer.self.eval (repl.js:110:21)
    at Interface.<anonymous> (repl.js:239:12)
    at Interface.emit (events.js:95:17)
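
A defensive sketch of an image formatter that tolerates a missing attribs object. This is not the code in formatter.js; the function body and the "alt [src]" output format are assumptions for illustration.

```javascript
// Hypothetical stand-in for an image formatter; elem mimics a parsed <img> node.
function formatImage(elem) {
  // A bare <img> may be parsed without an attribs object, so default to {}.
  const attribs = elem.attribs || {};
  const alt = attribs.alt || '';
  const src = attribs.src || '';
  if (!alt && !src) return '';
  if (!src) return alt;
  return alt ? `${alt} [${src}]` : `[${src}]`;
}

console.log(JSON.stringify(formatImage({ name: 'img' }))); // ""
console.log(formatImage({ name: 'img', attribs: { alt: 'Logo', src: 'logo.png' } })); // Logo [logo.png]
```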

Allow to disable wordwrap

I use node-html-to-text to convert html content before writing it to csv/xlsx file.
I don't want word wrapping when doing so.

I currently pass { wordwrap: NaN } and this disables word wrapping, because any comparison of length + word.length against NaN is false (see wordwrap() in helper.js).

I suggest that { wordwrap: null } would be used to disable word wrapping.
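
A rough sketch of what the suggestion could look like. This is a standalone wordwrap written for illustration, not the function in helper.js, and treating every non-positive width as "no wrapping" is an assumption.

```javascript
function wordwrap(text, width) {
  // null, false, undefined, NaN, 0 and negative widths all mean "do not wrap".
  if (!(width > 0)) return text;
  const lines = [];
  let line = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    if (line && line.length + 1 + word.length > width) {
      // The word does not fit on the current line, so start a new one.
      lines.push(line);
      line = word;
    } else {
      line = line ? line + ' ' + word : word;
    }
  }
  if (line) lines.push(line);
  return lines.join('\n');
}

console.log(wordwrap('a long cell value that must stay on one line', null));
console.log(wordwrap('aaa bbb ccc', 7));
```

With an explicit check up front, callers no longer depend on how NaN behaves in comparisons.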

Bug in Table formatting since 2.1.3 ?

It looks like table formatting is broken after installing 2.1.3 instead of 2.1.0.

htmltotext.fromString('<table><tr><th>Key</th><th>Value</th></tr><tr><td>status</td><td>edit</td></tr></table>');

results in

KeyValuestatusedit
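
To make the expectation concrete, here is a sketch of column-aligned output for the same cells. It assumes the rows have already been extracted into arrays of strings; formatTable and the three-space column gap are illustrative choices, not the library's formatter.

```javascript
// rows: array of arrays of cell strings (hypothetical, already extracted).
function formatTable(rows) {
  // Find the widest cell in each column.
  const widths = [];
  for (const row of rows) {
    row.forEach((cell, i) => {
      widths[i] = Math.max(widths[i] || 0, cell.length);
    });
  }
  // Pad every cell to its column width and separate columns with three spaces.
  return rows
    .map((row) => row.map((cell, i) => cell.padEnd(widths[i])).join('   ').trimEnd())
    .join('\n');
}

console.log(formatTable([['Key', 'Value'], ['status', 'edit']]));
```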

Convert '<a>' tags that contains '&amp;'

There is a problem when converting an <a> tag that contains &amp; in its href value and its text.

var htmlToText = require('html-to-text');
var text = htmlToText.fromString('<a href="https://www.mywebsite.com?param1=value1&amp;param2=value2">Click &amp; Go</a>');
console.log(text); // Outputs: 'Click �param2=value2]'
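
A sketch of the decoding one would expect here, applied to both the link text and the href. decodeEntities and formatLink are hypothetical helpers covering only a handful of named entities; they are not the library's implementation.

```javascript
const NAMED = new Map([['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"]]);

function decodeEntities(text) {
  // Single pass over the input, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const hex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(hex ? 2 : 1), hex ? 16 : 10);
      // Leave out-of-range numeric references untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as they are.
    return NAMED.has(body) ? NAMED.get(body) : match;
  });
}

function formatLink(text, href) {
  return `${decodeEntities(text)} [${decodeEntities(href)}]`;
}

console.log(formatLink('Click &amp; Go', 'https://www.mywebsite.com?param1=value1&amp;param2=value2'));
// Click & Go [https://www.mywebsite.com?param1=value1&param2=value2]
```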

<li>these words each end on a new line</li>

When I set wordwrap: false the string <li>these words each end on a new line</li> becomes these\nwords\neach\nend\non\na\nnew\nline.

If I do not set wordwrap: false, they don't.

Test not passed

-We appreciate your business. And we hope you'll check out our new products [http://example.com/] !
+We appreciate your business. And we hope you'll check out our new products [http://example.com/]!

80 chars per line formatting can be broken by formatting tags

I've found that the 80 char per line limit can be broken by inline formatting. Is there a recommended way to get this to behave itself or is this a bug?

<p>
  <strong>This text isn't counted</strong> when calculating where to break a string for 80 character line lengths.
</p>

package.json: Accept engine node >= 0.9.0

npm install ghost (currently at version 0.5) gives the warning:

npm WARN engine [email protected]: wanted: {"node":"~0.8.0"} (current: {"node":"0.10.30","npm":"1.4.21"})

Is it possible to increase the engine version in package.json, or is there some strict dependency on node 0.8.x?

linkHrefBaseUrl not set for html

I can't figure out why linkHrefBaseUrl is not applied in the html version but is applied correctly in the converted text version. I have set linkHrefBaseUrl in options, and the link is built correctly for text, but the base URL is missing for html, resulting in an incomplete href: href=3D'/auth/verify'

I have a very basic use case so far:

import nodemailer from 'nodemailer'
import htmlToText from 'html-to-text'

const Mailer = nodemailer.createTransport(someTransporter(options))

const htmlToTextOptions = {
  linkHrefBaseUrl: 'http://domain.com'
}
Mailer.use('compile', async (mail, next) => {
  if (!mail.data.text && mail.data.html) {
    mail.data.text = await htmlToText.fromString(mail.data.html, htmlToTextOptions)
  }
  next()
})

const verificationUrl = '/auth/verify'
let body = `<b>Awesome sauce</b><br>
  <a href='${verificationUrl}'>Verify Account</a><br>
  Thank you!`

let mail = {
  to: 'some@email.com',
  from: '[email protected]',
  subject: 'Registration',
  html: body
}
Mailer.sendMail(mail, (err, res) => {...})

Environment:
node 4.2
nodemailer: 2.0.0
node-html-to-text: 1.5.1

These are the resulting plain and html types as received by any mail client:

Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Mime-Version: 1.0

Awesome sauce
Verify Account [http://domain.com/auth/verify]
Thank you!=

Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Mime-Version: 1.0

Awesome sauce

<a href=3D'/auth/verify'>Verify Account

Thank you!=
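
The output above is consistent with the option only affecting the generated text while the HTML body is sent unchanged, though the issue does not confirm that. Under that assumption, one possible workaround is to resolve the hrefs in the HTML yourself before sending. absolutizeHrefs is a hypothetical helper, and the regex is used only to keep the sketch short; a real HTML parser would be more robust.

```javascript
// Resolve relative href values in an HTML string against baseUrl.
function absolutizeHrefs(html, baseUrl) {
  return html.replace(/href=(['"])(.*?)\1/g, (match, quote, href) => {
    try {
      // The WHATWG URL constructor resolves href relative to baseUrl.
      return `href=${quote}${new URL(href, baseUrl).href}${quote}`;
    } catch (err) {
      return match; // leave values that cannot be parsed as URLs untouched
    }
  });
}

const body = "<a href='/auth/verify'>Verify Account</a>";
console.log(absolutizeHrefs(body, 'http://domain.com'));
// <a href='http://domain.com/auth/verify'>Verify Account</a>
```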

'RangeError: Maximum call stack size exceeded' When processing html with large number of <wbr> tags

Steps to reproduce:

  1. Install html-to-text version 1.2.1
  2. Create html file with 2000 <wbr> tags. See the sample python script
  3. Run the command cat test-bad.html | html-to-text > out.txt
  4. Observe 'RangeError: Maximum call stack size exceeded'

# Python script to generate the HTML
with open('test-bad.html', 'w') as f:
    f.write('<!DOCTYPE html><html><head></head><body>\n')
    for n in range(2000):
        f.write('<wbr>\n')
    f.write('</body></html>')

This is also reproducible using invalid HTML tags instead of <wbr> tags, e.g. <foo>.
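
If the error comes from a recursive walk over deeply nested nodes (an assumption; unclosed tags could plausibly produce that kind of nesting), an explicit stack avoids the call-depth limit. The node shape and collectText below are hypothetical and only sketch the technique.

```javascript
// Hypothetical node shape: { text?: string, children?: node[] }.
function collectText(root) {
  const out = [];
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.text) out.push(node.text);
    const children = node.children || [];
    // Push in reverse so children are visited in document order.
    for (let i = children.length - 1; i >= 0; i--) stack.push(children[i]);
  }
  return out.join('');
}

// Build a chain 100000 levels deep and walk it without recursion.
let deep = { text: 'end' };
for (let i = 0; i < 100000; i++) deep = { children: [deep] };

console.log(collectText(deep)); // end
```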

Should not remove server and/or templating tags

Given the following,

<p>Hello, <?php echo $name  ?></p>

It makes no sense to strip out the <?php ... ?> part.

This applies to other template styles, such as <% ... %> or shorthand styles <?= ... ?>.


The simplest solution would probably be to have tag prefix/suffix pair exclusion rules,
e.g. [ ['<%', '%>'], [ '<?', '?>' ] ...
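
A sketch of how such pair rules might work: swap each template tag for a placeholder, run the conversion, then restore the tags. withTemplateTags, the NUL-delimited placeholders and the stripTags stand-in converter are all assumptions; a real converter would need to pass the placeholders through unchanged.

```javascript
const PAIRS = [['<%', '%>'], ['<?', '?>']];

function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Protect template tags with placeholders, run convert, then restore them.
function withTemplateTags(html, convert, pairs = PAIRS) {
  const saved = [];
  let protectedHtml = html;
  for (const [open, close] of pairs) {
    const re = new RegExp(escapeRegExp(open) + '[\\s\\S]*?' + escapeRegExp(close), 'g');
    protectedHtml = protectedHtml.replace(re, (tag) => {
      saved.push(tag);
      return `\u0000${saved.length - 1}\u0000`;
    });
  }
  return convert(protectedHtml).replace(/\u0000(\d+)\u0000/g, (m, i) => saved[Number(i)]);
}

// Stand-in converter for the demo: strips anything that looks like a tag.
const stripTags = (html) => html.replace(/<[^>]*>/g, '');

console.log(withTemplateTags('<p>Hello, <?php echo $name  ?></p>', stripTags));
// Hello, <?php echo $name  ?>
```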

selector for baseElement

Hello.
Any idea how I would go about selecting all divs whose id starts with pf, like pf1, pf2, etc.?
baseElement: ['div#pf * ???'] Is there a wildcard match for this?
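
Whether baseElement accepts any wildcard syntax is not stated in this thread. As a sketch of the matching itself, here is a prefix filter over an already parsed tree; the node shape and findByIdPrefix are hypothetical.

```javascript
// Hypothetical node shape: { name, attribs?: { id? }, children?: node[] }.
function findByIdPrefix(node, tagName, prefix, matches = []) {
  const id = (node.attribs && node.attribs.id) || '';
  if (node.name === tagName && id.startsWith(prefix)) matches.push(node);
  // Descend into children so nested matches are found too.
  for (const child of node.children || []) {
    findByIdPrefix(child, tagName, prefix, matches);
  }
  return matches;
}

const doc = {
  name: 'body',
  children: [
    { name: 'div', attribs: { id: 'pf1' } },
    { name: 'div', attribs: { id: 'sidebar' } },
    { name: 'div', attribs: { id: 'pf2' } },
  ],
};

console.log(findByIdPrefix(doc, 'div', 'pf').map((n) => n.attribs.id)); // [ 'pf1', 'pf2' ]
```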

Some HTML attributes coming through

This code:

    var request = require('request');
    var htt = require('html-to-text');

    request('http://bashmodernquantity.com/bash-modern-quantity/2014/1/10/wool-and-copper',function(err,res,body){
      var text = htt.fromString(body);
      console.log(text);
    });

Will log out several URLs, like https://twitter.com/thomasqbrady, which only occur in the HTML document as attribute values (of HREF) on anchor tags. A few anchor IDs show up too; they are easy to spot, since they start with #.

Support video elements

Looks like there isn't an option to keep video elements in the text.

Love the support for <img/> elements, would be nice to have similar support for <video>.

An anchor (<a/>) without a href attribute causes a crash

Hi,

I have started developing with html-to-text and discovered that it expects every anchor to have a href and crashes when it doesn't find one.

Changing line 25 in formatter.js from

return elem.attribs.href.replace(/^mailto\:/, '');

to

if (elem.attribs && elem.attribs.href) {
    return elem.attribs.href.replace(/^mailto\:/, '');
}
else {
    return helper.wordwrap(helper.decodeHTMLEntities(_s.strip(elem.raw)), options.wordwrap);
}

fixes this.

Thank you!

Relax node "engine" requirements.

npm WARN engine [email protected]: wanted: {"node":"~0.8.0"} (current: {"node":"0.10.36","npm":"2.3.0"})

Everything seems to work fine outside of these boundaries. Maybe you can relax the engine requirements to include Node 0.10, 0.12, and IO.js 1.x.x?

Add CHANGELOG.md

Please add one; right now it's really hard to tell what has changed between versions.

Thanks for the package btw! Using it at Trustroots.

Text formatting causes space loss between words

While integrating your html-to-text npm module, I ran into an issue where words that are formatted (bold, italic, etc.) would be converted without their surrounding spaces. As a result, the output has the words jammed together. My fix makes sure that any surrounding spaces are converted to a single space. That allows the output to be readable regardless of the formatting or word wrapping.

Proposed solution in formatter.js:

function formatText(elem, options) {
    var lws = elem.raw.match(/^\s+/g);
    var rws = elem.raw.match(/\s+$/g);
    var text = helper.decodeHTMLEntities(elem.raw);
    return (lws != null ? ' ' : '') + helper.wordwrap(text, options.wordwrap) + (rws != null ? ' ' : '');
}

To test, create a sentence with a bold word in it.

Extra spaces being added

var htmlToText = require('html-to-text');
var text = htmlToText.fromString("This is a <a href='url'>link</a>.");
console.log(text);

will print This is a link [url] . (note the space before the period). This makes it problematic to use, for example when a URL comes at the end of a sentence.

It also seems to be adding spaces before the link.
