html-to-text / node-html-to-text
Advanced html to text converter
License: Other
Nested lists are not formatted correctly: the first element of the sublist appears on the same line as the parent list. This snippet:
<ul>
<li>1
<ul>
<li>2
<ul>
<li>3.1</li>
<li>3.2</li>
</ul>
</li>
</ul>
</li>
</ul>
should output as:
* 1
* 2
* 3.1
* 3.2
but instead outputs as
* 1 * 2 * 3.1
* 3.2
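The expected output above amounts to starting every item on its own line and indenting each sublist one level deeper than its parent. A minimal sketch of that idea, assuming a hypothetical in-memory tree of `{ text, children }` items (not the library's internal DOM representation) and an assumed two-space indent:

```javascript
// Hypothetical list model: each item has text and optional child items.
// This is an illustration only, not html-to-text's formatter.
function formatList(items, depth = 0) {
  const indent = '  '.repeat(depth); // two spaces per nesting level (assumed)
  const lines = [];
  for (const item of items) {
    // The item itself always starts on its own line.
    lines.push(`${indent}* ${item.text}`);
    // Children are rendered after the parent line, one level deeper.
    if (item.children && item.children.length > 0) {
      lines.push(formatList(item.children, depth + 1));
    }
  }
  return lines.join('\n');
}

const tree = [
  { text: '1', children: [
    { text: '2', children: [{ text: '3.1' }, { text: '3.2' }] },
  ] },
];

console.log(formatList(tree));
```

The key point is that a parent's line is emitted before recursing, so the first child can never end up on the parent's line.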
Although the center tag is deprecated in HTML5, some vendors still use it in their emails.
For example:
<html>
<body>
<table>
<center>
<tbody>
<tr>
<td>Issue Date:</td>
<td>Wed, 01 Oct 2014</td>
</tr>
</tbody>
</center>
</table>
</body>
</html>
becomes:
I use:
var text = HtmlToText.fromString(html, {tables: true})
This currently produces the following warning when trying to upgrade an app to Node 4:
npm WARN engine [email protected]: wanted: {"node":">= 0.8.0 <=0.12 || <3"} (current: {"node":"4.1.0","npm":"2.14.4"})
If I create a new Google Drive document containing just the single word BUG, format it as Heading2, save it as html/zip and run the converter on it, the result is:
A NAME="H.SO6TKWWQXWA"BUG
According to the W3C Validator, it seems the A tag is used in a way that is not fully standards-compliant in Google Docs. It says the following:
Below is a list of the warning message(s) produced when checking your document.
Warning Line 1, Column 1950: The name attribute is obsolete.
Consider putting an id attribute on the nearest container instead.
…><h2 class="c1"><a name="h.so6tkwwqxwa"></a><span>BUG</span></h2></body></html>
The html code is available at https://gist.github.com/SmallRoomLabs/506eeb0871c5272119e5
Running the example/test.html as described on the main page results in a correctly formatted table. However, changing "--tables=#invoice,.address" to "--tables=true" (which should output all tables, not just the ones specified via class or ID) doesn't produce nicely formatted output.
According to the docs, this should work, but something is not getting set correctly. The HTML I'm working on has no class or ID information on the tables, so "--tables=true" is the only way I can get the output I need.
When a table has one of its tags in uppercase the output is incorrect.
For example:
<html>
<body>
<table>
<TR>
<td>Issue Date:</td>
<td>Wed, 01 Oct 2014</td>
</TR>
</table>
<p>Passengers</p>
</body>
</html>
becomes:
Passengers
I use:
var text = HtmlToText.fromString(html, {tables: true})
Style tag contents are inserted as text. This can be easily fixed by adding a case in html_to_text.js ~line 109 to skip style tags.
Thanks for the very helpful module!
html-to-text removes >, even if the < is escaped.
$ echo "<b>Hello" | ./bin/cli.js
<b Hello
When a table is followed by either h1, h2, p, or div there is no newline between them.
For example:
<html>
<body>
<table>
<tr>
<td>Issue Date:</td>
<td>Wed, 01 Oct 2014</td>
</tr>
</table>
<p>Passengers</p>
</body>
</html>
becomes:
Issue Date: Wed, 01 Oct 2014Passengers
instead of:
Issue Date: Wed, 01 Oct 2014
Passengers
I use this call:
var text = HtmlToText.fromString(html)
They're simple yet effective. :-)
Hello there,
Please update the npm package. It seems the published package doesn't have the new link options, which are really useful.
Thanks for this great project,
Afshin
Is there an option available to disable the "Transform headlines to uppercase text" functionality?
require('html-to-text').fromString('<img />')
require('html-to-text').fromString('<img>')
TypeError: Cannot read property 'alt' of undefined
at Object.formatImage [as image] (node_modules/html-to-text/lib/formatter.js:14:18)
at node_modules/html-to-text/lib/html-to-text.js:82:24
at Function._.each._.forEach (node_modules/html-to-text/node_modules/underscore/underscore.js:103:9)
at walk (node_modules/html-to-text/lib/html-to-text.js:77:4)
at htmlToText (node_modules/html-to-text/lib/html-to-text.js:32:15)
at Object.exports.fromString (node_modules/html-to-text/lib/html-to-text.js:149:9)
at repl:1:26
at REPLServer.self.eval (repl.js:110:21)
at Interface.<anonymous> (repl.js:239:12)
at Interface.emit (events.js:95:17)
I use node-html-to-text to convert html content before writing it to a csv/xlsx file. I don't want word wrapping when doing so.
I currently pass { wordwrap: NaN } and this disables word wrapping, because any comparison of length + word.length against NaN is false (wordwrap() in helper.js).
I suggest that { wordwrap: null } be used to disable word wrapping.
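To illustrate the suggestion, here is a simplified greedy word wrap in which a limit of null explicitly disables wrapping. It is a sketch, not the code from helper.js. The function body and the `limit == null` check are assumptions made for the example. It also shows why NaN happens to work: every comparison against NaN evaluates to false, so the "line is too long" branch is never taken.

```javascript
// Simplified greedy word wrap; NOT the library's helper.js implementation.
// A limit of null or undefined is treated as "do not wrap" (the proposed behaviour).
function wordwrap(text, limit) {
  const words = text.split(/\s+/).filter(Boolean);
  if (limit == null) {
    return words.join(' ');
  }
  const lines = [];
  let current = '';
  for (const word of words) {
    // Start a new line when appending the word would exceed the limit.
    // With limit = NaN this comparison is always false, so no line is ever broken.
    if (current.length > 0 && current.length + 1 + word.length > limit) {
      lines.push(current);
      current = word;
    } else {
      current = current.length > 0 ? `${current} ${word}` : word;
    }
  }
  if (current.length > 0) lines.push(current);
  return lines.join('\n');
}

console.log(wordwrap('one two three four', 9));    // wrapped onto three lines
console.log(wordwrap('one two three four', null)); // single line
console.log(wordwrap('one two three four', NaN));  // single line, by accident of NaN semantics
```

An explicit null check documents the intent, whereas relying on NaN depends on how the comparison happens to be written.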
It looks like table formatting is broken after installing 2.1.3 instead of 2.1.0.
htmltotext.fromString('<table><tr><th>Key</th><th>Value</th></tr><tr><td>status</td><td>edit</td></tr></table>');
results in
KeyValuestatusedit
There is a problem when converting an <a> tag that contains & in its href value and in its text.
var htmlToText = require('html-to-text');
var text = htmlToText.fromString('<a href="https://www.mywebsite.com?param1=value1&param2=value2">Click & Go</a>');
console.log(text); // Outputs: 'Click �param2=value2]'
htmlToText.fromString('<div>one</div><div>two</div>') returns one two. Why not one\ntwo?
When I set wordwrap: false, the string <li>these words each end on a new line</li> becomes these\nwords\neach\nend\non\na\nnew\nline.
If I do not set wordwrap: false, the words are not split onto separate lines.
scenario:
var htmlToText = require('html-to-text');
htmlToText.fromString('😂');
The result is incorrect:
htmlToText.fromString('😂')
''
Quick fix: replace String.fromCharCode with String.fromCodePoint (polyfill: https://github.com/mathiasbynens/String.fromCodePoint) in https://github.com/werk85/node-html-to-text/blob/master/lib/helper.js#L48
require('string.fromcodepoint');
String.fromCodePoint(128514)
'😂'
String.fromCharCode(128514)
''
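The difference comes from how the two functions treat values above 0xFFFF: String.fromCharCode keeps only the low 16 bits, while String.fromCodePoint emits a surrogate pair. A small sketch of decoding a numeric character reference with the proposed function follows. The helper name `decodeNumericEntity` is hypothetical and is not the code in helper.js.

```javascript
// Hypothetical helper, not the library's code: decode a numeric character
// reference such as "&#128514;" or "&#x1F602;".
function decodeNumericEntity(entity) {
  const match = /^&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));$/.exec(entity);
  if (!match) return entity; // not a numeric reference: leave unchanged
  const codePoint = match[1] !== undefined
    ? parseInt(match[1], 16)
    : parseInt(match[2], 10);
  if (codePoint > 0x10ffff) return entity; // outside the Unicode range
  // fromCodePoint emits a surrogate pair for code points above 0xFFFF,
  // whereas fromCharCode would keep only the low 16 bits.
  return String.fromCodePoint(codePoint);
}

console.log(decodeNumericEntity('&#128514;') === '\u{1F602}'); // true
console.log(String.fromCharCode(128514) === '\uF602');         // true: 0x1F602 truncated to 0xF602
```

String.fromCodePoint is built into current Node.js releases, so the polyfill is only needed on runtimes that predate ES2015.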
-We appreciate your business. And we hope you'll check out our new products [http://example.com/] !
+We appreciate your business. And we hope you'll check out our new products [http://example.com/]!
Remove this - my fault...
I've found that the 80 char per line limit can be broken by inline formatting. Is there a recommended way to get this to behave itself or is this a bug?
<p>
<strong>This text isn't counted</strong> when calculating where to break a string for 80 character line lengths.
</p>
npm install ghost (currently at version 0.5) gives the warning:
npm WARN engine [email protected]: wanted: {"node":"~0.8.0"} (current: {"node":"0.10.30","npm":"1.4.21"})
Is it possible to increase the engine version in package.json, or is there some strict dependency on node 0.8.x?
I can't figure out why linkHrefBaseUrl is not applied in the html version but is applied correctly in the converted text version. I have set linkHrefBaseUrl in the options and the link is correctly built for the text, but the base URL is missing in the html, resulting in an incomplete href: href=3D'/auth/verify'
I have a very basic use case so far:
import nodemailer from 'nodemailer'
import htmlToText from 'html-to-text'
const Mailer = nodemailer.createTransport(someTransporter(options))
const htmlToTextOptions = {
  linkHrefBaseUrl: 'http://domain.com'
}
// Generate the plain-text part from the html part when no text is given
Mailer.use('compile', async (mail, next) => {
  if (!mail.data.text && mail.data.html) {
    mail.data.text = await htmlToText.fromString(mail.data.html, htmlToTextOptions)
  }
  next()
})
const verificationUrl = '/auth/verify'
let body = `<b>Awesome sauce</b><br>
<a href='${verificationUrl}'>Verify Account</a><br>
Thank you!`
let mail = {
  to: 'some@email.com',
  from: '[email protected]',
  subject: 'Registration',
  html: body
}
Mailer.sendMail(mail, (err, res) => {...})
Environment:
node 4.2
nodemailer: 2.0.0
node-html-to-text: 1.5.1
These are the resulting plain and html types as received by any mail client:
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Mime-Version: 1.0
Awesome sauce
Verify Account [http://domain.com/auth/verify]
Thank you!=
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Mime-Version: 1.0
Awesome sauce
<a href=3D'/auth/verify'>Verify Account
Thank you!=
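Since linkHrefBaseUrl is passed only to html-to-text in the snippet above, it presumably affects only the generated text and leaves the html body as written. One possible workaround, sketched here as an assumption rather than a documented feature of either library, is to build the absolute URL before composing the html, using the URL class built into Node.js:

```javascript
// Assumed workaround: resolve the relative path against the base URL up front,
// so both the html part and the derived text part carry the full link.
const baseUrl = 'http://domain.com';
const verificationUrl = new URL('/auth/verify', baseUrl).href;

const body = `<b>Awesome sauce</b><br>
<a href='${verificationUrl}'>Verify Account</a><br>
Thank you!`;

console.log(verificationUrl);
```

With an absolute href in the html, the linkHrefBaseUrl option is no longer needed for this link.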
Steps to reproduce:
Use html-to-text version 1.2.1.
Generate an HTML file containing many <wbr> tags. See the sample Python script.
Run: cat test-bad.html | html-to-text > out.txt
Result: 'RangeError: Maximum call stack size exceeded'
# Python script to generate HTML
f = open('test-bad.html','w')
f.write('<!DOCTYPE html><html><head></head><body>\n')
for n in range(2000):
    f.write('<wbr>n')
f.write('</body></html>')
f.close()
This is also reproducible using invalid HTML tags instead of <wbr> tags, e.g. <foo>.
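One plausible cause, which is an assumption and not confirmed by the report, is that the unclosed tags are parsed as deeply nested elements and a recursive tree walk then exhausts the call stack. Under that assumption, the usual remedy is to traverse with an explicit stack instead of recursion. The node shape `{ type, data, children }` below is hypothetical and only serves the illustration:

```javascript
// Hypothetical node shape: { type: 'text' | 'tag', data: string, children: [] }.
// Iterative depth-first traversal: depth is limited by heap memory, not by the call stack.
function collectText(root) {
  const parts = [];
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.type === 'text') {
      parts.push(node.data);
    }
    const children = node.children || [];
    // Push in reverse so children are visited in document order.
    for (let i = children.length - 1; i >= 0; i--) {
      stack.push(children[i]);
    }
  }
  return parts.join('');
}

// Build a chain of 100000 nested tag nodes, far deeper than the 2000 tags in the report.
let deep = { type: 'text', data: 'n', children: [] };
for (let i = 0; i < 100000; i++) {
  deep = { type: 'tag', data: 'wbr', children: [deep] };
}

console.log(collectText(deep));
```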
Given the following,
<p>Hello, <?php echo $name ?></p>
It makes no sense to strip out the <?php ... ?> part.
This applies to other template styles, such as <% ... %> or shorthand styles <?= ... ?>.
The simplest solution would probably be to have tag prefix/suffix pair exclusion rules,
e.g. [ ['<%', '%>'], [ '<?', '?>' ] ...
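Until such an option exists, the prefix/suffix pairs can be applied as a pre- and post-processing step around the conversion. The sketch below is an assumption-laden workaround, not a feature of html-to-text: the function names and the `@@TPLn@@` placeholder format are hypothetical, and it assumes the placeholder does not occur in the input and passes through the converter unchanged.

```javascript
// Hypothetical pre/post-processing step; html-to-text has no such option in this report.
const PAIRS = [['<%', '%>'], ['<?', '?>']];

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace each template tag with a placeholder and remember the original text.
function protectTemplateTags(html, pairs = PAIRS) {
  const saved = [];
  let result = html;
  for (const [open, close] of pairs) {
    const pattern = new RegExp(`${escapeRegExp(open)}[\\s\\S]*?${escapeRegExp(close)}`, 'g');
    result = result.replace(pattern, (match) => {
      saved.push(match);
      return `@@TPL${saved.length - 1}@@`;
    });
  }
  return { html: result, saved };
}

// Put the original template tags back into the converted text.
function restoreTemplateTags(text, saved) {
  return text.replace(/@@TPL(\d+)@@/g, (_, i) => saved[Number(i)]);
}

const protectedInput = protectTemplateTags('<p>Hello, <?php echo $name ?></p>');
console.log(protectedInput.html); // <p>Hello, @@TPL0@@</p>
console.log(restoreTemplateTags('Hello, @@TPL0@@', protectedInput.saved)); // Hello, <?php echo $name ?>
```

The protected html would be passed to the converter in place of the original, and restoreTemplateTags applied to the converter's output.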
Hello.
Any idea how I would go about selecting all divs that have an id starting with pf, like pf1, pf2, etc.?
baseElement: ['div#pf * ???'] Is there a wildcard match for this?
This code:
var request = require('request');
var htt = require('html-to-text');
request('http://bashmodernquantity.com/bash-modern-quantity/2014/1/10/wool-and-copper',function(err,res,body){
var text = htt.fromString(body);
console.log(text);
});
This logs several URLs, like https://twitter.com/thomasqbrady, which only occur in the HTML document as attribute values (of HREF) on anchor tags. A few anchor IDs show up too, which is obvious since they start with #.
Looks like there isn't an option to keep video elements in the text.
Love the support for <img/> elements; it would be nice to have similar support for <video>.
When I run the tests I get the following error:
Uncaught TypeError: expect(...).to.be.null is not a function
It comes from the following lines:
https://github.com/werk85/node-html-to-text/blob/master/test/html-to-text.js#L48
https://github.com/werk85/node-html-to-text/blob/master/test/html-to-text.js#L51
The correct syntax should be
expect(err).to.be.null;
Hi,
I have started developing with html-to-text and discovered that it expects every anchor to have an href and crashes when it doesn't find one.
Changing line 25 in formatter.js from
return elem.attribs.href.replace(/^mailto\:/, '');
to
if (elem.attribs && elem.attribs.href) {
  return elem.attribs.href.replace(/^mailto\:/, '');
} else {
  return helper.wordwrap(helper.decodeHTMLEntities(_s.strip(elem.raw)), options.wordwrap);
}
fixes this.
Thank you!
Normal <strong>Bold</strong> results in NormalBold
Is it possible to add ANSI styles to the output?
Is there a way to set HtmlToText to output tab characters to separate html table cells instead of making fixed width columns?
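For reference, the requested output is straightforward to produce once the cells are available as plain strings. The sketch below assumes a hypothetical table model (an array of rows, each an array of cell strings) and is not an existing html-to-text option:

```javascript
// Hypothetical table model: an array of rows, each an array of cell strings.
// Cells are joined with tabs and rows with newlines; no column padding is applied.
function formatTableAsTsv(rows) {
  return rows
    .map((cells) =>
      cells
        // Collapse internal whitespace so a cell can never contain a tab or newline.
        .map((cell) => cell.replace(/\s+/g, ' ').trim())
        .join('\t'))
    .join('\n');
}

console.log(formatTableAsTsv([
  ['Key', 'Value'],
  ['status', 'edit'],
]));
```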
npm WARN engine [email protected]: wanted: {"node":"~0.8.0"} (current: {"node":"0.10.36","npm":"2.3.0"})
Everything seems to work fine outside of these boundaries. Maybe you can relax the engine requirements to include Node 0.10, 0.12, and IO.js 1.x.x?
Please, now it's really hard to realise what has changed between versions.
Thanks for the package btw! Using it at Trustroots.
The example test.html is not being formatted correctly. The table format is not being preserved in the plain text.
Noticed that h5 isn't converted to uppercase, but more importantly that there is no whitespace between the heading and what follows next.
While integrating your html-to-text npm module, I ran into an issue where words that are formatted (bold, italic, etc.) would be converted without their surrounding spaces. As a result, the output has the words jammed together. My fix makes sure that any surrounding spaces are converted to a single space. That allows the output to be readable regardless of the formatting or word wrapping.
Proposed solution in formatter.js:
function formatText(elem, options) {
  var lws = elem.raw.match(/^\s+/g);
  var rws = elem.raw.match(/\s+$/g);
  var text = helper.decodeHTMLEntities(elem.raw);
  return (lws != null ? ' ' : '') + helper.wordwrap(text, options.wordwrap) + (rws != null ? ' ' : '');
}
To test, create a sentence with a bold word in it.
var htmlToText = require('html-to-text');
var text = htmlToText.fromString("This is a <a href='url'>link</a>.");
console.log(text);
will print This is a link [url] . (note the space before the period). This makes it problematic to use, for example when a url comes at the end of a sentence.
It also seems to add spaces before the link.
I appreciate this is a node library. Is there a version of this or anything similar that works in the browser?
This line pollutes the cli output.
Instead, visionmedia/debug may be used for this purpose.