soundasleep / html2text
A PHP component to convert HTML into a plain text format
License: MIT License
https://github.com/soundasleep/html2text/blob/master/src/Html2Text.php#L291-294
<a href="http://foo.com">http://foo.com</a>
is converted to http://foo.com
instead of [http://foo.com](http://foo.com)
The latter is the preferred behavior. Example use case: URLs are shortened in the href but not in the description.
static function iterateOverNode($node, $prevName = null, $in_pre = false, $is_office_document = false, $options)
So if you pass multi-line plain text, it returns the same text on a single line.
Looks like PHP 8.2 is producing this error:
Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead.
I tried to fix it as the error suggests in a couple of ways, but the tests are breaking. After my attempted fixes, the tests seem to expect HTML back but are returning plain text, which is the whole point of the package, so I'm not sure what is going on.
An update for PHP 8.2 would be greatly appreciated. Let me know of anything I can do to help.
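One direction the deprecation message itself points to is mb_encode_numericentity. The following is only a minimal sketch: it assumes the deprecated call has the form mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8') (inferred from the warning text, not checked against the library source), and the helper name is hypothetical. The output may not match the old call byte for byte, since this version always emits numeric entities.

```php
<?php
// Hypothetical helper, not part of html2text: encode every non-ASCII
// character as a numeric entity so the HTML parser does not have to guess
// the input encoding. Requires the mbstring extension.
function encodeNonAsciiAsEntities(string $html): string
{
    // convmap: [first code point, last code point, offset, mask]
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

echo encodeNonAsciiAsEntities('<p>für Ihre</p>'), "\n"; // <p>f&#252;r Ihre</p>
```

ASCII markup passes through unchanged, so tags and attributes are not affected.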
Html special chars replacing: &lt;script&gt;alert(1)&lt;/script&gt; => <script>alert(1)</script>
From https://code.google.com/p/iaml/issues/detail?id=286&q=html2text with patch
Dear Jevon and Team,
Appreciate your effort in maintaining this library. We just started using this library and noticed a small issue that you may have already addressed.
Our input HTML text contains a valid '<' character as part of the content (not as an HTML tag). The library's DOM parser seems to be stripping it out. Is there a way we can escape that character before sending the input to your library?
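If the '<' comes from plain text that is embedded into the HTML by the calling application, one option is to escape that text before it is embedded. A minimal sketch, assuming the caller controls how the HTML is assembled; the helper name is hypothetical and this does not help when the unescaped '<' is already inside a finished HTML string.

```php
<?php
// Hypothetical helper: wrap user-supplied plain text in a paragraph,
// escaping characters that an HTML parser would otherwise treat as markup.
function plainTextToHtmlParagraph(string $text): string
{
    return '<p>' . htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8') . '</p>';
}

echo plainTextToHtmlParagraph('if a < b then stop'), "\n"; // <p>if a &lt; b then stop</p>
```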
Hi, it would be nice if the alt attribute could be used for images. Sometimes converting images that are wrapped inside an a-tag returns empty text:
[ ](http://www.example.com/)
Hello, there is no clear information in this repository, nor a link to documentation, that states the minimum PHP version required to run the current version of the html2text library.
With PHP releases moving very fast now, this is becoming very important information. Projects using PHP dependencies currently have to handle many libraries dropping PHP 5.x and early PHP 7.x (at least 7.0). It would be very helpful to share this information for html2text.
(BTW: thanks for the move to MIT ;-) )
Hello !
There is some code doing intentional removal of <br> nodes when they are the last child of a node that also contains text. Here's a very simple example of how this can lead to incorrect results (this is stuff I'm receiving from bad HTML emails):
<font size="+1">Vikings: Wolves of Midgard<br></font><font size="+1">Valkyria Chronicles<br>
<br>
World Of Warcraft Battlechest</font>
The expected output would be
Vikings: Wolves of Midgard
Valkyria Chronicles
World Of Warcraft Battlechest
The actual output is:
Vikings: Wolves of MidgardValkyria Chronicles
World Of Warcraft Battlechest
I'm getting various invalid entity errors when using on what I 'think' are HTML5 pages.
Example:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity,
Example HTML -> pick any page on www.independent.ie
This may be similar/related to the same issues happening here with the DOM document: http://stackoverflow.com/questions/6090667/php-domdocument-errors-warnings-on-html5-tags
This is a fantastic HTML2TEXT converter. A huge thanks and congratulations to the dev. I hope there's a work around better than turning off errors for this issue.
Cheers.
Unicode text like "Добрый день!" is shown after conversion as "Ð�обÑ�Ñ�й денÑ�!"
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8" ?>
<html>
<body>
<div dir="ltr"><img src="cid:ii_l388tk2h0" alt="obs-layout.png" width="562" height="237"><br></div><br>
</body>
</html>
[obs-layout.png]
![obs-layout.png](cid:ii_l388tk2h0)
Possible?
Works great with AWS SES as long as you don't have attachments and must send email as RAW. When email is sent as RAW, text causes delivery failure.
It might go into the direction of #35, but it's a little different use case.
I have a mailchimp-email template and they come with weird unsupported entities.
Changing Html2Text.php around line 50 to this fixes it. The text-version of my email looks perfect, it was just throwing ugly exceptions.
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
if (!$doc->loadHTML($html)) {
throw new Html2TextException("Could not load HTML - badly formed?", $html);
}
libxml_clear_errors();
BTW: GREAT WORK WITH THIS TOOL!
I have php 8.2 and I get this error when i call convert function:
E_DEPRECATED: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in /composer/vendor/soundasleep/html2text/src/Html2Text.php on line 54
Originally reported in html2text_ruby: soundasleep/html2text_ruby#5
This test case needs to be copied over to html2text.
Hi, how about a new release including #19 ?
Thanks! :-)
From Claudio Thomas https://code.google.com/p/iaml/issues/detail?id=288:
A small suggestion:
Line 195 change from
if ($href == $output) {
to:
if ($href == $output || $href == "mailto:$output" || $href == "http://$output") {
Typically, URLs with the mentioned prefixes are self-explanatory, so without this check you see the same text twice in the output.
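The suggested comparison can be sketched as a small standalone predicate. The function name is hypothetical, and only the two prefixes named in the suggestion (mailto: and http://) are handled.

```php
<?php
// Hypothetical helper: true when the link text merely repeats the href,
// optionally without its "mailto:" or "http://" prefix.
function linkTextRepeatsHref(string $href, string $text): bool
{
    return $href === $text
        || $href === 'mailto:' . $text
        || $href === 'http://' . $text;
}

var_dump(linkTextRepeatsHref('mailto:me@example.com', 'me@example.com')); // bool(true)
var_dump(linkTextRepeatsHref('http://foo.com', 'click here'));            // bool(false)
```

A converter could use such a check to print the URL once instead of emitting both the text and the href.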
>>> $html = '<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>'
=> "<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>"
>>> \Soundasleep\Html2Text::convert($html);
PHP Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /Users/maxime/Repos/benevolt-app/vendor/soundasleep/html2text/src/Html2Text.php on line 171
=> "[ok](https://www.google.com?utm_source=croix&utm_campaign=croix)"
>>> $html = '<a href="https://www.google.com?utm_source=croix">ok</a>'
=> "<a href="https://www.google.com?utm_source=croix">ok</a>"
>>> \Soundasleep\Html2Text::convert($html);
=> "[ok](https://www.google.com?utm_source=croix)"
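The warning appears only for the URL with a bare & in it, which the parser reads as the start of an entity. One possible workaround on the caller's side is to escape ampersands that do not begin a well-formed entity before converting. A minimal sketch with a hypothetical helper name; the regular expression is a heuristic and does not check whether a named entity actually exists.

```php
<?php
// Hypothetical pre-processing step: turn "&" into "&amp;" unless it already
// starts something shaped like a named or numeric entity ("&amp;", "&#169;",
// "&#xA9;").
function escapeBareAmpersands(string $html): string
{
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $html);
}

echo escapeBareAmpersands('<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>'), "\n";
// <a href="https://www.google.com?utm_source=croix&amp;utm_campaign=croix">ok</a>
```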
Apologies for the noise, but is there a new release planned? We'd like to make use of the blockquote support.
Hi everyone! I'm no longer a PHP dev and I've run out of capacity to maintain this project, so I'm looking for some maintainers going forward. Alternatively I can archive the project as read-only.
Ideal criteria:
Other than that I'm happy for maintainers to take this project into whatever direction it needs to go! :)
For the future of this project I'd suggest some of the most critical tasks are
Hi,
We've been waiting for an upgrade of this package for compatibility with PHP 8 (#88) for a few months, and even though a fix has been proposed in #86 and #87, no response has been received.
In fact, the last commit & the last release date back to Feb 2019, and so does the last comment activity from @soundasleep I could pinpoint from a quick search in the issue tracker, even though you look active on GitHub recently.
Should we consider this project abandoned and fork it, @soundasleep? Or do you need help from fellow maintainers? I'm happy to take over the project, but I don't want to if you may be willing to pursue it at some point.
I hope you take no offense, it's open-source and it's OK if you cannot/don't want to maintain the project anymore. But please let us know! Thank you.
Sorry to make this an issue, but it's, well, an issue.
This library's still under fairly active development, but the last release was 10 months ago - 19 June. Do you think we could get a new release, so that we can get features like those in #43 without having to target dev-master, which I'd really hate to have to do
Hi There,
I was using this great script in a function and came looking to see if it had been updated. There seem to be quite a lot of changes. However, the change to static functions seems to break my scenario.
I am a hobbyist / old-school PHP guy, so any guidance or pointers would be greatly appreciated. The old html2text.php file as a standalone effort was brilliant and worked very well.
I am getting the following errors:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag time invalid in Entity, line: 307 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: ID makeComment already defined in Entity, line: 391 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag section invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 563 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 576 in /----->/src/Html2Text.php on line 40 on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 589 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 601 in /----->/src/Html2Text.php on line 40
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 613 in /----->/src/Html2Text.php on line 40
I am using your lib to extract text from html pages to be used as search data. The links, especially the urls are not useful. Can you add an option to discard them?
Thanks a lot!
On a new server environment, I was getting the error Fatal error: Class 'DOMDocument' not found until an excessive amount of googling led me to try running (on Ubuntu) apt-get install php7.1-xml, after which everything worked.
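A quick way to spot this kind of problem before the fatal error is to check for the PHP extensions up front. A minimal sketch; the default list (dom, libxml, mbstring) is an assumption based on the DOMDocument and mb_convert_encoding calls quoted in these reports, and the function name is hypothetical.

```php
<?php
// Hypothetical environment check: return the names of required extensions
// that are not loaded in the running PHP build.
function missingExtensions(array $required = ['dom', 'libxml', 'mbstring']): array
{
    $missing = [];
    foreach ($required as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = $extension;
        }
    }
    return $missing;
}

$missing = missingExtensions();
echo $missing === [] ? "All extensions present\n" : 'Missing: ' . implode(', ', $missing) . "\n";
```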
If I use html2text in my local project, everything works, but if I load the project into a subfolder (/domain/subfolder) I get the error Uncaught Error: Class 'Html2Text\Html2Text' not found.
Hi,
I've found a very weird case where html2text returned broken output after upgrading to 0.3.0, and I was able to narrow it down to this line:
$html = str_replace("\xa0", " ", $html);
I examined the contents of $html in a hex editor immediately before and after the line and found this diff. The original source snippet reads "für Ihre ", with the spaces being nbsp's, and is UTF-8 encoded.
Before: 66 c3 bc 72 c2 a0 49 68 72 65 c2 a0
After: 66 c3 83 c2 bc 72 c3 82 20 49 68 72 65 c3 82 20
So apparently the nbsp's (c2 a0) have been transformed into c3 82 20, which looks like a regular space (20) but with some gibberish in front of it. Also, the multi-byte character 'ü' (c3 bc) is now c3 83 c2 bc, which is also nonsensical.
I've downgraded to 0.2.3 and all is fine now, but I'd like to let you know in case you'd like to look into this.
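Part of this can be reproduced in isolation: in UTF-8 a non-breaking space is the two-byte sequence c2 a0, so replacing the single byte a0 leaves an orphaned c2 behind. The sketch below only demonstrates that byte-level effect; it does not reproduce the additional re-encoding visible in the dump (c3 becoming c3 83), which presumably happens elsewhere.

```php
<?php
// "für Ihre " with UTF-8 non-breaking spaces, written as raw bytes.
$html = "f\xc3\xbcr\xc2\xa0Ihre\xc2\xa0";

// Byte-wise replacement: strips only the second byte of each NBSP.
$byteWise = str_replace("\xa0", " ", $html);

// Replacing the whole two-byte sequence keeps the string valid UTF-8.
$wholeSequence = str_replace("\xc2\xa0", " ", $html);

var_dump(mb_check_encoding($byteWise, 'UTF-8'));      // bool(false)
var_dump(mb_check_encoding($wholeSequence, 'UTF-8')); // bool(true)
echo bin2hex($wholeSequence), "\n";                   // 66c3bc72204968726520
```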
This tool is great, I follow the README and have the composer environment ready with docker container.
$ cat composer.json
{
"require": {
"soundasleep/html2text": "~0.2"
}
}
$ docker pull composer/composer
$ docker run -v $(pwd):/app composer/composer install
Loading composer repositories with package information
Installing dependencies (including require-dev)
- Installing soundasleep/html2text (0.2.3)
Downloading: 100%
Writing lock file
Generating autoload files
$ ls -l
-rw-r--r-- 1 bill staff 59 26 Nov 17:44 composer.json
drwxr-xr-x 5 bill staff 170 26 Nov 17:50 vendor
-rw-r--r-- 1 bill staff 2322 26 Nov 17:50 composer.lock
So it seems I have installed the dependency properly. What else do I need to do to convert an HTML file to a text file?
Something like:
$ cat convert.php
<?php
require '/var/www/html/vendor/autoload.php';
$text = Html2Text\Html2Text::convert($html);
?>
$ php convert.php test.html test.txt
Warning: DOMDocument::loadHTML(): Empty string supplied as input in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 43
Fatal error: Uncaught exception 'Html2Text\Html2TextException' with message 'Could not load HTML - badly formed?' in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php:44
Stack trace:
#0 /Users/bill/pdf/convert.php(3): Html2Text\Html2Text::convert(NULL)
#1 {main}
thrown in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 44
So how do I pass in the input file (test.html) and get an output file (test.txt) with the php command?
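The stack trace shows convert(NULL): $html is never assigned, because the script does not read the file named on the command line. A minimal sketch of one way to wire that up; convertFile is a hypothetical helper, not part of the library, and the converter is passed in as a callable so the same code works with whichever class name the installed version uses.

```php
<?php
// Hypothetical helper, not part of html2text: read HTML from one file, pass
// it through a converter callable, and write the resulting text to another.
function convertFile(string $inputPath, string $outputPath, callable $converter): void
{
    $html = file_get_contents($inputPath);
    if ($html === false || $html === '') {
        throw new RuntimeException("Could not read any HTML from $inputPath");
    }
    if (file_put_contents($outputPath, $converter($html)) === false) {
        throw new RuntimeException("Could not write $outputPath");
    }
}

// In convert.php, after requiring vendor/autoload.php, usage might look like:
//   convertFile($argv[1], $argv[2], ['Html2Text\Html2Text', 'convert']);
// and then on the command line:
//   php convert.php test.html test.txt
```

The class name in the usage comment is taken from the stack trace above (version 0.2.3); other reports in this thread use \Soundasleep\Html2Text, so it depends on the installed version.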
I have some user provided content in my application. Sometimes user pass content that is unexpected.
Run the following code to recreate the error.
$content = "lorem:ipsum";
$textContent = \Soundasleep\Html2Text::convert($content);
Note: the character is not visible here but can be seen when copied into an editor. The character comes right after the colon ':'.
In the database it is stored as a vertical tab (https://www.fileformat.info/info/unicode/char/000b/index.htm), but when I output it and convert it, it fails with the error: DOMDocument::loadHTML(): Invalid char in CDATA 0xB in Entity
Is there a list of unicode characters which are not supported in this way?
Is there another way to encode them in the right way?
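One workaround on the caller's side is to strip control characters from user input before converting. A minimal sketch with a hypothetical helper name; the set removed here is the C0 control range except tab, line feed and carriage return, which are the only C0 characters XML 1.0 allows. Whether that matches exactly what the HTML parser rejects is an assumption.

```php
<?php
// Hypothetical sanitizer: remove C0 control characters (such as the vertical
// tab, 0x0B) while keeping tab (0x09), line feed (0x0A) and carriage
// return (0x0D). Safe on UTF-8 because these bytes never occur inside a
// multi-byte sequence.
function stripControlCharacters(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

echo stripControlCharacters("lorem:\x0Bipsum"), "\n"; // lorem:ipsum
```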
I have an application in production which relies on user input, which means badly authored HTML, so I get a lot of warning messages like these:
[27-Mar-2016 06:26:01] PHP Warning: DOMDocument::loadHTML(): ID templateBody already defined in Entity, line: 1340 in /var/.../html2text/src/Html2Text.php on line 44
[27-Mar-2016 06:26:01] PHP Warning: DOMDocument::loadHTML(): ID templateBody already defined in Entity, line: 1367 in /var/../html2text/src/Html2Text.php on line 44
Is it possible to add an option to dismiss all those warnings without breaking the error handling? I know there's a function to disable libxml2's error handling, but I'm not in the mood for discovering the hard way that I broke something else.
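One way to limit the side effects is to switch libxml to internal error handling only for the duration of the parse and then restore whatever setting was active before. A minimal sketch of that pattern using standard libxml functions; loadHtmlQuietly is a hypothetical helper, not an option the library is known to provide.

```php
<?php
// Hypothetical helper: parse HTML without emitting PHP warnings, return the
// collected libxml errors, and restore the previous error-handling mode.
function loadHtmlQuietly(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    try {
        $doc = new DOMDocument();
        $loaded = $doc->loadHTML($html);
        $errors = libxml_get_errors(); // array of LibXMLError objects
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
    if (!$loaded) {
        throw new RuntimeException('Could not load HTML');
    }
    return [$doc, $errors];
}

[$doc, $errors] = loadHtmlQuietly('<div id="a">x</div><div id="a">y</div>');
echo trim($doc->textContent), "\n";
echo count($errors), " parser message(s) collected\n";
```

Because the previous mode is restored in the finally block, other code that relies on libxml warnings keeps its behavior.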
https://github.com/soundasleep/html2text/blob/master/src/Html2Text.php#L298 specifies to output two LF characters per paragraph. Would you accept a PR which allows one to change the number of LF characters that are output?
The use case is similar to MsoNormal, where no margin exists on the paragraphs. For example:
<p>foo</p><p> </p><p>bar</p><p>quz</p>
Output:
foo
bar
quz
For a list like this now:
<ol>
<li>foo</li>
<li>bar</li>
<li>baz</li>
</ol>
The text would become
foobarbaz
But I would at least want it to become
foo
bar
baz
But even better:
1. foo
2. bar
3. baz
Similar for <ul>.
Maybe someone already has implemented this?
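The numbered form asked for above can be sketched outside the library with DOMDocument. This is only an illustration under simplifying assumptions: the function name is hypothetical, bullets for unordered lists use "- ", and nested lists are not handled (the text of a nested list would be repeated inside its parent item).

```php
<?php
// Hypothetical sketch: render <li> items as numbered lines for <ol> and as
// dash-prefixed lines for any other parent (such as <ul>).
function listItemsToText(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $lines = [];
    $counters = []; // one running number per <ol>, keyed by its node path
    foreach ($doc->getElementsByTagName('li') as $item) {
        $parent = $item->parentNode;
        $text = trim($item->textContent);
        if ($parent->nodeName === 'ol') {
            $key = $parent->getNodePath();
            $counters[$key] = ($counters[$key] ?? 0) + 1;
            $lines[] = $counters[$key] . '. ' . $text;
        } else {
            $lines[] = '- ' . $text;
        }
    }
    return implode("\n", $lines);
}

echo listItemsToText('<ol><li>foo</li><li>bar</li><li>baz</li></ol>'), "\n";
```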
From https://code.google.com/p/iaml/issues/detail?id=284 with patch
- fixed two new lines for hN tags
- added table cells separation (by tab char)
- added new lines for table and tr tags
As in the heading, &nbsp; becomes Ã
Can easily be fixed with:
$cleaner = str_replace('&nbsp;', ' ', $inHTML);
$outText = convert_html_to_text($cleaner);
Not sure if any other characters like that aren't handled correctly. I wish I had time to test more and provide a pull request.
Hi,
Just another deprecation triggered by php8.1…
( ! ) Deprecated: Optional parameter $prevName declared before required parameter $options is implicitly treated as a required parameter in /[...]/soundasleep/html2text/src/Html2Text.php on line 231
Thanks for all
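The deprecation is about parameter order: a required parameter ($options) follows optional ones. One way to silence it without changing call sites is to give the last parameter a default as well. A minimal sketch using a stand-in function; the name and body are hypothetical and only the parameter list mirrors the signature from the warning.

```php
<?php
// Hypothetical stand-in for the method named in the warning. Only the
// parameter list matters here; the body just echoes the arguments back.
function iterateOverNodeSketch($node, $prevName = null, $in_pre = false, $is_office_document = false, $options = [])
{
    return [$prevName, $in_pre, $is_office_document, $options];
}

// Existing callers that pass all five arguments keep working unchanged.
var_dump(iterateOverNodeSketch('node', 'p', false, false, ['drop_links' => true]));
```

The alternative is to move $options ahead of the optional parameters, which would require updating every caller.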
My HTML contains a registered symbol ® which is breaking the conversion. Is there any way to fix this so that HTML entities are allowed?
The library is great. Any chance you could license it under MIT or BSD-3-Clause so I can use it at work, where they only let us use libraries licensed under MIT or BSD-3-Clause?
There are some changes in this project which do nice table converts: https://github.com/mtibben/html2text Could you merge these changes?
The title tag content doesn't show up in the content. How can I get that to show up?
From https://code.google.com/p/iaml/issues/detail?id=289 with patch
Hi,
When I call
Html2Text::convert("<a href='https://google.ru'></a>", ['drop_links' => true]);
I get the href instead of an empty string.
I think the result should be empty because I use the option 'drop_links' => true.
On your next major release that can introduce breaking changes, you should consider changing the namespace you use. I just tried to set up a script that preferred your DOM-parser based implementation, but falls back to this regex-based implementation if the input could not be parsed. But, you guys are using the exact same namespace and class name.
You should be \SoundAsleep\Html2Text perhaps.
In your code you remove 'unnecessary empty lines':
// remove unnecessary empty lines
$output = preg_replace("/\n\n\n*/im", "\n\n", $output);
This actually leads to the removal of intended empty lines (at least in my case). It would be great to have some configuration for things like this. Should I submit a PR for this?
Just to give you an example:
Hello <br><br><br> lets test this <br><br><br> Cheers!
is converted to
Hello
lets test this
Cheers!
But I would expect:
Hello
lets test this
Cheers!
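A configurable version of that cleanup step could look like the sketch below. The function and its $maxNewlines parameter are hypothetical, not an existing option of the library; with the default of 2 it behaves like the quoted preg_replace, and a higher value preserves more intentional blank lines.

```php
<?php
// Hypothetical variant of the "remove unnecessary empty lines" step: collapse
// runs of newlines longer than $maxNewlines down to exactly $maxNewlines.
function collapseNewlines(string $text, int $maxNewlines = 2): string
{
    if ($maxNewlines < 1) {
        throw new InvalidArgumentException('maxNewlines must be at least 1');
    }
    $pattern = '/\n{' . ($maxNewlines + 1) . ',}/';
    return preg_replace($pattern, str_repeat("\n", $maxNewlines), $text);
}

echo json_encode(collapseNewlines("Hello\n\n\n\nlets test this", 3)), "\n";
// "Hello\n\n\nlets test this"
```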
Hi there!
$html = "<a href='http://a.com'></a><a href='http://b.com'></a>";
dd(\Soundasleep\Html2Text::convert($html));
Produces http://a.comhttp://b.com, which produces incorrect HTML if placed through a markdown parser or auto link parser.
I think the output should be one of the following, preferring the ones first mentioned:
[](http://a.com)[](http://b.com)
http://a.com http://b.com, with an additional space after each link
Hi,
When I call
Html2Text::convert($body);
I randomly get such errors:
DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 51 line 171 code 2 file vendor/soundasleep/html2text/src/Html2Text.php ErrorException: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 52
DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 38
OS version: Ubuntu 18.04.2 LTS
PHP version: php 7.2.19
Symfony version: 2.8.51
soundasleep/html2text version: 1.1.0
The remaining href of a link, when it contains ampersands that are correctly encoded as &amp;, will be in the plaintext. That should not happen.
5 minutes later: okay, it leaves all encoded entities in the remaining string. Is that expected?
2 minutes later: I will have to look at my code a bit more closely; I see I'm injecting from different sources. Might not be an issue after all...