soundasleep / html2text

A PHP component to convert HTML into a plain text format

License: MIT License

PHP 1.20% HTML 98.80%

html2text's Issues

Suggestion: convert links where URL href = description

https://github.com/soundasleep/html2text/blob/master/src/Html2Text.php#L291-294

<a href="http://foo.com">http://foo.com</a> is converted to http://foo.com instead of [http://foo.com](http://foo.com)

The latter is the preferred behavior. Example use case: URLs are shortened in the href but not in the description.

ErrorException: Required parameter $options follows optional parameter $prevName

	static function iterateOverNode($node, $prevName = null, $in_pre = false, $is_office_document = false, $options)
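One way to avoid this error, sketched with a hypothetical stand-in function rather than the library's actual code, is to give the trailing parameter a default value so that no required parameter follows an optional one. The name `iterateOverNodeSketch` and the returned array are assumptions for illustration only.

```php
<?php
// Hypothetical stand-in for iterateOverNode(); only the parameter order matters here.
// Giving $options a default means no required parameter follows an optional one.
function iterateOverNodeSketch($node, $prevName = null, $in_pre = false, $is_office_document = false, array $options = [])
{
    // Return the received arguments so the call can be inspected.
    return [$prevName, $in_pre, $is_office_document, $options];
}

$result = iterateOverNodeSketch('node', null, false, false, ['drop_links' => true]);
```

An alternative would be to move `$options` ahead of the optional parameters, but that changes the call order for every existing caller.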

Removes line endings in plain text

If you pass multi-line plain text, it returns the same text collapsed onto one line.

PHP 8.2 Support

Looks like PHP 8.2 is producing this error:

Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead.

I tried to fix it as the error suggests in a couple of ways, but the tests are breaking. After my attempted fixes, the tests seem to expect HTML back but get plain text, which is the whole point of the package, so I am not sure what is going on.

An update for PHP 8.2 would be greatly appreciated. Let me know of anything I can do to help.
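The deprecation message itself names `mb_encode_numericentity` as a replacement. A minimal sketch follows, assuming the purpose of the deprecated call was to turn non-ASCII characters into numeric entities before parsing. The helper name `encodeNonAscii` is hypothetical, and whether this keeps the library's tests passing is not verified here.

```php
<?php
// Hypothetical helper: encode every non-ASCII character as a numeric entity.
function encodeNonAscii(string $html): string
{
    // Convert map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// "\u{00FC}" is the letter u with diaeresis, as in the German word "für".
$encoded = encodeNonAscii("<p>f\u{00FC}r</p>");
```

This requires the mbstring extension and leaves ASCII markup such as `<p>` untouched.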

Html special chars

Html special chars replacing &lt;script&gt;alert(1)&lt;/script&gt; => <script>alert(1)</script>

Improve PHPDoc documentation

From https://code.google.com/p/iaml/issues/detail?id=286&q=html2text with patch

Using < character as input to html2text

Dear Jevon and Team,
Appreciate your effort in maintaining this library. We just started using this library and noticed a small issue that you may have already addressed.
Our input HTML text contains a valid '<' character as part of the content (not as an HTML tag). The library's DOM parser seems to strip it out. Is there a way we can escape that character before sending it as input to your library?
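One possible approach, assuming the text fragments are available separately before they are embedded in markup, is to escape them with PHP's `htmlspecialchars` so that a literal `<` reaches the parser as `&lt;`. The sample text below is hypothetical.

```php
<?php
// Hypothetical user-supplied text containing a literal '<' and '&'.
$userText = 'Price < 100 & rising';

// Escape the text fragment only, then wrap it in real markup.
$escaped = htmlspecialchars($userText, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$html = '<p>' . $escaped . '</p>';
```

This does not help when the input is already a single HTML string in which tags and stray `<` characters are mixed, because escaping the whole string would also escape the tags.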

Use images alt attribute

Hi, it would be nice if the alt attribute could be used for images. Sometimes converting images that are wrapped inside an a tag returns empty text:

[ ](http://www.example.com/)

Minimal version of PHP ?

Hello, there is no clear information in this repository, nor a link to documentation, stating the minimum version of PHP required to run the current version of the html2text library.

With PHP releases moving very fast now, this is becoming very important information. Projects using PHP dependencies must currently handle many libraries dropping PHP 5.x and early PHP 7.x (at least 7.0). It would be very helpful to share information about this topic for html2text.

(BTW: thanks for the move to MIT ;-) )

Abusive removal of br nodes leads to incorrect output

Hello !

There is some code that intentionally removes <br> nodes when they are the last child of a node that also contains text. Here is a very simple example of how this can lead to incorrect results (this is the kind of content I receive in badly formed HTML emails):

<font size="+1">Vikings: Wolves of Midgard<br></font><font size="+1">Valkyria Chronicles<br>
<br>
World Of Warcraft Battlechest</font>

The expected output would be

Vikings: Wolves of Midgard
Valkyria Chronicles

World Of Warcraft Battlechest

The actual output is:

Vikings: Wolves of MidgardValkyria Chronicles

World Of Warcraft Battlechest

Invalid Entity Errors on HTML5

I'm getting various invalid entity errors when using on what I 'think' are HTML5 pages.

Example:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity,

Example HTML -> pick any page on www.independent.ie

This may be similar/related to the same issues happening here with the DOM document: http://stackoverflow.com/questions/6090667/php-domdocument-errors-warnings-on-html5-tags

This is a fantastic HTML2TEXT converter. A huge thanks and congratulations to the dev. I hope there's a work around better than turning off errors for this issue.

Cheers.

Unicode

Unicode text like "Добрый день!" is shown after conversion as "Ð�Ð¾Ð±Ñ�Ñ�Ð¹ Ð´ÐµÐ½Ñ�!"

`img` src not being passed through

Input

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8" ?>
<html>

<body>
    <div dir="ltr"><img src="cid:ii_l388tk2h0" alt="obs-layout.png" width="562" height="237"><br></div><br>
</body>

</html>

Actual output

[obs-layout.png]

Expected output

![obs-layout.png](cid:ii_l388tk2h0)

Possible?

Causes failure with AWS SES when sending RAW email with attachments

Works great with AWS SES as long as you don't have attachments and must send email as RAW. When email is sent as RAW, text causes delivery failure.

undefined entities - with fix

It might go into the direction of #35, but it's a little different use case.

I have a mailchimp-email template and they come with weird unsupported entities.

Changing Html2Text.php around line 50 to this fixes it. The text-version of my email looks perfect, it was just throwing ugly exceptions.

$doc = new \DOMDocument();
libxml_use_internal_errors(true);
if (!$doc->loadHTML($html)) {
    throw new Html2TextException("Could not load HTML - badly formed?", $html);
}
libxml_clear_errors();

BTW: GREAT WORK WITH THIS TOOL!

mb_convert_encoding is DEPRECATED in php 8.2

I have PHP 8.2 and I get this error when I call the convert function:
E_DEPRECATED: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in /composer/vendor/soundasleep/html2text/src/Html2Text.php on line 54

Strip zero width non joiners

Originally reported in html2text_ruby: soundasleep/html2text_ruby#5

This test case needs to be copied over to html2text.

New release?

Hi, how about a new release including #19 ?

Thanks! :-)

Reduce duplicate URLs within links

From Claudio Thomas https://code.google.com/p/iaml/issues/detail?id=288:

A small suggestion:
Line 195 change from
if ($href == $output) {
to:
if ($href == $output or $href == "mailto:$output" or $href == "http://$output") {

Typically the URLs with the mentioned prefixes are self explained, so that you see the same text twice in the text.
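The suggested condition can be sketched as a small standalone helper. The name `isRedundantHref` is hypothetical and not part of the library; it only restates the three comparisons proposed above.

```php
<?php
// Hypothetical helper: true when the href only repeats the link text,
// either exactly or with a "mailto:" or "http://" prefix.
function isRedundantHref(string $href, string $text): bool
{
    return $href === $text
        || $href === 'mailto:' . $text
        || $href === 'http://' . $text;
}
```

A caller could then print the text once instead of the `[text](href)` form whenever the helper returns true.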

error for correct url with multiple get params

>>> $html = '<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>'
=> "<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>"

>>> \Soundasleep\Html2Text::convert($html);
PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /Users/maxime/Repos/benevolt-app/vendor/soundasleep/html2text/src/Html2Text.php on line 171
=> "[ok](https://www.google.com?utm_source=croix&utm_campaign=croix)"

>>> $html = '<a href="https://www.google.com?utm_source=croix">ok</a>'
=> "<a href="https://www.google.com?utm_source=croix">ok</a>"

>>> \Soundasleep\Html2Text::convert($html);
=> "[ok](https://www.google.com?utm_source=croix)"
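The only difference between the two inputs is the bare `&` in the first URL, which an HTML parser reads as the start of an entity reference. A minimal sketch, assuming the caller builds the HTML itself, is to encode the URL with `htmlspecialchars` before placing it in the attribute.

```php
<?php
$url = 'https://www.google.com?utm_source=croix&utm_campaign=croix';

// Encode '&' as '&amp;' so the attribute value is valid HTML.
$html = '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">ok</a>';
```

When the HTML comes from a third party and cannot be changed, this encoding step is not available and the warning would have to be handled on the parsing side instead.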

New release

Apologies for the noise, but is there a new release planned? We'd like to make use of the blockquote support.

Looking for maintainer

Hi everyone! I'm no longer a PHP dev and I've run out of capacity to maintain this project, so I'm looking for some maintainers going forward. Alternatively I can archive the project as read-only.

Ideal criteria:

You have at least one project on GitHub
You have experience releasing components to Composer

Other than that I'm happy for maintainers to take this project into whatever direction it needs to go! :)

For the future of this project I'd suggest some of the most critical tasks are

Move CI from travis-ci to Github Actions
Update to work under PHP 8 e.g. #87

💀 Dead project?

Hi,

We've been waiting for an upgrade of this package for compatibility with PHP 8 (#88) for a few months, and even though a fix has been proposed in #86 and #87, no response has been received.

In fact, the last commit & the last release date back to Feb 2019, and so does the last comment activity from @soundasleep I could pinpoint from a quick search in the issue tracker, even though you look active on GitHub recently.

Should we consider this project abandoned and fork it, @soundasleep? Or do you need help from fellow maintainers? I'm happy to take over the project, but I don't want to if you may be willing to pursue it at some point.

I hope you take no offense, it's open-source and it's OK if you cannot/don't want to maintain the project anymore. But please let us know! Thank you.

No new release for 10 months

Sorry to make this an issue, but it's, well, an issue.

This library's still under fairly active development, but the last release was 10 months ago - 19 June. Do you think we could get a new release, so that we can get features like those in #43 without having to target dev-master, which I'd really hate to have to do

Changes to static functions

Hi There,
I was using this great script in a function and came looking to see if it had been updated. There seem to be quite a lot of changes. However, the changes to static functions seem to break my scenario.

I am a hobbyist / old-school PHP guy, so any guidance or pointers would be greatly appreciated. The old html2text.php file as a standalone effort was brilliant and worked very well.

I am getting the following errors:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag time invalid in Entity, line: 307 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: ID makeComment already defined in Entity, line: 391 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag section invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 563 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 576 in /----->/src/Html2Text.php on line 40 on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 589 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 601 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 613 in /----->/src/Html2Text.php on line 40

Add option to discard links

I am using your lib to extract text from html pages to be used as search data. The links, especially the urls are not useful. Can you add an option to discard them?

Thanks a lot!

Suggest noting the Apache requirement

On a new server environment, I was getting the error Fatal error: Class 'DOMDocument' not found until an excessive amount of googling led me to try running (on Ubuntu) apt-get install php7.1-xml, after which everything worked.
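`DOMDocument` is provided by PHP's DOM extension, so a start-up check can report the missing dependency more clearly than the fatal error does. This is a minimal sketch; the function name `domExtensionAvailable` and the log message are hypothetical.

```php
<?php
// Hypothetical check: is the DOM extension loaded and DOMDocument usable?
function domExtensionAvailable(): bool
{
    return extension_loaded('dom') && class_exists('DOMDocument');
}

if (!domExtensionAvailable()) {
    // On Ubuntu the report above resolved this by installing the php-xml package.
    error_log('The PHP DOM extension is missing; install the php-xml package for your PHP version.');
}
```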

Class 'Html2Text\Html2Text' not found

If I use html2text in my local project, everything works, but if I load the project into a subfolder (/domain/subfolder) I get the error Uncaught Error: Class 'Html2Text\Html2Text' not found.

Bug with non-breaking spaces in 0.3.0?

Hi,

I've found a very weird case where html2text returned broken output after upgrading to 0.3.0, and I was able to narrow it down to this line:

$html = str_replace("\xa0", " ", $html);

I examined the contents of $html in a hex editor immediately before and after the line and found this diff. The original source snippet reads "für Ihre ", with the spaces being nbsp's, and is UTF-8 encoded.

Before: 66 c3 bc 72 c2 a0 49 68 72 65 c2 a0
After: 66 c3 83 c2 bc 72 c3 82 20 49 68 72 65 c3 82 20

So apparently the nbsp's (c2 a0) have been transformed into c3 82 20, which looks like a regular space (20) but with some gibberish in front of it. Also, the multi-byte character 'ü' (c3 bc) is now c3 83 c2 bc, which is also nonsensical.

I've downgraded to 0.2.3 and all is fine now, but I'd like to let you know in case you'd like to look into this.
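For reference, a non-breaking space in UTF-8 is the two-byte sequence `c2 a0`, so a byte-level replacement has to match both bytes. The sketch below only illustrates that point with the bytes from the report; it is not claimed to be the fix for the behaviour described above.

```php
<?php
// "für Ihre " with UTF-8 non-breaking spaces (bytes c2 a0), as in the report.
$html = "f\xc3\xbcr\xc2\xa0Ihre\xc2\xa0";

// Replace the full two-byte sequence; replacing "\xa0" alone would leave a stray c2 byte.
$clean = str_replace("\xc2\xa0", ' ', $html);
```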

How to convert a html file to text via php command

This tool is great, I follow the README and have the composer environment ready with docker container.

$ cat composer.json

{
  "require": {
    "soundasleep/html2text": "~0.2"
  }
}

$ docker pull composer/composer
$ docker run -v $(pwd):/app composer/composer install
Loading composer repositories with package information
Installing dependencies (including require-dev)
  - Installing soundasleep/html2text (0.2.3)
    Downloading: 100%

Writing lock file
Generating autoload files

$ ls -l 
-rw-r--r--  1 bill  staff        59 26 Nov 17:44 composer.json
drwxr-xr-x  5 bill  staff       170 26 Nov 17:50 vendor
-rw-r--r--  1 bill  staff      2322 26 Nov 17:50 composer.lock

So it seems I have installed the dependency properly. What else do I need to do to convert the HTML file to a text file?

Something like:

$ cat convert.php
<?php
require '/var/www/html/vendor/autoload.php';
$text = Html2Text\Html2Text::convert($html);
?>

$ php convert.php test.html test.txt

Warning: DOMDocument::loadHTML(): Empty string supplied as input in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 43

Fatal error: Uncaught exception 'Html2Text\Html2TextException' with message 'Could not load HTML - badly formed?' in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php:44
Stack trace:
#0 /Users/bill/pdf/convert.php(3): Html2Text\Html2Text::convert(NULL)
#1 {main}
  thrown in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 44

So how to feed the parameter (test.html) and get output file (test.txt) with php command?
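The stack trace shows `convert(NULL)`: the script never reads the file named on the command line, so `$html` is undefined. A minimal sketch of the missing step follows. The helper `convertFile` is hypothetical, and the converter is passed in as a callable so the helper can be tried without the library installed.

```php
<?php
// Hypothetical helper: read HTML from $inputPath, convert it, write the text to $outputPath.
function convertFile(string $inputPath, string $outputPath, callable $converter): void
{
    $html = file_get_contents($inputPath);
    if ($html === false) {
        throw new RuntimeException("Could not read $inputPath");
    }
    file_put_contents($outputPath, $converter($html));
}

// In convert.php, after requiring vendor/autoload.php, this might be called as:
// convertFile($argv[1], $argv[2], ['Html2Text\Html2Text', 'convert']);
```

With that call in place, `php convert.php test.html test.txt` would pass the two file names through `$argv`.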

Problem decoding certain unicode representations

I have some user-provided content in my application. Sometimes users pass content that is unexpected.

Run the following code to recreate the error.

    $content = "lorem:ipsum";
    $textContent = \Soundasleep\Html2Text::convert($content);

Note: the character is not visible here but can be seen when copied into an editor. The character is after the colon ':'.

In the database it is stored as a vertical tab, U+000B (https://www.fileformat.info/info/unicode/char/000b/index.htm), but when I output it and convert it, it fails with the error: DOMDocument::loadHTML(): Invalid char in CDATA 0xB in Entity

Is there a list of unicode characters which are not supported in this way?
Is there another way to encode them in the right way?
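One possible workaround, assuming the control characters carry no meaning in the content, is to strip them before conversion. The helper name `stripControlChars` is hypothetical; the character class covers the C0 control characters except tab, line feed and carriage return.

```php
<?php
// Hypothetical helper: remove C0 control characters except tab (0x09), LF (0x0A) and CR (0x0D).
function stripControlChars(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

// "\x0B" is the vertical tab that triggered the error in the report.
$content = stripControlChars("lorem:\x0Bipsum");
```

The pattern works on bytes, which is safe for UTF-8 input because multi-byte sequences never contain bytes below 0x80.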

Add option to disable warnings during conversion

I have an application in production which relies on user input, which means badly authored HTML, so I get a lot of warning messages like these:

[27-Mar-2016 06:26:01] PHP Warning:  DOMDocument::loadHTML(): ID templateBody already defined in Entity, line: 1340 in /var/.../html2text/src/Html2Text.php on line 44
[27-Mar-2016 06:26:01] PHP Warning:  DOMDocument::loadHTML(): ID templateBody already defined in Entity, line: 1367 in /var/../html2text/src/Html2Text.php on line 44

Is it possible to add an option to dismiss all those warnings without breaking the error handling? I know there's a function to disable libxml2's error handling, but I'm not in the mood to discover I broke something else the hard way.
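The concern about breaking something else can be reduced by restoring the previous libxml setting afterwards. This is a minimal sketch using `libxml_use_internal_errors`; the wrapper name `withLibxmlWarningsSuppressed` is hypothetical and not a library option.

```php
<?php
// Hypothetical wrapper: run $callback with libxml warnings collected internally,
// then restore the previous libxml error-handling setting.
function withLibxmlWarningsSuppressed(callable $callback)
{
    $previous = libxml_use_internal_errors(true);
    try {
        return $callback();
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

$loaded = withLibxmlWarningsSuppressed(function () {
    $doc = new DOMDocument();
    // Duplicate IDs would normally raise "ID ... already defined" warnings.
    return $doc->loadHTML('<div id="a">x</div><div id="a">y</div>');
});
```

Whether wrapping the library's own `convert()` call this way is enough depends on how the library sets libxml state internally, which is not confirmed here.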

Ability to change number of new lines in output

https://github.com/soundasleep/html2text/blob/master/src/Html2Text.php#L298 specifies to output two LF characters per paragraph. Would you accept a PR which allows one to change the number of LF characters that are output?

The use case is similar to the MsoNormal, where no margin exists on the paragraphs. For example:

<p>foo</p><p>&nbsp;</p><p>bar</p><p>quz</p>

Output:

foo

bar
quz

Support <ol>, <ul> and <li>

For a list like this now:

<ol>
<li>foo</li>
<li>bar</li>
<li>baz</li>
</ol>

The text would become

foobarbaz

But I would at least want it to become

foo
bar
baz

But even better:

1. foo
2. bar
3. baz

Similar for <ul>.

Maybe someone already has implemented this?
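The desired output can be sketched with `DOMDocument` and `DOMXPath`. The function `listToText` is hypothetical and independent of the library; it assumes flat (non-nested) lists and uses `*` as the bullet for `<ul>` items.

```php
<?php
// Hypothetical helper: render <ol> items as "1. foo" and <ul> items as "* foo".
function listToText(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $lines = [];
    foreach ($xpath->query('//ol | //ul') as $list) {
        $index = 1;
        // Only direct <li> children of this list are numbered.
        foreach ($xpath->query('./li', $list) as $item) {
            $prefix = $list->nodeName === 'ol' ? $index . '. ' : '* ';
            $lines[] = $prefix . trim($item->textContent);
            $index++;
        }
    }
    return implode("\n", $lines);
}
```

Nested lists would need indentation and separate handling of the inner items, which this sketch does not attempt.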

Generate tabs for table cells

From https://code.google.com/p/iaml/issues/detail?id=284 with patch

fixed two new lines for hN tags

added table cells separation (by tab char)

added new lines for table and tr tags

&nbsp; becomes Ã

As in the heading, &nbsp; becomes Ã

Can easily be fixed with:

$cleaner = str_replace('&nbsp;', ' ', $inHTML);
$outText = convert_html_to_text($cleaner);

Not sure if there are other characters like that which aren't handled correctly. I wish I had time to test more and provide a pull request.

PHP8.1, Deprecated: Optional parameter $prevName declared before required parameter $options

Hi,

Just another deprecation triggered by php8.1…

( ! ) Deprecated: Optional parameter $prevName declared before required parameter $options is implicitly treated as a required parameter in /[...]/soundasleep/html2text/src/Html2Text.php on line 231

Thanks for all

Breaks on Ascii Characters

My HTML contains a registered symbol ® which is breaking the conversion. Any way to fix to allow html entities?

MIT License

The library is great. Any chance you could license it under MIT or BSD-3-Clause so I can use it at work, where they only let us use libraries licensed under MIT or BSD-3-Clause?

tables are converted to a single line

There are some changes in this project which do nice table converts: https://github.com/mtibben/html2text Could you merge these changes?

Title Tag

The title tag content doesn't show up in the content. How can I get that to show up?

Add check for childNodes to avoid warning messages

From https://code.google.com/p/iaml/issues/detail?id=289 with patch

Support RTL

Incorrect operation of the drop_links option

Hi,
When I call
Html2Text::convert("<a href='https://google.ru'></a>", ['drop_links' => true]);
I get the href instead of an empty result.

I think the result should be empty because I use the option 'drop_links' => true.

Namespace better

On your next major release that can introduce breaking changes, you should consider changing the namespace you use. I just tried to set up a script that preferred your DOM-parser based implementation, but falls back to this regex-based implementation if the input could not be parsed. But, you guys are using the exact same namespace and class name.

You should be \SoundAsleep\Html2Text perhaps.

Suppressing multiple brs

In your code you remove 'unnecessary empty lines':

// remove unnecessary empty lines
$output = preg_replace("/\n\n\n*/im", "\n\n", $output);

This actually leads to the removal of intended empty lines (at least in my case). It would be great to have some configuration for things like this. Should I submit a PR for this?

Just to give you an example:

Hello <br><br><br> lets test this <br><br><br> Cheers!

is converted to

Hello

lets test this

Cheers!

But I would expect:

Hello


lets test this


Cheers!
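The kind of configuration asked for could look like the sketch below, where the maximum run of consecutive newlines becomes a parameter. The function `collapseNewlines` and its `$maxNewlines` parameter are hypothetical; the default of 2 mirrors the existing behaviour quoted above.

```php
<?php
// Hypothetical helper: collapse runs of newlines longer than $maxNewlines.
function collapseNewlines(string $text, int $maxNewlines = 2): string
{
    // Match any run of more than $maxNewlines consecutive newlines.
    $pattern = '/\n{' . ($maxNewlines + 1) . ',}/';
    return preg_replace($pattern, str_repeat("\n", $maxNewlines), $text);
}
```

Calling it with `$maxNewlines = 3` keeps up to two blank lines between paragraphs, which matches the expected output in the example.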

Links without text should be discarded

Hi there!

$html = "<a href='http://a.com'></a><a href='http://b.com'></a>";
dd(\Soundasleep\Html2Text::convert($html));

Produces http://a.comhttp://b.com, which produces incorrect HTML if placed through a markdown parser or auto link parser.
I think the output should be one of the following, preferring the ones first mentioned

[](http://a.com)[](http://b.com)
Totally empty
http://a.com http://b.com , additional space after each link
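As an illustration of the second option in the list ("Totally empty"), a link renderer could discard links that have no visible text. The function `renderLink` is hypothetical and not the library's implementation.

```php
<?php
// Hypothetical helper: render a link, discarding it when there is no visible text.
function renderLink(string $href, string $text): string
{
    $text = trim($text);
    if ($text === '') {
        return '';
    }
    return '[' . $text . '](' . $href . ')';
}
```

Returning `'[](' . $href . ')'` in the empty branch instead would give the first option from the list.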

random DOMDocument::loadHTML() error

Hi,
When I call
Html2Text::convert($body);
I randomly get such errors:

DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 51 line 171 code 2 file vendor/soundasleep/html2text/src/Html2Text.php ErrorException: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 52

DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 38

OS version: Ubuntu 18.04.2 LTS
PHP version: php 7.2.19
Symfony version: 2.8.51
soundasleep/html2text version: 1.1.0

Leaves ampersand in href

The remaining href of a link, when it contains ampersands that are correctly encoded as &amp;, will appear with that encoding in the plain text. That should not happen.

5 minutes later: okay, it leaves all encoded entities in the remaining string. Is that expected?

2 minutes later: I will have to look at my code more closely; I see I'm injecting from different sources. It might not be an issue after all...
