html2text's Introduction

html2text is a very simple script that uses DOM methods to convert HTML into a format similar to what would be rendered by a browser - perfect for places where you need a quick text representation. For example:

<html>
<title>Ignored Title</title>
<body>
  <h1>Hello, World!</h1>

  <p>This is some e-mail content.
  Even though it has whitespace and newlines, the e-mail converter
  will handle it correctly.

  <p>Even mismatched tags.</p>

  <div>A div</div>
  <div>Another div</div>
  <div>A div<div>within a div</div></div>

  <a href="http://foo.com">A link</a>

</body>
</html>

Will be converted into:

Hello, World!

This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.

Even mismatched tags.

A div
Another div
A div
within a div

[A link](http://foo.com)

See the original blog post or the related StackOverflow answer.

Installing

You can use Composer to add the package to your project:

{
  "require": {
    "soundasleep/html2text": "~1.1"
  }
}

And then use it quite simply:

$text = \Soundasleep\Html2Text::convert($html);

You can also include the supplied html2text.php and use $text = convert_html_to_text($html); instead.

Options

| Option | Default | Description |
|---|---|---|
| ignore_errors | false | Set to true to ignore any XML parsing errors. |
| drop_links | false | Set to true to not render links as [My Link](http://foo.com), but rather just My Link. |
| char_set | 'auto' | Specify a specific character set. Pass multiple character sets (comma separated) to detect encoding; the default is ASCII,UTF-8. |

Pass along options as a second argument to convert, for example:

$options = array(
  'ignore_errors' => true,
  // other options go here
);
$text = \Soundasleep\Html2Text::convert($html, $options);

Tests

Some very basic tests are provided in the tests/ directory. Run them with composer install && vendor/bin/phpunit.

Troubleshooting

Class 'DOMDocument' not found

You need to install the PHP XML extension for your PHP version. e.g. apt-get install php7.4-xml

License

html2text is licensed under MIT, making it suitable for both Eclipse and GPL projects.

Other versions

Also see html2text_ruby, a Ruby implementation.

html2text's People

Contributors

cyrosy, edgrosvenor, guillaume-ro-fr, jaylinski, laravel-shift, manzoorwanijk, maskas, pethersonmorenosqg, phpfui, sdkiller, soundasleep, stadly, thetaylor82, timothyasp, tommygnr, ulrichsg, vanhoavn

html2text's Issues

Unicode

Unicode text like "Добрый день!" shows up after conversion as "Ð�обÑ�Ñ�й денÑ�!"

Use images alt attribute

Hi, it would be nice if the alt attribute could be used for images. Sometimes converting images that are wrapped inside an a-tag returns empty text:

[ ](http://www.example.com/)

How to convert a html file to text via php command

This tool is great. I followed the README and have the Composer environment ready in a Docker container.

$ cat composer.json

{
  "require": {
    "soundasleep/html2text": "~0.2"
  }
}

$ docker pull composer/composer
$ docker run -v $(pwd):/app composer/composer install
Loading composer repositories with package information
Installing dependencies (including require-dev)
  - Installing soundasleep/html2text (0.2.3)
    Downloading: 100%

Writing lock file
Generating autoload files

$ ls -l 
-rw-r--r--  1 bill  staff        59 26 Nov 17:44 composer.json
drwxr-xr-x  5 bill  staff       170 26 Nov 17:50 vendor
-rw-r--r--  1 bill  staff      2322 26 Nov 17:50 composer.lock

So it seems I have installed the dependency properly. What do I need to do next to convert the HTML file to a text file?

Something like:

$ cat convert.php
<?php
require '/var/www/html/vendor/autoload.php';
$text = Html2Text\Html2Text::convert($html);
?>

$ php convert.php test.html test.txt

Warning: DOMDocument::loadHTML(): Empty string supplied as input in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 43

Fatal error: Uncaught exception 'Html2Text\Html2TextException' with message 'Could not load HTML - badly formed?' in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php:44
Stack trace:
#0 /Users/bill/pdf/convert.php(3): Html2Text\Html2Text::convert(NULL)
#1 {main}
  thrown in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 44

So how do I pass the input file (test.html) and get an output file (test.txt) with the php command?
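The stack trace shows convert() receiving NULL, because $html is never set in convert.php. One way to wire this up, as a minimal sketch: read the input path from $argv, convert the contents, and write the result to the output path. The helper name convert_file and its callable parameter are hypothetical and not part of html2text. The converter is injected so that the sketch runs without the library, with strip_tags as a stand-in.

```php
<?php
// Hypothetical helper: reads an HTML file, converts it with the given
// callable, and writes the plain-text result to $outputPath.
function convert_file(string $inputPath, string $outputPath, callable $converter): void
{
    $html = file_get_contents($inputPath);
    if ($html === false || $html === '') {
        throw new RuntimeException("Could not read any HTML from $inputPath");
    }
    if (file_put_contents($outputPath, $converter($html)) === false) {
        throw new RuntimeException("Could not write to $outputPath");
    }
}

// Command-line entry point: php convert.php test.html test.txt
if (PHP_SAPI === 'cli' && isset($argv[1], $argv[2])) {
    // With the library installed and autoloaded, pass the library's convert
    // method here instead of the strip_tags stand-in.
    convert_file($argv[1], $argv[2], 'strip_tags');
}
```

With the package autoloaded, the stand-in would be replaced by a callable such as [\Soundasleep\Html2Text::class, 'convert'] (the class name from the README) or the Html2Text\Html2Text class that appears in the 0.2.x stack trace above, depending on the installed version.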

Breaks on Ascii Characters

My HTML contains a registered symbol ® which is breaking the conversion. Is there any way to fix this so that HTML entities are allowed?

Namespace better

On your next major release that can introduce breaking changes, you should consider changing the namespace you use. I just tried to set up a script that preferred your DOM-parser based implementation, but falls back to this regex-based implementation if the input could not be parsed. But, you guys are using the exact same namespace and class name.

You should be \SoundAsleep\Html2Text perhaps.

New release

Apologies for the noise, but is there a new release planned? We'd like to make use of the blockquote support.

Suggest noting the Apache requirement

On a new server environment, I was getting the error Fatal error: Class 'DOMDocument' not found until an excessive amount of googling led me to try running (on Ubuntu) apt-get install php7.1-xml, after which everything worked.

Bug with non-breaking spaces in 0.3.0?

Hi,

I've found a very weird case where html2text returned broken output after upgrading to 0.3.0, and I was able to narrow it down to this line:

$html = str_replace("\xa0", " ", $html);

I examined the contents of $html in a hex editor immediately before and after the line and found this diff. The original source snippet reads "für Ihre ", with the spaces being nbsp's, and is UTF-8 encoded.

Before: 66 c3 bc 72 c2 a0 49 68 72 65 c2 a0
After: 66 c3 83 c2 bc 72 c3 82 20 49 68 72 65 c3 82 20

So apparently the nbsp's (c2 a0) have been transformed into c3 82 20, which looks like a regular space (20) but with some gibberish in front of it. Also, the multi-byte character 'ü' (c3 bc) is now c3 83 c2 bc, which is also nonsensical.

I've downgraded to 0.2.3 and all is fine now, but I'd like to let you know in case you'd like to look into this.
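A small sketch of why that line is unsafe for UTF-8 input: in UTF-8 a non-breaking space is the two-byte sequence c2 a0, so replacing only the byte a0 leaves a dangling c2 behind. The "After" bytes above are consistent with such a broken string later being re-encoded from Latin-1 to UTF-8, although that cause is an assumption and is not confirmed in this thread. The variable names are illustrative only.

```php
<?php
// "für Ihre " where both spaces are non-breaking spaces (UTF-8: c2 a0).
$html = "f\xc3\xbcr\xc2\xa0Ihre\xc2\xa0";

// Replacing the single byte a0 leaves a dangling c2 lead byte behind,
// so the result is no longer valid UTF-8.
$broken = str_replace("\xa0", " ", $html);

// Replacing the full two-byte sequence keeps the string valid.
$fixed = str_replace("\xc2\xa0", " ", $html);

// preg_match with the /u modifier only succeeds on valid UTF-8.
var_dump(preg_match('//u', $broken) === 1); // bool(false)
var_dump(preg_match('//u', $fixed) === 1);  // bool(true)
echo bin2hex($fixed), "\n";                 // 66c3bc72204968726520
```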

Leaves ampersand in href

The remaining href of a link, when it contains ampersands that are correctly encoded as &amp; will be in the plaintext. That should not happen.

5 minutes later: okay, it leaves all encoded entities in the remaining string. Is that expected?

2 minutes later: I will have to look at my code a bit more closely; I see I'm injecting content from different sources. It may not be an issue after all...

💀 Dead project?

Hi,

We've been waiting for an upgrade of this package for compatibility with PHP 8 (#88) for a few months, and even though a fix has been proposed in #86 and #87, no response has been received.

In fact, the last commit & the last release date back to Feb 2019, and so does the last comment activity from @soundasleep I could pinpoint from a quick search in the issue tracker, even though you look active on GitHub recently.

Should we consider this project abandoned and fork it, @soundasleep? Or do you need help from fellow maintainers? I'm happy to take over the project, but I don't want to if you may be willing to pursue it at some point.

I hope you take no offense, it's open-source and it's OK if you cannot/don't want to maintain the project anymore. But please let us know! Thank you.

Problem decoding certain unicode representations

I have some user-provided content in my application. Sometimes users pass content that is unexpected.

Run the following code to recreate the error.

    $content = "lorem:ipsum";
    $textContent = \Soundasleep\Html2Text::convert($content);

Note: the character is not visible here but can be seen when copied into an editor. The character is after the colon ':'.

In the database it is stored as a vertical tab, U+000B (https://www.fileformat.info/info/unicode/char/000b/index.htm), but when I output and convert it, it fails with the error: DOMDocument::loadHTML(): Invalid char in CDATA 0xB in Entity

Is there a list of unicode characters which are not supported in this way?
Is there another way to encode them in the right way?
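One possible workaround, sketched under the assumption that these characters carry no meaning for the text output: strip the ASCII control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return) before calling the converter. The helper name strip_control_chars is hypothetical and not part of html2text.

```php
<?php
// Hypothetical helper: drops the ASCII control characters that XML 1.0 does
// not allow, while keeping tab (0x09), line feed (0x0A) and carriage
// return (0x0D).
function strip_control_chars(string $text): string
{
    // All matched bytes are below 0x80, so this is safe on UTF-8 input
    // without the /u modifier.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text) ?? $text;
}

$content = "lorem:\x0Bipsum"; // vertical tab after the colon
echo strip_control_chars($content), "\n"; // lorem:ipsum
```

The cleaned string could then be passed to \Soundasleep\Html2Text::convert() as in the snippet above.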

error for correct url with multiple get params

>>> $html = '<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>'
=> "<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>"

>>> \Soundasleep\Html2Text::convert($html);
PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /Users/maxime/Repos/benevolt-app/vendor/soundasleep/html2text/src/Html2Text.php on line 171
=> "[ok](https://www.google.com?utm_source=croix&utm_campaign=croix)"

>>> $html = '<a href="https://www.google.com?utm_source=croix">ok</a>'
=> "<a href="https://www.google.com?utm_source=croix">ok</a>"

>>> \Soundasleep\Html2Text::convert($html);
=> "[ok](https://www.google.com?utm_source=croix)"
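The warning appears because the bare & in the query string is read as the start of an entity. A possible pre-processing step, as a heuristic sketch rather than anything the library does: escape every & that is not already the start of a character reference. The helper name escape_bare_ampersands is hypothetical.

```php
<?php
// Hypothetical helper: escapes every "&" that does not already start a
// named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) character reference.
function escape_bare_ampersands(string $html): string
{
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $html) ?? $html;
}

$html = '<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>';
echo escape_bare_ampersands($html), "\n";
// <a href="https://www.google.com?utm_source=croix&amp;utm_campaign=croix">ok</a>
```

Ampersands that are already encoded are left alone, so running the helper twice does not double-escape. It cannot tell a real entity from a query parameter that happens to look like one (for example &copy;), which is why it is only a heuristic.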

Support <ol>, <ul> and <li>

For a list like this now:

<ol>
<li>foo</li>
<li>bar</li>
<li>baz</li>
</ol>

The text would become

foobarbaz

But I would at least want it to become

foo
bar
baz

But even better:

1. foo
2. bar
3. baz

Similar for <ul>.

Maybe someone already has implemented this?
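For illustration, the numbered output could be produced with plain DOM calls along these lines. This is a sketch of the requested behaviour, not the library's implementation: render_list is a hypothetical helper, and the "- " prefix for <ul> items is an assumption, since the issue does not specify one.

```php
<?php
// Hypothetical helper: renders the direct <li> children of an <ol> or <ul>
// element as one line per item.
function render_list(DOMElement $list): string
{
    $ordered = strtolower($list->tagName) === 'ol';
    $lines = [];
    $number = 1;
    foreach ($list->childNodes as $child) {
        // Skip whitespace text nodes between the items.
        if ($child instanceof DOMElement && strtolower($child->tagName) === 'li') {
            $prefix = $ordered ? ($number++ . '. ') : '- ';
            $lines[] = $prefix . trim($child->textContent);
        }
    }
    return implode("\n", $lines);
}

$doc = new DOMDocument();
$doc->loadHTML("<ol>\n<li>foo</li>\n<li>bar</li>\n<li>baz</li>\n</ol>");
echo render_list($doc->getElementsByTagName('ol')->item(0)), "\n";
// 1. foo
// 2. bar
// 3. baz
```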

Invalid Entity Errors on HTML5

I'm getting various invalid entity errors when using on what I 'think' are HTML5 pages.

Example:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity,

Example HTML -> pick any page on www.independent.ie

This may be similar/related to the same issues happening here with the DOM document: http://stackoverflow.com/questions/6090667/php-domdocument-errors-warnings-on-html5-tags

This is a fantastic HTML2TEXT converter. A huge thanks and congratulations to the dev. I hope there's a work around better than turning off errors for this issue.

Cheers.

Incorrect operation of the drop_links option

Hi,
When I call
Html2Text::convert("<a href='https://google.ru'></a>", ['drop_links' => true]);
I get the href instead of an empty string.

I think the result should be empty because I use the option 'drop_links' => true.

Add option to discard links

I am using your lib to extract text from html pages to be used as search data. The links, especially the urls are not useful. Can you add an option to discard them?

Thanks a lot!

Changes to static functions

Hi There,
I was using this great script in a function and came looking to see if it had been updated. It seems there have been quite a lot of changes. However, the changes to static functions seem to break my scenario.

I am a hobbyist / old-school PHP guy, so any guidance or pointers would be greatly appreciated. The old html2text.php file as a standalone effort was brilliant and worked very well.

I am getting the following errors:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag time invalid in Entity, line: 307 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: ID makeComment already defined in Entity, line: 391 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag section invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 563 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 576 in /----->/src/Html2Text.php on line 40 on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 589 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 601 in /----->/src/Html2Text.php on line 40

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 613 in /----->/src/Html2Text.php on line 40

No new release for 10 months

Sorry to make this an issue, but it's, well, an issue.

This library's still under fairly active development, but the last release was 10 months ago, on 19 June. Do you think we could get a new release, so that we can get features like those in #43 without having to target dev-master, which I'd really hate to have to do?

Looking for maintainer

Hi everyone! I'm no longer a PHP dev and I've run out of capacity to maintain this project, so I'm looking for some maintainers going forward. Alternatively I can archive the project as read-only.

Ideal criteria:

  • You have at least one project on GitHub
  • You have experience releasing components to Composer

Other than that I'm happy for maintainers to take this project into whatever direction it needs to go! :)

For the future of this project I'd suggest some of the most critical tasks are

  • Move CI from travis-ci to Github Actions
  • Update to work under PHP 8 e.g. #87

Minimal version of PHP ?

Hello, there is no clear information in this repository, nor a link to documentation, stating the minimum PHP version required to run the current version of the html2text library.

With PHP releases moving very fast now, this is becoming very important information. Projects using PHP dependencies currently have to handle many libraries dropping PHP 5.x and early PHP 7.x (at least 7.0). It would be very helpful to share information about this topic for html2text.

(BTW: thanks for the move to MIT ;-) )

Class 'Html2Text\Html2Text' not found

If I use html2text in my local project, everything works, but if I load the project into a subfolder (/domain/subfolder) I get the error Uncaught Error: Class 'Html2Text\Html2Text' not found.

Html special chars

HTML special chars are being replaced: &lt;script&gt;alert(1)&lt;/script&gt; => <script>alert(1)</script>

undefined entities - with fix

It might go into the direction of #35, but it's a little different use case.

I have a mailchimp-email template and they come with weird unsupported entities.

Changing Html2Text.php around line 50 to this fixes it. The text-version of my email looks perfect, it was just throwing ugly exceptions.

$doc = new \DOMDocument();
libxml_use_internal_errors(true);
if (!$doc->loadHTML($html)) {
    throw new Html2TextException("Could not load HTML - badly formed?", $html);
}
libxml_clear_errors();

BTW: GREAT WORK WITH THIS TOOL!

Title Tag

The title tag content doesn't show up in the content. How can I get that to show up?

Links without text should be discarded

Hi there!

$html = "<a href='http://a.com'></a><a href='http://b.com'></a>";
dd(\Soundasleep\Html2Text::convert($html));

Produces http://a.comhttp://b.com, which produces incorrect HTML if placed through a markdown parser or auto link parser.
I think the output should be one of the following, preferring the ones first mentioned

  1. [](http://a.com)[](http://b.com)
  2. Totally empty
  3. http://a.com http://b.com , additional space after each link

PHP 8.2 Support

Looks like PHP 8.2 is producing this error:

Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead.

I tried to fix it as the error suggests in a couple of ways, but the tests are breaking. After my attempted fixes, the tests seem to expect HTML back, but are returning plain text, which is the whole point of the package, so I'm not sure what is going on.

An update for PHP 8.2 would be greatly appreciated. Let me know of anything I can do to help.
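The deprecation message itself points at mb_encode_numericentity. Assuming the deprecated call has the form mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8') (the offending line is not shown here, so this is a guess), a replacement could look like the sketch below. The helper name encode_non_ascii is hypothetical, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: turns every non-ASCII character into a numeric
// character reference, leaving ASCII (including tags) untouched.
function encode_non_ascii(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

echo encode_non_ascii("<p>f\u{00FC}r Ihre</p>"), "\n"; // <p>f&#252;r Ihre</p>
```

This always emits numeric references, so its output may differ in form from what the deprecated call produced, even though an HTML parser should decode both to the same characters. That difference is one thing to check if tests compare intermediate strings.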

&nbsp; becomes Ã

As in the heading, &nbsp; becomes Ã

Can easily be fixed with:

$cleaner = str_replace('&nbsp;', ' ', $inHTML);
$outText = convert_html_to_text($cleaner);

I'm not sure whether any other characters like that aren't handled correctly. I wish I had time to test more and provide a pull request.

MIT License

The library is great. Any chance you could license it under MIT or BSD-3-Clause so I can use it at work, where they only let us use libraries licensed under MIT or BSD-3-Clause?

Using < character as input to html2text

Dear Jevon and Team,
Appreciate your effort in maintaining this library. We just started using this library and noticed a small issue that you may have already addressed.
Our input HTML text contains a valid '<' character as part of the content (not an HTML tag). The library's DOM parser seems to be stripping that out. Is there a way we can escape that character and send it as input to your library?
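A common approach, sketched here with illustrative variable names, is to escape the literal text with PHP's built-in htmlspecialchars before embedding it in markup, so that the parser sees &lt; instead of the start of a tag:

```php
<?php
// Text that should appear literally in the output, including the "<".
$userText = 'Price < 100 & quantity > 5';

// Escape it before embedding it in markup, so the parser sees &lt; rather
// than the start of a tag.
$html = '<p>' . htmlspecialchars($userText, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</p>';

echo $html, "\n"; // <p>Price &lt; 100 &amp; quantity &gt; 5</p>
```

This only works when the literal text can be escaped separately from the surrounding tags. Escaping a whole document would turn its real tags into text as well.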

`img` src not being passed through

Input

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8" ?>
<html>

<body>
    <div dir="ltr"><img src="cid:ii_l388tk2h0" alt="obs-layout.png" width="562" height="237"><br></div><br>
</body>

</html>

Actual output

[obs-layout.png]

Expected output

![obs-layout.png](cid:ii_l388tk2h0)

Possible?

Add option to disable warnings during conversion

I have an application in production which relies on user input, which means badly authored HTML, so I get a lot of warning messages like these:

[27-Mar-2016 06:26:01] PHP Warning:  DOMDocument::loadHTML(): ID templateBody already defined in Entity, line: 1340 in /var/.../html2text/src/Html2Text.php on line 44
[27-Mar-2016 06:26:01] PHP Warning:  DOMDocument::loadHTML(): ID templateBody already defined in Entity, line: 1367 in /var/../html2text/src/Html2Text.php on line 44

Is it possible to add an option to dismiss all those warnings without breaking the error handling? I know there's a function to disable libxml2's error handling, but I'm not in the mood for discovering the hard way that I broke something else.
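The README's ignore_errors option is the documented way to silence these warnings. For code that calls DOMDocument directly, a sketch of scoped suppression is shown below: the previous libxml error mode is saved and restored so that other code is unaffected, and the collected errors are returned rather than discarded. The helper name load_html_quietly is hypothetical.

```php
<?php
// Hypothetical helper: parses HTML without emitting PHP warnings and returns
// [DOMDocument|null, LibXMLError[]].
function load_html_quietly(string $html): array
{
    // Collect libxml errors internally, remembering the previous mode.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);

    // Take a copy of the collected errors, then reset libxml's state.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$ok ? $doc : null, $errors];
}

[$doc, $errors] = load_html_quietly('<div id="a">x</div><div id="a">y</div>');
foreach ($errors as $error) {
    // Each entry is a LibXMLError with message, line and level properties.
    error_log(trim($error->message) . ' on line ' . $error->line);
}
```

Returning the errors lets the caller decide whether to log or ignore them, which keeps a real parse failure distinguishable from cosmetic warnings.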

random DOMDocument::loadHTML() error

Hi,
When I call
Html2Text::convert($body);
I randomly get such errors:

DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 51 line 171 code 2 file vendor/soundasleep/html2text/src/Html2Text.php ErrorException: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 52

DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 38

OS version: Ubuntu 18.04.2 LTS
PHP version: php 7.2.19
Symfony version: 2.8.51
soundasleep/html2text version: 1.1.0

Abusive removal of br nodes leads to incorrect output

Hello !

There is some code doing intentional removal of <br> nodes when they are the last child of a node that also contained text. Here's a very simple example about how this can lead to incorrect results (this is stuff I'm receiving from bad html emails) :

<font size="+1">Vikings: Wolves of Midgard<br></font><font size="+1">Valkyria Chronicles<br>
<br>
World Of Warcraft Battlechest</font>

The expected output would be

Vikings: Wolves of Midgard
Valkyria Chronicles

World Of Warcraft Battlechest

The actual output is:

Vikings: Wolves of MidgardValkyria Chronicles

World Of Warcraft Battlechest

mb_convert_encoding is DEPRECATED in php 8.2

I have PHP 8.2 and I get this error when I call the convert function:
E_DEPRECATED: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in /composer/vendor/soundasleep/html2text/src/Html2Text.php on line 54

Suppressing multiple brs

In your code you remove 'unnecessary empty lines':

// remove unnecessary empty lines
$output = preg_replace("/\n\n\n*/im", "\n\n", $output);

This actually leads to the removal of intended empty lines (at least in my case). It would be great to have some configuration for things like this. Should I submit a PR for this?

Just to give you an example:

Hello <br><br><br> lets test this <br><br><br> Cheers!

is converted to

Hello

lets test this

Cheers!

But I would expect:

Hello


lets test this


Cheers!
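One shape such a configuration could take, as a sketch only: a helper that caps runs of newlines at a configurable length instead of always collapsing them to two. The name collapse_newlines and its $maxNewlines parameter are hypothetical and do not exist in html2text.

```php
<?php
// Hypothetical helper: caps runs of consecutive newlines at $maxNewlines.
// $maxNewlines = 2 gives the same result as the regex quoted above.
function collapse_newlines(string $text, int $maxNewlines = 2): string
{
    if ($maxNewlines < 1) {
        throw new InvalidArgumentException('$maxNewlines must be at least 1');
    }
    // Match any run longer than the cap and shorten it to the cap.
    $pattern = '/\n{' . ($maxNewlines + 1) . ',}/';
    return preg_replace($pattern, str_repeat("\n", $maxNewlines), $text) ?? $text;
}

$text = "Hello\n\n\n\nlets test this";
var_dump(collapse_newlines($text) === "Hello\n\nlets test this");      // bool(true)
var_dump(collapse_newlines($text, 3) === "Hello\n\n\nlets test this"); // bool(true)
```

A cap of 3 newlines keeps two empty lines between paragraphs, which matches the expected output in the example above.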
