php-htmldiff's Introduction

php-htmldiff

php-htmldiff is a library for comparing two HTML files/snippets and highlighting the differences using simple HTML.

This HTML Diff implementation was forked from rashid2538/php-htmldiff and has been modified with new features, bug fixes, and enhancements to the original code.

For more information on these modifications, read the differences from rashid2538/php-htmldiff or view the CHANGELOG.

Demo

https://php-htmldiff.caxy.com/

Installation

The recommended way to install php-htmldiff is through Composer. Require the caxy/php-htmldiff package by running the following command:

composer require caxy/php-htmldiff

This will resolve the latest stable version.

Otherwise, install the library and set up the autoloader yourself.

Working with Symfony

If you are using Symfony, you can use the caxy/HtmlDiffBundle to make life easy!

Usage

use Caxy\HtmlDiff\HtmlDiff;

$htmlDiff = new HtmlDiff($oldHtml, $newHtml);
$content = $htmlDiff->build();

CSS Example

See https://github.com/caxy/php-htmldiff/blob/master/demo/codes.css for starter CSS you can use for displaying the HTML diff output.

Configuration

The configuration for HtmlDiff is contained in the Caxy\HtmlDiff\HtmlDiffConfig class.

There are two ways to set the configuration:

  1. Configure an Existing HtmlDiff Object
  2. Create and Use a HtmlDiffConfig Object

Configure an Existing HtmlDiff Object

When a new HtmlDiff object is created, it creates a HtmlDiffConfig object with the default configuration. You can change the configuration using setters on the object:

use Caxy\HtmlDiff\HtmlDiff;

// ...

$htmlDiff = new HtmlDiff($oldHtml, $newHtml);

// Set some of the configuration options.
$htmlDiff->getConfig()
    ->setMatchThreshold(80)
    ->setInsertSpaceInReplace(true)
;

// Calculate the differences using the configuration and get the html diff.
$content = $htmlDiff->build();

// ...

Create and Use a HtmlDiffConfig Object

You can also set the configuration by creating an instance of Caxy\HtmlDiff\HtmlDiffConfig and using it when creating a new HtmlDiff object using HtmlDiff::create.

This is useful when creating more than one instance of HtmlDiff:

use Caxy\HtmlDiff\HtmlDiff;
use Caxy\HtmlDiff\HtmlDiffConfig;

// ...

$config = new HtmlDiffConfig();
$config
    ->setMatchThreshold(95)
    ->setInsertSpaceInReplace(true)
;

// Create an HtmlDiff object with the custom configuration.
$firstHtmlDiff = HtmlDiff::create($oldHtml, $newHtml, $config);
$firstContent = $firstHtmlDiff->build();

$secondHtmlDiff = HtmlDiff::create($oldHtml2, $newHtml2, $config);
$secondHtmlDiff->getConfig()->setMatchThreshold(50);

$secondContent = $secondHtmlDiff->build();

// ...

Full Configuration with Defaults:

$config = new HtmlDiffConfig();
$config
    // Percentage required for list items to be considered a match.
    ->setMatchThreshold(80)
    
    // Set the encoding of the text to be diffed.
    ->setEncoding('UTF-8')
    
    // If true, a space will be added between the <del> and <ins> tags of text that was replaced.
    ->setInsertSpaceInReplace(false)
    
    // Option to disable the new Table Diffing feature and treat tables as regular text.
    ->setUseTableDiffing(true)
    
    // Pass an instance of \Doctrine\Common\Cache\Cache to cache the calculated diffs.
    ->setCacheProvider(null)

    // Disable the HTML purifier (only do this if you know what you're doing)
    // This bundle heavily relies on the purified input from ezyang/htmlpurifier
    ->setPurifierEnabled(true)
    
    // Set the cache directory that HTMLPurifier should use.
    ->setPurifierCacheLocation(null)
    
    // Group consecutive deletions and insertions instead of showing a deletion and insertion for each word individually. 
    ->setGroupDiffs(true)
    
    // List of characters to consider part of a single word when in the middle of text.
    ->setSpecialCaseChars(array('.', ',', '(', ')', '\''))
        
    // List of tags (and their replacement strings) to be diffed in isolation.
    ->setIsolatedDiffTags(array(
        'ol'     => '[[REPLACE_ORDERED_LIST]]',
        'ul'     => '[[REPLACE_UNORDERED_LIST]]',
        'sub'    => '[[REPLACE_SUB_SCRIPT]]',
        'sup'    => '[[REPLACE_SUPER_SCRIPT]]',
        'dl'     => '[[REPLACE_DEFINITION_LIST]]',
        'table'  => '[[REPLACE_TABLE]]',
        'strong' => '[[REPLACE_STRONG]]',
        'b'      => '[[REPLACE_B]]',
        'em'     => '[[REPLACE_EM]]',
        'i'      => '[[REPLACE_I]]',
        'a'      => '[[REPLACE_A]]',
    ))
    
    // Sets whether newline characters are kept or removed when `$htmlDiff->build()` is called.
    // For example, if your content includes <pre> tags, you might want to set this to true.
    ->setKeepNewLines(false)
;
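
The cache provider option above could be wired up along these lines. This is a minimal sketch, not official documentation: it assumes doctrine/cache 1.x is installed, where the ArrayCache class implements \Doctrine\Common\Cache\Cache. ArrayCache only keeps entries in memory for the current request, so a persistent provider would be needed to reuse diffs across requests. The input strings are placeholders.

```php
<?php
require_once 'vendor/autoload.php';

use Caxy\HtmlDiff\HtmlDiff;
use Caxy\HtmlDiff\HtmlDiffConfig;
use Doctrine\Common\Cache\ArrayCache;

// Placeholder inputs for illustration.
$oldHtml = '<p>Hello World</p>';
$newHtml = '<p>Hello Every Body!</p>';

$config = new HtmlDiffConfig();
// Assumption: doctrine/cache 1.x, where ArrayCache implements the Cache interface.
$config->setCacheProvider(new ArrayCache());

// Diffs built with this config can store their results in the provider.
$content = HtmlDiff::create($oldHtml, $newHtml, $config)->build();
```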

Contributing

See CONTRIBUTING file.

Contributor Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. See CODE_OF_CONDUCT file.

Credits

Did we miss anyone? If we did, let us know or put in a pull request!

License

php-htmldiff is available under GNU General Public License, version 2. See the LICENSE file for details.

TODO

  • Tests, tests, and more tests! (mostly unit tests) - need more tests before we can do major refactoring / cleanup for a v1 release
  • Add documentation for setting up a cache provider (doctrine cache)
    • Maybe add abstraction layer for cache + adapter for doctrine cache
  • Make HTML Purifier an optional dependency - possibly use abstraction layer for purifiers so alternatives could be used (or none at all for performance)
  • Expose configuration for HTML Purifier (used in table diffing) - currently only cache dir is configurable through HtmlDiffConfig object
  • Performance improvements (we have 1 benchmark test, we should probably get more)
    • Algorithm improvements - trimming alike text at start and ends, store nested diff results in memory to re-use (like we do w/ caching)
    • Benchmark using DOMDocument vs. alternatives vs. string parsing
    • Consider not using string parsing for HtmlDiff in order to avoid having to create many DOMDocument instances in ListDiff and TableDiff
  • Benchmarking
  • Refactoring (but... tests first)
    • Overall design/architecture improvements
    • API improvements so a new HtmlDiff isn't required for each new diff (especially so that configuration can be re-used)
  • Split demo application to separate repository
  • Add documentation on alternative htmldiff engines and perhaps some comparisons

php-htmldiff's People

Contributors

adamcaxy, adamgoose, berarma, bobvandevijver, burgoyn1, danepowell, dbergunder, di-maroo, dubletar, faceleg, gondo, iluuu1994, irkallacz, jerray, jprado, jschroed91, lukeleber, mgersten-caxy, rashid2538, richardbrinkman, savagetiger, snebes

php-htmldiff's Issues

HTML entities are not kept intact, leading to invalid HTML

The diff algorithm does not always keep HTML entity references intact, which can be problematic when later loading the resulting diff as HTML (e.g. when diffing a list entry). This is best illustrated by the following test:

<options></options>

<oldText>
    <ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non justo &amp; sapien;</li></ol>
</oldText>

<newText>
    <ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non sapien et justo;</li></ol>
</newText>

<expected>
    <ol class="diff-list"><li class="normal">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &amp; sapien</del>;</li></ol>
</expected>

This test currently crashes with an error message DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 from ListDiffLines.php:409. This is because it tries to load the string

<body>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &amp</del>;<del class="diffdel"> sapien;</del></body>

You can see that the diff algorithm broke up the &amp from its terminating ;, which leads to invalid HTML.

The solution might be to consider HTML entities a special case in the regex in AbstractDiff.php line 457 (e.g. adding an extra &[a-zA-Z0-9]+; case there). That fixes the given test without breaking the others, but I cannot foresee what further impact that may have.
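
To illustrate the idea, here is a minimal standalone sketch. It is not the regex from AbstractDiff.php; the helper name splitKeepingEntities and the rest of the pattern are assumptions made for this example. It only shows that placing an entity alternative first keeps a reference such as &amp; together as one token.

```php
<?php
// Hypothetical helper, not part of php-htmldiff: splits text into tokens while
// treating an HTML entity reference such as &amp; as one indivisible token.
function splitKeepingEntities(string $text): array
{
    // Alternatives are tried in order: a whole entity, a run of word
    // characters, a run of whitespace, or any other single character.
    preg_match_all('/&[a-zA-Z0-9]+;|\w+|\s+|./su', $text, $matches);

    return $matches[0];
}

$tokens = splitKeepingEntities('justo &amp; sapien;');
// $tokens is ['justo', ' ', '&amp;', ' ', 'sapien', ';']
```

Note that this pattern, like the one suggested above, does not cover numeric references such as &#38; because of the # character.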

Latest release can cause segmentation faults

A project I work on recently upgraded its dependency on caxy/php-htmldiff from 0.1.6 to 0.1.7. After the update, we are seeing segmentation faults when calling php-htmldiff. The fault actually occurs in the C standard library, not in the PHP binary.

The only change in the 0.1.7 release is this PR: #72

This might be a little hard to reproduce, because I'm actually not sure how php-htmldiff is being invoked. It's used by a dependency of a dependency of a dependency of a dependency on our project 😄

But I'd still appreciate any input you might have on what could be causing segfaults with this change. Most likely a problem with the mbstring extension I'd imagine.

Option for retaining Attributes

Hi,

I was testing your tool and it's really useful to me. The only thing that bothers me is the fact that in the diff all attributes like width etc. get stripped. Is there a simple way to put them back in?

Why is my test result always wrong?

$a = "<span>abcd</span>";
$b = 'abcd';
$diffConfig = new HtmlDiffConfig();
$diffConfig->setEncoding('UTF-8');
$diff = HtmlDiff::create($a, $b, $diffConfig);

$diffResult = $diff->build();
dump($diffResult);

The result is <ins class="mod">abcd</span>
But it should be the same. :)

If I change it to the following, then besides the wrong result, the closing tag is lost as well:
$a = '<span>你好</span>';
$b = '你好';
$diffConfig = new HtmlDiffConfig();
$diffConfig->setEncoding('UTF-8');
$diff = HtmlDiff::create($a, $b, $diffConfig);

$diffResult = $diff->build();

<ins class="mod"><ins class="diffmod"></ins>你好</span>

Purifier Config settings not taking effect across all classes (e.g. TableDiff)

It seems the Purifier config is not being passed down to nested diff instances.
When a document is processed, new HtmlDiff instances and instances of other classes such as TableDiff are created. If the user provides an HTMLPurifier_Config when creating the top-level diff, it is not passed down to any of the instances created subsequently.

For example the following code should ensure the entire document is returned instead of just the body content.

$oldHtml = '<html><head><style>h1 {color: red}</style></head><body><h1>Hello World</h1><p>Testing</p></body></html>';
$newHtml = '<html><head><style>h1 {color: red}</style></head><body><h1>Hello Every Body!</h1><p>Testing</p></body></html>';
$diff = HtmlDiff::create($oldHtml, $newHtml);
$config =  \HTMLPurifier_Config::createDefault();
$config->set('Core.ConvertDocumentToFragment', false);
$diff->setHTMLPurifierConfig($config);
$diff->build();

results in the following:

<h1>Hello <del class="diffmod">World</del><ins class="diffmod">Every Body!</ins></h1><p>Testing</p>

When you would expect the html/head tags to be included in the output.

Stepping through this with xdebug you notice that the PurifierConfig value Core.ConvertDocumentToFragment is set to true (the default value)

HTMLPurifier not writable

I've updated php-htmldiff from version 1.0.0 to 1.0.9 because it wasn't able to compare a 5,500-word article within 30 seconds before the PHP timeout kicked in. After updating I got the error message:
/vendor/ezyang/htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer not writable, please chmod to 777

I've added the config settings to a dir with 777 rights like this:

$htmlDiffConfig = new HtmlDiffConfig();
$htmlDiffConfig->setPurifierCacheLocation(Config::getDocRoot().'tempdir');
$htmlDiff = new HtmlDiff('text1 ', 'text2', $htmlDiffConfig);

But that setting doesn't seem to be passed on by HtmlDiff, so there is no way to set the new (temp) directory. Version 1.0.9 also struggled with the large text, so I think downgrading is the only option I have?

Add placeholder for the HTML5 <picture> element.

The <picture> element needs some special treatment, much like the <img> element.

I think we might be able to just wrap a <del> and <ins> around the entire element if anything inside has changed? We can't really inject any new markup into the <picture> due to HTML rules.

Lists being marked as changed when the output is 100% the same.

I've got the following two blocks of HTML being marked as changed, even though I've diffed them with WinMerge and it tells me they're 100% identical.

Block 1:

<h3>References<a id="References" href="#References" name="References" class="heading-permalink ambientimpact-link-has-image" aria-hidden="true" title="Permalink"><span class="ambientimpact-icon ambientimpact-icon--name-link ambientimpact-icon--bundle-libricons ambientimpact-icon--text-hidden ambientimpact-icon--icon-standalone ambientimpact-icon--is-bundle-loaded ambientimpact-icon--icon-standalone-loaded"><svg class="ambientimpact-icon__icon" viewBox="0 0 24 24" width="24" height="24" aria-hidden="true"><use xlink:href="/modules/ambientimpact/ambientimpact_icon/icons/libricons.svg?qq6uy7#icon-link"></use></svg><span class="ambientimpact-icon__text"><span class="ambientimpact-link-has-image__text">Permalink</span></span></span></a></h3>
<div class="references" role="doc-endnotes"><ol><li class="references__list-item" id="reference-conv" role="doc-endnote"><p>Kierney, L. (May 2029). “Bigger Fish To Fry: An Interview With William Lassgard.” <em>forbes.com</em>.&nbsp;<a class="references__backreference-link" rev="footnote" href="#backreference-conv" role="doc-backlink"></a></p></li>
<li class="references__list-item" id="reference-dott" role="doc-endnote"><p>Bridges, C. (August 2012). “Translation of domestication of Thunnus thynnus into an innovative commercial application.” <em>transdott.eu</em>.&nbsp;<a class="references__backreference-link" rev="footnote" href="#backreference-dott" role="doc-backlink"></a></p></li>
<li class="references__list-item" id="reference-12" role="doc-endnote"><p>Åkesson, N. (October 2039). “Leaked correspondence between Xu Shaoyong and William Lassgard paints dramatic picture.” <em>Dagens Nyheter</em>.&nbsp;<a class="references__backreference-link" rev="footnote" href="#backreference-12" role="doc-backlink"></a></p></li></ol></div></div>

Block 2:

<h3>References<a id="References" href="#References" name="References" class="heading-permalink ambientimpact-link-has-image" aria-hidden="true" title="Permalink"><span class="ambientimpact-icon ambientimpact-icon--name-link ambientimpact-icon--bundle-libricons ambientimpact-icon--text-hidden ambientimpact-icon--icon-standalone ambientimpact-icon--is-bundle-loaded ambientimpact-icon--icon-standalone-loaded"><svg class="ambientimpact-icon__icon" viewBox="0 0 24 24" width="24" height="24" aria-hidden="true"><use xlink:href="/modules/ambientimpact/ambientimpact_icon/icons/libricons.svg?qq6uy7#icon-link"></use></svg><span class="ambientimpact-icon__text"><span class="ambientimpact-link-has-image__text">Permalink</span></span></span></a></h3>
<div class="references" role="doc-endnotes"><ol><li class="references__list-item" id="reference-conv" role="doc-endnote"><p>Kierney, L. (May 2029). “Bigger Fish To Fry: An Interview With William Lassgard.” <em>forbes.com</em>.&nbsp;<a class="references__backreference-link" rev="footnote" href="#backreference-conv" role="doc-backlink"></a></p></li>
<li class="references__list-item" id="reference-dott" role="doc-endnote"><p>Bridges, C. (August 2012). “Translation of domestication of Thunnus thynnus into an innovative commercial application.” <em>transdott.eu</em>.&nbsp;<a class="references__backreference-link" rev="footnote" href="#backreference-dott" role="doc-backlink"></a></p></li>
<li class="references__list-item" id="reference-12" role="doc-endnote"><p>Åkesson, N. (October 2039). “Leaked correspondence between Xu Shaoyong and William Lassgard paints dramatic picture.” <em>Dagens Nyheter</em>.&nbsp;<a class="references__backreference-link" rev="footnote" href="#backreference-12" role="doc-backlink"></a></p></li></ol></div></div>

Could it be due to the emoji or the <p> elements? Not sure if the <p> elements are valid nesting, so I'll likely try to remove those, but they're being automatically generated by CommonMark or a Drupal filter.

Spaces near quotes etc.

Looks like added spaces are missing in diff output.

$htmlOld = 'He said:"OK!"';
$htmlNew = 'He said: "OK!"';
$htmlDiff = new \Caxy\HtmlDiff\HtmlDiff($htmlOld, $htmlNew);
echo $htmlDiff->build();

prints
He said:"OK!"
while I expected highlighted space after colon. Using v0.1.14.

Detecting link changes

It seems, neither this fork nor the upstream can detect link changes. For example:

-<a href="http://lorem.com">this part doesn't change</a>
+<a href="http://ipsum.com">this part doesn't change</a>

GitHub's rich text diff handles this case nicely. Styling is not in the scope of this library, but I've captured some screenshots for inspiration. If the href attribute is changed, GitHub dotted-underlines the link and presents a tooltip:

[Screenshot, 2016-02-21 22:25:38]

Tooltip:

[Screenshot, 2016-02-21 22:25:47]

If the link text is also changed, the underline and tooltip are gone:

[Screenshot, 2016-02-21 22:26:10]

Could we have this detection in this library? Thanks in advance.

Possible diff algorithm improvement

I'm using the "Override Demo 5" in the demo and I get this result:

[Screenshot, 2016-04-10 23:34:34]

Another diff app I tried gives this result for the same HTML:

[Screenshot, 2016-04-10 23:34:46]

Note how it handles the first paragraph. I don't know how complex it is to implement but it's a better algorithm for this example HTML. Thanks in advance.

lone image inside a paragraph tag

Adding or removing a lone image inside a paragraph tag when the paragraph above also has a change doesn't show as an addition or a deletion for the image. Tested on http://php-htmldiff.caxy.com/ with this code:

OLD HTML:

<p>this is a test</p>
<p>new test</p>
<p><img src="https://storage.googleapis.com/gweb-uniblog-publish-prod/static/blog/images/google-200x200.7714256da16f.png" alt="" /></p>
<p></p>
<p></p>

NEW HTML:

<p>this is a test</p>
<p></p>
<p></p>

Encoding issues with umlauts

I'm not sure if this repo is still maintained but I'd really appreciate some help. I'm having some encoding issues with umlauts and I cannot for the life of me figure out how to fix this. This is the code I'm running:

$oldValue = 'Änderung';
$newValue = 'Test Änderung';

var_dump($oldValue);
var_dump(mb_detect_encoding($oldValue));
var_dump($newValue);
var_dump(mb_detect_encoding($newValue));

$htmlDiff = new HtmlDiff($oldValue, $newValue);
$result = $htmlDiff->build();

var_dump($result);
var_dump(mb_detect_encoding($result));

//> string(9) "Änderung"
//> string(5) "UTF-8"
//> string(14) "Test Änderung"
//> string(5) "UTF-8"
//> string(74) "test� Test � nderung"
//> string(5) "UTF-8"

For some reason HtmlDiff breaks the encoding. What confuses me is that the example works perfectly on the http://php-htmldiff.caxy.com/ website. Is this not supported or am I doing something wrong?

Different results

Hi!

$old = "текст tekst";
$new = "тест test"; // Every word has only one character changed

v0.1.6

// the result is (php-htmldiff.caxy.com shows the same result)
те<del class="diffdel">к</del>ст // It is detected in cyrillic word
<del class="diffmod">tekst</del><ins class="diffmod">test</ins> // but it isn't detected in latin word

Is it possible to get the same character-level result for Latin words as for Cyrillic words?

v0.1.7

// the result is
<del class="diffmod">текст tekst</del><ins class="diffmod">тест test</ins></span>

// with groupDiff=false
<del class="diffmod">текст</del><ins class="diffmod">тест</ins>
<del class="diffmod">tekst</del><ins class="diffmod">test</ins>

Diff styling for the demo

I suggest adding a red/green styling for the demo diffs. Below is a fancy one:

ins {
    color: #333333;
    background-color: #eaffea;

    text-decoration: none;
}
del {
    color: #AA3333;
    background-color: #ffeaea;

    text-decoration: line-through;
}

Offer diff by characters

For Chinese / Japanese / Korean, there are no spaces between words, so it's friendlier to display differences inline, character by character. For example, I just changed the 50 to 500 in the screenshot below:
[image]

I just want to show the differences by the added 0.
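
As a rough sketch of what character-level tokenization could look like (this is not an existing php-htmldiff option; the helper name splitIntoCharacters and the sample strings are made up for illustration), a UTF-8 string can be split into individual characters before comparison:

```php
<?php
// Hypothetical helper: split a UTF-8 string into individual characters so a
// diff can work per character instead of per whitespace-delimited word.
function splitIntoCharacters(string $text): array
{
    // The u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern splits between code points rather than between bytes.
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$old = splitIntoCharacters('价格50元');
$new = splitIntoCharacters('价格500元');
// $old is ['价', '格', '5', '0', '元']
// $new is ['价', '格', '5', '0', '0', '元']
```

With tokens like these, a sequence diff would only need to report the single inserted 0.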

Major decrease in speed after upgrading from v0.1.5 to v0.1.7

We have an application where we generate a comparison between two files. This file is pretty big and on v0.1.5 this took 21 seconds. After upgrading to v0.1.7 the comparison took 26 minutes.

After looking at v0.1.5...v0.1.7, I suspect the multibyte functions to be the cause of this, but I did not dig any further into this issue. There is probably some inefficient part of the code that causes this insane increase in time, but I do not have enough domain knowledge to pinpoint what's going on.

I'm more than happy to help out on this. Just let me know what I can do.

Config object not properly used when using the HtmlDiffBundle service

Hi,

I had some issues today with the cache folder location of the html purifier. After some digging around it turns out that the constructor in HtmlDiff/HtmlDiff sets the config after the AbstractDiff is created. However the purifier is initialized inside the AbstractDiff so the cache path (from the config object) is not properly set at the time of execution.

Handling HTML Comments

It does not handle comments properly. Ideally, it would be great if there were an option to skip certain tags.
Comments appear like this:
<!-- /wp:paragraph -- class="diffmod">
This breaks the format completely, and the browser renders the above string.
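
Until comments are handled, one possible workaround is to strip them before diffing. The sketch below is an assumption-laden example, not library behavior: stripHtmlComments is a hypothetical helper, and a plain regex like this will also remove conditional comments and anything comment-like inside script blocks.

```php
<?php
// Hypothetical pre-processing step: remove HTML comments (for example
// WordPress block markers such as <!-- wp:paragraph -->) before diffing.
function stripHtmlComments(string $html): string
{
    // The s modifier lets "." span newlines; the lazy ".*?" stops at the
    // first "-->" so separate comments are removed one by one.
    return preg_replace('/<!--.*?-->/s', '', $html);
}

$clean = stripHtmlComments('<!-- wp:paragraph --><p>Hello</p><!-- /wp:paragraph -->');
// $clean is '<p>Hello</p>'
```

The cleaned strings would then be passed to HtmlDiff in place of the raw input.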

Is it possible to skip certain HTML tags?

When I diff a new text with a <!DOCTYPE html ...>, body or head tag and the old text doesn't include these, HtmlDiff will add a class="diffmod" to the tag. Is there any way to avoid/skip these kinds of tags?

Individual Element Styling is Removed

I have an individual problem with the HTMLDiff and it removing styling of lists.

Say I have the following text to compare:

<ol style="list-style-type:upper-latin;">
<li>No public officer or employee having the power or duty to perform an official act related to a contract or transaction which is the subject of an official act or action of the Town shall:
<ol>
<li>Have or thereafter acquire an interest in such contract or transaction unless said contract or transaction resulted from the proper bid process for the Town;&nbsp;</li>
</ol>
</li>
</ol>

Every time it compares the above text, it strips the style of the ol, rendering the lists basically useless. Is there a way I could keep the styling? I need it to maintain the lists in our manuals, and I would like to keep the lists as they were originally, not with the list style stripped out.

Thanks!

Processing time is very long on large diffs

This is a great module. In most cases it works absolutely flawlessly.

However, it's worth noting that processing time becomes very long for big diffs, where the new and old text lengths exceed 65K (in my case it was 128k worth of HTML).

Our workaround was simply to do some pre-optimization to get the processing time down to <30s. (This was for ~9k and ~10k for old/new blocks). We removed some inlined images and did some de-duplication of elements within the blocks of HTML.

~40k of HTML combined (old + new) took around 316 seconds and ~60k didn't complete before 10 minutes.
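
The inlined-image part of that workaround might look roughly like the sketch below. The helper name removeInlineImages and the regex are assumptions, not the code actually used here, and dropping the images means changes to them will no longer show up in the diff.

```php
<?php
// Hypothetical pre-processing step: drop <img> tags whose src is an inlined
// data: URI, since base64 payloads add a lot of text for the diff to process.
function removeInlineImages(string $html): string
{
    // Matches an <img ...> tag whose quoted src value starts with "data:".
    // Images that point at a regular URL are left untouched.
    return preg_replace('/<img\b[^>]*\bsrc\s*=\s*(["\'])data:[^"\']*\1[^>]*>/i', '', $html);
}

$html = '<p>Logo: <img src="data:image/png;base64,iVBORw0KGgo=" alt="logo"></p>';
$reduced = removeInlineImages($html);
// $reduced is '<p>Logo: </p>'
```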

Diff fails on image as a link inside a table

Hello and thank you for this awesome library!
I am reporting an issue tested in https://php-htmldiff.caxy.com/.
The diff doesn't detect the left image deletion in the following 2-cells-table:

OLD

<table>
    <tbody>
        <tr>
            <td><a href="#somewhere"><img src="https://fr.wikipedia.org/static/images/mobile/copyright/wikipedia.png"/></a></td>
            <td><img src="https://fr.wikipedia.org/static/images/mobile/copyright/wikipedia.png"/></td>
        </tr>
    </tbody>
</table>

NEW

<table>
    <tbody>
        <tr>
        <td></td>
        <td><img src="https://fr.wikipedia.org/static/images/mobile/copyright/wikipedia.png"/></td>
        </tr>
    </tbody>
</table>

RESULT

<table>
    <tbody>
        <tr>
            <td rowspan="1" colspan="1">
                <a href="#somewhere" class="diffmod">
                    <img src="https://fr.wikipedia.org/static/images/mobile/copyright/wikipedia.png" alt="wikipedia.png">
                </a>
                <del class="diffmod"> </del>
                <ins class="diffmod"></ins>
            </td>
            <td rowspan="1" colspan="1">
                <img src="https://fr.wikipedia.org/static/images/mobile/copyright/wikipedia.png" alt="wikipedia.png">
            </td>
        </tr>
    </tbody>
</table>
  • If I extract the first image from tag or if I add a space inside this tag, the deletion is detected inside the cell.
  • If I remove the second column in both OLD and NEW, the deletion of the left image link is detected at the level.

I would like the diff to detect the image deletion. Does anybody know how to achieve this?
Thank you in advance,

Bug: INS tag not closed

Hi,

$content = Caxy\HtmlDiff\HtmlDiff::create(
    '<p>Lorem ipsum dolor sit amet</p>',
    '<p>Lorem ipsum dolor <strong>sit</strong> amet</p>'
)->build();
print_r(htmlentities($content));

Result:

<p>Lorem ipsum dolor <del class="diffmod">sit</del><strong class="diffmod"><ins class="mod"><ins class="diffmod">sit</ins></strong> amet</p>

<ins class="mod"> is not closed.

Performance

Hi,

First I would like to thank you for your library. It seems to be the best solution for PHP at this moment. But I have one big problem with it: the HTML blocks I'm comparing have about 1000 words or 8000 chars (with tags). It works, but it takes very long to process, sometimes about 2 minutes. I'm using PHP 5.6.

Are there any possibilities to enhance the performance?

thanks for any hint!

Kind regards, Oliver

del inside ins

Hi,
This is my test:

$a='<b><i>test</i></b>';
$b='<i>test</i>';

$htmlDiffConfig = new \Caxy\HtmlDiff\HtmlDiffConfig();
$htmlDiff = \Caxy\HtmlDiff\HtmlDiff::create($a, $b, $htmlDiffConfig);
echo $htmlDiff->build();

result:

<i class="diffmod">
<ins class="mod">
<del class="diffmod">test</del>
</i>
</b>
<i class="diffmod">
<ins class="mod">
<ins class="diffmod">test</ins>
</i>

missing </ins>.
why is there a </b>?
I can't write a proper css, because there is a del tag inside an ins tag.

Please help!

HTML Tags

Is it possible to retain the html tag, the body tag, and the CSS and JS files? That is, compute the diff only within the body and retain the rest as is?

Thanks.

Diff of bold, italic, underline and strikethrough differs

If you compare <p>This is a text</p>
with the same text in bold <p><strong>This is a text</strong></p>,
the result is <p><del class="diffmod">This is a text</del><strong class="diffmod"><ins class="mod"><ins class="diffmod">This is a text</ins></ins></strong></p>

If you compare <p>This is a text</p>
with the same text in italic <p><em>This is a text</em></p>,
the result is <p><del class="diffmod">This is a text</del><em class="diffmod"><ins class="diffmod">This is a text</ins></em></p>

If you compare <p>This is a text</p>
with the same text underlined <p><u>This is a text</u></p>,
the result is <p><u class="diffmod"><ins class="mod">This is a text</ins></u></p>

If you compare <p>This is a text</p>
with the same text struck through <p><s>This is a text</s></p>,
the result is <p><s class="diffmod"><ins class="mod">This is a text</ins></s></p>

I'm not 100% sure what the correct way is, but it should be the same for all of them. I'm not able to style the diff properly.

no iframes in result

Hi!

I have HTML that contains iframes, for example a YouTube embed code. HtmlDiff filters the iframes out when comparing (the iframes are missing in the result).

Is there a reason for that and can it be avoided?

Cheers

Kamil

Does not work on PHP 7.3

I tried to use it on PHP 7.3 and got: Warning: preg_match(): Compilation failed: invalid range in character class at offset 4

I found that PHP 7.3 uses PCRE2 (http://php.net/manual/en/migration73.other-changes.php#migration73.other-changes.pcre), which causes a problem in the PHP Simple HTML DOM Parser: sunra/php-simple-html-dom-parser#64

The first option is to use a different library, like https://github.com/Kub-AT/php-simple-html-dom-parser

The second option is to wait until sunra/php-simple-html-dom-parser has a new version.
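
For context, this kind of warning typically comes from an unescaped hyphen inside a character class. The pattern below is illustrative only and is not copied from either library: as far as I know, the PCRE2 version bundled with PHP 7.3 rejects a hyphen placed between a shorthand class such as \w and another character, while escaping the hyphen is accepted by both older and newer versions.

```php
<?php
// Illustrative only: a class written as [\w-:] is reported as an invalid
// range by newer PCRE2. Escaping the hyphen (or moving it to the end of the
// class) removes the ambiguity.
$pattern = '/[\w\-:]+/';

preg_match($pattern, 'xlink:href="#icon"', $matches);
// $matches[0] is 'xlink:href'
```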

Warning DOMDocument::loadHTML(): Unexpected end tag : strong

$str1 = '<ul>
<li>li 1</li>
<li><strong>li 2</strong></li>
</ul>';
        $str2 = '<ul>
<li>li 1</li>
<li><strong>li 2</strong></li>
<li>li 4</li>
</ul>';
        $htmlDiff = new HtmlDiff($str1, $str2);
        $str = $htmlDiff->build();
        echo $str;

Code above raises warning DOMDocument::loadHTML(): Unexpected end tag : strong in caxy/php-htmldiff/lib/Caxy/HtmlDiff/ListDiffLines.php at line 409

[Feature] show condensed changes with context

For long html strings with few changes, it would be nice to have an option to show only a compact overview of changed parts.
Similar to unix "grep --context=3" or the github commit change overview.

Slow diff even on small text input

My strings are mostly text (actually markdown) and sometimes contain a few HTML tags.

Suppose I delete a couple of words near the start of a line. On the same line towards the end, I add a couple of words. This is one of the few diff programs I know that highlights changes at word level. That's cool!

But it's slow. It takes 40+ seconds for a small file of about 100 lines and 1500 words.

Anything I'm missing such as setting a config option properly? Thx

Problem with Match in PHP8

The Caxy\HtmlDiff\Match class is not usable because "match" is a reserved keyword in PHP 8.
So, obviously, diffing does not work in PHP 8. It needs refactoring (renaming the class to Matcher at least).

Does not work on PHP 5.3.10

The ListDiffLines.php file uses [ ] to declare arrays, which is not supported in PHP 5.3, so either the PHP version requirement in composer.json will need to be raised above 5.3 or the array declarations will need to be changed.

"As of PHP 5.4 you can also use the short array syntax, which replaces array() with []."

Feature: count of changed chars

Is there a way to have the count of the added/removed chars?

I need to show something like:

300 chars added, 600 chars removed
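
A rough way to get such numbers is to count the text inside the inserted and deleted elements of the generated diff. The sketch below is not a library feature: it assumes the diff output wraps insertions in <ins> and deletions in <del>, that those elements are not nested inside each other, and the function name countChangedChars is hypothetical.

```php
<?php
/**
 * Count the characters found inside <ins> and <del> elements of a diff string.
 * Markup inside those elements is stripped and entities are decoded first.
 */
function countChangedChars(string $diffHtml): array
{
    $counts = ['added' => 0, 'removed' => 0];
    $tags = ['ins' => 'added', 'del' => 'removed'];

    foreach ($tags as $tag => $key) {
        // Non-greedy match; assumes <ins>/<del> elements are not nested.
        preg_match_all('#<' . $tag . '\b[^>]*>(.*?)</' . $tag . '>#is', $diffHtml, $matches);
        foreach ($matches[1] as $fragment) {
            $text = html_entity_decode(strip_tags($fragment), ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Count UTF-8 characters rather than bytes.
            $counts[$key] += preg_match_all('/./su', $text);
        }
    }

    return $counts;
}

// Hand-written sample standing in for the result of $htmlDiff->build().
$sample = '<p>The <del class="diffdel">quick</del><ins class="diffins">slow</ins> fox</p>';
$counts = countChangedChars($sample);

printf("%d chars added, %d chars removed\n", $counts['added'], $counts['removed']);
// prints "4 chars added, 5 chars removed"
```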

How to ignore changes in an attribute?

Specifically, I have links whose href attributes change, but for the purposes of what we're doing, it's irrelevant to the user that will be seeing the diff. If you're wondering exactly why one would need this, we're basically faking a future Wikipedia clone for a narrative project. Here's an example of the diff in the current Drupal 7 site that uses an older diffing, and we're working on porting the site to Drupal 8 with this library for the diffing.
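
One possible workaround, independent of any library option, is to strip the attribute from both inputs before diffing so that it cannot register as a change. The sketch below is regex-based and only an approximation: stripAttribute is a hypothetical helper, and it assumes attribute values do not contain a ">" character.

```php
<?php
/**
 * Remove every occurrence of one attribute from the tags in an HTML string.
 * Text outside of tags is left untouched.
 */
function stripAttribute(string $html, string $attribute): string
{
    $pattern = '/\s+' . preg_quote($attribute, '/') . '\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i';

    return preg_replace_callback(
        '/<[a-z][^>]*>/i', // opening tags only; assumes no ">" inside attribute values
        function (array $m) use ($pattern) {
            return preg_replace($pattern, '', $m[0]);
        },
        $html
    );
}

$old = stripAttribute('<a href="/wiki/Old" class="ref">Mars</a>', 'href');
$new = stripAttribute('<a href="/wiki/New" class="ref">Mars</a>', 'href');

var_dump($old === $new); // bool(true): the href difference is gone before diffing
```

The trade-off is that the diff output no longer carries the href at all, so this fits best where the links in the diff view do not need to be clickable.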

nested/unclosed tags on attribute differences

When comparing identical tags (tag and content), of which one has an attribute that the other doesn't, the produced output presents two opening tags, the identical content, and only one closing tag.
Also, although I am not sure of what the expected behavior is, if both have the attribute, with different values, no difference is detected and the output is the old html.
The following code

<?php

require_once 'vendor/autoload.php';

use Caxy\HtmlDiff\HtmlDiff;

function compare($old, $new) {
    echo (new HtmlDiff($new, $old))->build() . "\n";
}

compare(
    '<p title="A">content</p>',
    '<p>content</p>'
);
compare(
    '<p>content</p>',
    '<p class="A">content</p>'
);
compare(
    '<p title="A">content</p>',
    '<p title="B">content</p>'
);

outputs the following:

<p class="diffmod"><p title="A" class="diffmod">content</p>
<p class="diffmod A"><p class="diffmod">content</p>
<p title="A">content</p>

Some diffs take far too long even with no multi-byte

Hi there. I've been using this on a Drupal project where we need to highlight the differences between the rendered output of two nodes, and it works alright for most of our content, but a few seem to take far longer to generate, and sometimes hit the PHP max execution time limit (30 seconds on the remote server, 120 on the local dev).

I've tried to figure out exactly what might be causing such a big variation in diff times, without success:

  • Made sure that the strings provided for diffing were not triggering the use of PHP multi-byte string functions. (See #57 and #77)

  • Disabled isolated list diffing.

  • Disabled almost all Drupal input filters, especially ones that added extra data and attributes to the output.

My last resort if this can't be resolved is likely to be to do the diffing asynchronously, but I'd prefer to avoid having to implement that if I can. Any advice?

encoding

One of the parameters of the constructor in the AbstractDiff class is $encoding; however, it is not used anywhere.
Is something missing? What is the point of it?
