Giter Club home page Giter Club logo

Comments (6)

zimmski avatar zimmski commented on September 26, 2024 1

Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?

I hardly encounter invalid utf8 (usually it is a transformation issue from one encoding to the other for me) but it is certainly the first time I need to handle it in Go. So I will document my findings a little for people who search the net with the same problem.

First of all, the output with the � characters and the 32 bytes length "could be" expected since the characters were substituted by Go while processing the invalid values. The character is the value of unicode/utf8.RuneError (value in bits 1111111111111101 octal 177775 hex FFFD, the character simply occupies 3 bytes in utf8) and is better know as the Unicode replacement character. Which is a substitution for invalid data if you iterate over a string rune/unicode-code-point-wise either with the range expression or with one of the functions in unicode/utf8. This is also defined in the Go spec. And since there is no error return argument for these functions, it is the only way of showing that an error occurred. In this case it is invalid data for each of this invalid code points. DiffMain is just an alias for DiffMainRunes, so we are iterating in the same sense as the range expression over both texts and diffing them. I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError since this is the spec-defined behavior for invalid utf8 unicode code points in Go. So one change for this issue would be to document this like it is documented in the unicode/utf8 package.

Now to the second problem which is the "diff on a byte-level". Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once. However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:

  • Take two byte slices as parameters
  • Diff them using the internal code for runes
  • The outcome would have unicode/utf8.RuneError in it if there are invalid code points
  • Transform the result back to the original invalid code points using the original two byte slices

This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.

What do you think of these two changes? Do they make sense not just in your use case but in general?

@sergi: Maybe you could take a look at this too.

from go-diff.

zimmski avatar zimmski commented on September 26, 2024 1

Since I am currently short on time I will open an issue for the implementation of DiffMainBytes and will just document the behavior for now.

from go-diff.

zimmski avatar zimmski commented on September 26, 2024

I tested this on the latest master. The output is still the same.

@shurcooL: I guess what you mean is that the value for "diff text" should be the same as for "input"?

from go-diff.

dmitshur avatar dmitshur commented on September 26, 2024

I don't know if it should, but that would be one way of resolving the issue of this behavior being undocumented and unclear (I don't know if there's an issue or not here...).

The issue is that I don't know the answer to the following question:

Is that expected behavior?

from go-diff.

dmitshur avatar dmitshur commented on September 26, 2024

Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?

Sure. But first, let me just mentioned how I expected this issue to be resolved. As you mentioned, invalid UTF-8 is quite rare, and I would argue there's a good chance it's better to make it something that go-diff does not deal with. It can be taken care of by the user in pre-processing, before calling go-diff funcs.

So, if that's the approach you choose to go with, this issue can be resolved simply by documenting that input must be valid UTF-8:

// DiffMain finds the differences between two texts.
// text1, text2 must be valid UTF-8 strings.
func (dmp *DiffMatchPatch) DiffMain(text1, text2 string, checklines bool) []Diff {

from go-diff.

dmitshur avatar dmitshur commented on September 26, 2024

What do you think of these two changes? Do they make sense not just in your use case but in general?

I'll post my thoughts below.

I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError since this is the spec-defined behavior for invalid utf8 unicode code points in Go.

That makes sense to me.

So one change for this issue would be to document this like it is documented in the unicode/utf8 package.

Can you elaborate on how it'd be documented? I think this would be good, yeah.

Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once.

Makes sense, I agree. I don't think you should try to change that.

However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:

  • Take two byte slices as parameters
  • Diff them using the internal code for runes
  • The outcome would have unicode/utf8.RuneError in it if there are invalid code points
  • Transform the result back to the original invalid code points using the original two byte slices

This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.

Hmm. I think this sounds good. I'm not sure how easy the "Transform the result back to the original invalid code points using the original two byte slices" step is. But I'm not very familiar with the internal code of this package, so I'll take your word on that doing this would be "rather easy".

I'll mention that I personally don't have a concrete need for this now, so you should only implement this if you think it's worth it for other users/in general.

from go-diff.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.