I tried looking at the <a href="https://godoc.org/github.com/sergi/go-diff/diffmatchpa

What is expected behavior when input is not valid utf8? about go-diff HOT 6 CLOSED

sergi commented on September 26, 2024

What is expected behavior when input is not valid utf8?

from go-diff.

Comments (6)

zimmski commented on September 26, 2024 1

Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?

I hardly encounter invalid utf8 (usually it is a transformation issue from one encoding to the other for me) but it is certainly the first time I need to handle it in Go. So I will document my findings a little for people who search the net with the same problem.

First of all, the output with the � characters and the 32 bytes length "could be" expected since the characters were substituted by Go while processing the invalid values. The character is the value of unicode/utf8.RuneError (value in bits 1111111111111101 octal 177775 hex FFFD, the character simply occupies 3 bytes in utf8) and is better know as the Unicode replacement character. Which is a substitution for invalid data if you iterate over a string rune/unicode-code-point-wise either with the range expression or with one of the functions in unicode/utf8. This is also defined in the Go spec. And since there is no error return argument for these functions, it is the only way of showing that an error occurred. In this case it is invalid data for each of this invalid code points. DiffMain is just an alias for DiffMainRunes, so we are iterating in the same sense as the range expression over both texts and diffing them. I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError since this is the spec-defined behavior for invalid utf8 unicode code points in Go. So one change for this issue would be to document this like it is documented in the unicode/utf8 package.

Now to the second problem which is the "diff on a byte-level". Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once. However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:

Take two byte slices as parameters
Diff them using the internal code for runes
The outcome would have unicode/utf8.RuneError in it if there are invalid code points
Transform the result back to the original invalid code points using the original two byte slices

This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.

What do you think of these two changes? Do they make sense not just in your use case but in general?

@sergi: Maybe you could take a look at this too.

from go-diff.

zimmski commented on September 26, 2024 1

Since I am currently short on time I will open an issue for the implementation of DiffMainBytes and will just document the behavior for now.

from go-diff.

zimmski commented on September 26, 2024

I tested this on the latest master. The output is still the same.

@shurcooL: I guess what you mean is that the value for "diff text" should be the same as for "input"?

from go-diff.

dmitshur commented on September 26, 2024

I don't know if it should, but that would be one way of resolving the issue of this behavior being undocumented and unclear (I don't know if there's an issue or not here...).

The issue is that I don't know the answer to the following question:

Is that expected behavior?

from go-diff.

dmitshur commented on September 26, 2024

Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?

Sure. But first, let me just mentioned how I expected this issue to be resolved. As you mentioned, invalid UTF-8 is quite rare, and I would argue there's a good chance it's better to make it something that go-diff does not deal with. It can be taken care of by the user in pre-processing, before calling go-diff funcs.

So, if that's the approach you choose to go with, this issue can be resolved simply by documenting that input must be valid UTF-8:

// DiffMain finds the differences between two texts.
// text1, text2 must be valid UTF-8 strings.
func (dmp *DiffMatchPatch) DiffMain(text1, text2 string, checklines bool) []Diff {

from go-diff.

dmitshur commented on September 26, 2024

What do you think of these two changes? Do they make sense not just in your use case but in general?

I'll post my thoughts below.

I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError since this is the spec-defined behavior for invalid utf8 unicode code points in Go.

That makes sense to me.

So one change for this issue would be to document this like it is documented in the unicode/utf8 package.

Can you elaborate on how it'd be documented? I think this would be good, yeah.

Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once.

Makes sense, I agree. I don't think you should try to change that.

However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:

Take two byte slices as parameters

Diff them using the internal code for runes

The outcome would have unicode/utf8.RuneError in it if there are invalid code points

Transform the result back to the original invalid code points using the original two byte slices

This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.

Hmm. I think this sounds good. I'm not sure how easy the "Transform the result back to the original invalid code points using the original two byte slices" step is. But I'm not very familiar with the internal code of this package, so I'll take your word on that doing this would be "rather easy".

I'll mention that I personally don't have a concrete need for this now, so you should only implement this if you think it's worth it for other users/in general.

from go-diff.

What is expected behavior when input is not valid utf8? about go-diff HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent