Comments (6)
Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?
I hardly encounter invalid utf8 (usually it is a transformation issue from one encoding to the other for me) but it is certainly the first time I need to handle it in Go. So I will document my findings a little for people who search the net with the same problem.
First of all, the output with the � characters and the 32 bytes length "could be" expected since the characters were substituted by Go while processing the invalid values. The character is the value of unicode/utf8.RuneError
(value in bits 1111111111111101 octal 177775 hex FFFD, the character simply occupies 3 bytes in utf8) and is better know as the Unicode replacement character. Which is a substitution for invalid data if you iterate over a string rune/unicode-code-point-wise either with the range expression or with one of the functions in unicode/utf8. This is also defined in the Go spec. And since there is no error return argument for these functions, it is the only way of showing that an error occurred. In this case it is invalid data for each of this invalid code points. DiffMain is just an alias for DiffMainRunes, so we are iterating in the same sense as the range expression over both texts and diffing them. I would therefore say, that it is OK and normal, at least in Go, to substitute with unicode/utf8.RuneError
since this is the spec-defined behavior for invalid utf8 unicode code points in Go. So one change for this issue would be to document this like it is documented in the unicode/utf8
package.
Now to the second problem which is the "diff on a byte-level". Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once. However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:
- Take two byte slices as parameters
- Diff them using the internal code for runes
- The outcome would have
unicode/utf8.RuneError
in it if there are invalid code points - Transform the result back to the original invalid code points using the original two byte slices
This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.
What do you think of these two changes? Do they make sense not just in your use case but in general?
@sergi: Maybe you could take a look at this too.
from go-diff.
Since I am currently short on time I will open an issue for the implementation of DiffMainBytes and will just document the behavior for now.
from go-diff.
I tested this on the latest master. The output is still the same.
@shurcooL: I guess what you mean is that the value for "diff text" should be the same as for "input"?
from go-diff.
I don't know if it should, but that would be one way of resolving the issue of this behavior being undocumented and unclear (I don't know if there's an issue or not here...).
The issue is that I don't know the answer to the following question:
Is that expected behavior?
from go-diff.
Let's work together by defining what should be done to resolve this issue, and I will implement it then. Does that sound OK to you?
Sure. But first, let me just mentioned how I expected this issue to be resolved. As you mentioned, invalid UTF-8 is quite rare, and I would argue there's a good chance it's better to make it something that go-diff
does not deal with. It can be taken care of by the user in pre-processing, before calling go-diff
funcs.
So, if that's the approach you choose to go with, this issue can be resolved simply by documenting that input must be valid UTF-8:
// DiffMain finds the differences between two texts.
// text1, text2 must be valid UTF-8 strings.
func (dmp *DiffMatchPatch) DiffMain(text1, text2 string, checklines bool) []Diff {
from go-diff.
What do you think of these two changes? Do they make sense not just in your use case but in general?
I'll post my thoughts below.
I would therefore say, that it is OK and normal, at least in Go, to substitute with
unicode/utf8.RuneError
since this is the spec-defined behavior for invalid utf8 unicode code points in Go.
That makes sense to me.
So one change for this issue would be to document this like it is documented in the unicode/utf8 package.
Can you elaborate on how it'd be documented? I think this would be good, yeah.
Since the whole diff-match-patch internal code is using rune slices we automatically loose the information what original values the invalid code points had. People are expecting that the diffs act on runes i.e. splits do not happen on byte level but on rune level e.g. #27. Too much depends on the rune handling that it would be too tricky and would lead to complex code to handle bytes and runes at once.
Makes sense, I agree. I don't think you should try to change that.
However, I think it would be rather easy to add a new function called "DiffMainBytes" which does the following:
- Take two byte slices as parameters
- Diff them using the internal code for runes
- The outcome would have unicode/utf8.RuneError in it if there are invalid code points
- Transform the result back to the original invalid code points using the original two byte slices
This would make the new DiffMainBytes handle splits on unicode code point while still holding the invalid code points in their original byte form.
Hmm. I think this sounds good. I'm not sure how easy the "Transform the result back to the original invalid code points using the original two byte slices" step is. But I'm not very familiar with the internal code of this package, so I'll take your word on that doing this would be "rather easy".
I'll mention that I personally don't have a concrete need for this now, so you should only implement this if you think it's worth it for other users/in general.
from go-diff.
Related Issues (20)
- runtime error: index out of range while processing diff HOT 2
- How to tell percent different HOT 2
- New line characters removed after applying patch HOT 1
- Go 1.15: conversion from int to string yields a string of one rune, not a string of digits
- DiffPrettyText in windows cmd prints garbage
- diffmatchpatch: invalid diff output with CRLF line endings HOT 5
- Whitespace does not get shown with DiffPrettyText HOT 1
- Index out of range error, would like help investigating HOT 3
- how to only show different part? HOT 2
- Trying to match a long pattern fails HOT 1
- Get diff by words HOT 1
- [BUG] v1.2.0 seems to produce incorrect diff HOT 10
- Could an unified diff be output with go-diff? HOT 2
- panic: runtime error: slice bounds out of range HOT 2
- PatchApply panics with slice bounds out of range
- Is this repo still maintained? Friendly fork? HOT 3
- where are the non pretty options? HOT 4
- DiffPrettyText does does not colour all diffs with newlines
- Munged text leads to incorrect diffs HOT 4
- [BUG] Strange diff bug i can not explain HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from go-diff.