Hi, awesome library and documentation. The explanations are really nice to read.</

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

levenshtein automata not matching Japanese Characters correctly about fst HOT 15 OPEN

burntsushi commented on August 23, 2024

levenshtein automata not matching Japanese Characters correctly

from fst.

Comments (15)

BurntSushi commented on August 23, 2024

Thanks for the report. This does appear to be a bug.

In the future, please provide a minimal runnable program that reproduces your issue. Here's one I came up with:

extern crate fst;

use fst::{Set, Levenshtein, IntoStreamer};

fn main() {
    let k = "寿司は焦げられない";
    let keys = vec![k];
    let set = Set::from_iter(keys.iter()).unwrap();

    let lev = Levenshtein::new(k, 2).unwrap();
    let stream = set.search(lev).into_stream();
    let keys = stream.into_strs().unwrap();
    println!("{:?}", keys);
}

I don't know the cause of it off the top of my head, but I fear this may be due to a bug in the Levenshtein automaton construction itself. If that's the case, I don't know when it will get fixed.

from fst.

PSeitz commented on August 23, 2024

Hi,

I took a closer look on the issue with some insight, although there are some things I don't completely understand with regard to the states data-structures.

The issue seems to occur when there are two characters with same first byte.
When we alter the orignal example to these:

let keys = vec!["来探"];
let lev = Levenshtein::new("来探", 1).unwrap();
--> fails

let keys = vec!["来食"];
let lev = Levenshtein::new("来食", 1).unwrap();
--> correct

来 bytes [230, 157, 165]
探 bytes [230, 142, 162]
食 bytes [233, 163, 159]

I printed the states from the dfa, but I'm not sure how to read them, are there multiple entry points or does it always start at state 0? What does is_match mean?

Although my understanding is, that following the maching from state 0, I should be able to complete the original byte sequences:

来探 -> 230, 157, 165, 230, 142, 162
来食 -> 230, 157, 165, 233, 163, 159

"来探" Failing state machine
This is the only path from state 0, and the first 142 should be 157 ?
230 -> 142 -> 162 -> 230 -> 142 -> 162
Also, there are states which are not reachable from state 0.

"来食" Correct state machine
There are multiple paths from state 0 and the original byte sequence is part of it.
230 -> 157 -> 165 -> 233 -> 163 -> 159

Btw, if you like the states representation, I can create a pull request for it.

from fst.

BurntSushi commented on August 23, 2024

@PSeitz I think you're on your own here. I haven't looked at that code in a long time, and I'd need to spend a couple of hours loading up my context to help you. (At which point, I would probably just try to go ahead and fix the bug.)

You probably want some background reading:

http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

http://fulmicoton.com/posts/levenshtein/

from fst.

PSeitz commented on August 23, 2024

Thanks for the links. I think I understood the basic concepts. Theses examples although all use characters in their states and not bytes, which makes them a lot easier, because a new character is just one new step.

I don't really understand how a multibyte character is transferred to multiple steps here, and as far as I could tell this seems to be where the issue is.

from fst.

BurntSushi commented on August 23, 2024

@PSeitz Yes, they are encoded as UTF-8 automata. I don't know any well documented place where they are explained, but you can see the utf8-ranges crate docs for a little bit of explanation. (This crate is used in the implementation here.)

from fst.

WarpspeedSCP commented on August 23, 2024

Um, correct me if I'm wrong, but in cases like this, isn't it always better to compare individual bytes rather than "characters", since then it becomes a matter of comparing two numbers?

from fst.

BurntSushi commented on August 23, 2024

@WarpspeedSCP I don't understand your question. Please show examples and tie it back to the specific implementation in this crate.

from fst.

WarpspeedSCP commented on August 23, 2024

What I meant was that comparing groups of bytes at a time as characters wouldn't be as easy as comparing bytes individually in the string to arrive at an answer. For example, you wouldn't just know which of 😎 and 😕 comes first, but if you looked at the bytes that represent them, it is possible to come to a conclusion. If one has fewer bytes it becomes immediately apparent, for example. But I don't know how utf8 works, so I am probably wrong.

from fst.

BurntSushi commented on August 23, 2024

@WarpspeedSCP Sorry, but what you're saying doesn't make any sense to me. I suspect you aren't using the words "bytes" and "character" correctly, but I don't know for sure. Moreover, this talk about "which one coming first" doesn't make any sense to me; what does ordering have to do with this ticket? Please write out an example in more detail.

I'd also like to encourage you to carefully read the comments in this issue. Drive by comments typically aren't helpful, and it honestly kind of feels like that's what your comments here are. For example, a couple comments up I linked to the docs for the utf8-ranges crate as a way of explaining how the UTF-8 automata are constructed. If you read those docs, then the fact that the implementation is in fact byte based should be very clear. And if isn't, then something is wrong with the docs and I'd appreciate it if you pointed out those problems to me.

Making matters more complex is that regardless of the implementation strategy, the concept of character in the abstract sense still matters. In particular, while the Levenshtein automata in this crate are byte oriented, they measure differences between strings in terms of characters. The specific interpretation of character used in this crate is that of a single Unicode scalar value. This is intended as a less than ideal approximation (where a better approximation would be a Unicode grapheme cluster) because it is simpler to implement.

from fst.

WarpspeedSCP commented on August 23, 2024

What I meant by character, was the set of bytes that represent a Unicode code point.

For example,
来 is represented by the set of bytes [230, 157, 165], and
探 is represented by the set of bytes [230, 142, 162], right?

I now understand that your library works on the bytes of the string and not the characters.
So at that time, I had thought that just comparing the bytes directly at each state would be enough to say whether or not two strings were equal. If I were to do this with your library, would this be right?-

A state machine has been constructed for 来 and 探 is queried from it.-

let machine = fst::Levenshtein::new("来", 0).unwrap();
let query = vec!["探"];
let set = fst::Set::from_iter(query.iter()).unwrap();
let stream = set.search(machine).into_stream();
let results = stream.into_strs().unwrap();

Now results wouldn't contain anything since there were no matches, right?

The first byte would match, but the second would not, so the machine would output that there is no match. This was after I had skimmed the docs for this library, as well as the article just once. Is this how the library works?

from fst.

BurntSushi commented on August 23, 2024

comparing the bytes directly at each state would be enough to say whether or not two strings were equal

This ticket isn't about equality. It's about building a Levenshtein automaton and matching keys that are within a certain edit distance of a query. The edit distance is measured in units of Unicode scalar values. The automaton is built such that its transitions are byte based.

This program emits no results, which is correct:

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    let set = Set::from_iter(vec!["探"]).unwrap();
    let query = Levenshtein::new("来", 0).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}

This program also emits no results, but is incorrect. This is presumably the bug:

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    let set = Set::from_iter(vec!["探"]).unwrap();
    let query = Levenshtein::new("来", 1).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}

from fst.

WarpspeedSCP commented on August 23, 2024

Thank you, now I understand why this bug is occurring. In the second case, the search returns only the first byte [230], which is not a valid character, and won't print, right?

from fst.

BurntSushi commented on August 23, 2024

In the second case, the search returns only the first byte [230], which is not a valid character, and won't print, right?

It returns nothing. Returning something that is invalid UTF-8 would be an egregious bug.

from fst.

WarpspeedSCP commented on August 23, 2024

Ok, so it either outputs a valid answer or nothing.

from fst.

PSeitz commented on August 23, 2024

I had another look on the issue, and the problem is the overwrite, where the old branch gets orphaned. That's the reason, why this appears only with same byte characters

This also fits the initial observation, where next state is overwritten from 230 -> 142 to 230 -> 157

fst/fst-levenshtein/src/lib.rs

Lines 291 to 303 in 91ab58a

 fn add_utf8_range( 

 &mut self, 

 overwrite: bool, 

 from: usize, 

 to: usize, 

 range: &Utf8Range, 

 ) { 

 for b in range.start as usize..range.end as usize + 1 { 

 if overwrite || self.dfa.states[from].next[b].is_none() { 

 self.dfa.states[from].next[b] = Some(to); 

 } 

 } 

 }

So I tried with a simple merge of the old states nexts into the new state nexts. This fixes the problem in this case, but I think is generally not allowed, because we allow to much?

Also some cases would require to do this recursively, and that is not feasible performance wise.

    fn add_utf8_range(
        &mut self,
        overwrite: bool,
        from: usize,
        to: usize,
        range: &Utf8Range,
    ) {
        for b in range.start as usize..range.end as usize + 1 {
            if overwrite || self.dfa.states[from].next[b].is_none() {
                if let Some(state) = self.dfa.states[from].next[b] {
                    //merge existing into new state
                    for i in 0..256 {
                        if let Some(si) = self.dfa.states[state].next[i] {
                            self.dfa.states[to].next[i] = Some(si);
                        }
                    }
                }
                self.dfa.states[from].next[b] = Some(to);
            }
        }
    }

from fst.

levenshtein automata not matching Japanese Characters correctly about fst HOT 15 OPEN

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	fn add_utf8_range(
	&mut self,
	overwrite: bool,
	from: usize,
	to: usize,
	range: &Utf8Range,
	) {
	for b in range.start as usize..range.end as usize + 1 {
	if overwrite \|\| self.dfa.states[from].next[b].is_none() {
	self.dfa.states[from].next[b] = Some(to);
	}
	}
	}