
fst's Introduction

fst

This crate provides a fast implementation of ordered sets and maps using finite state machines. In particular, it makes use of finite state transducers to map keys to values as the machine is executed. Using finite state machines as data structures enables us to store keys in a compact format that is also easily searchable. For example, this crate leverages memory maps to make range queries very fast.

Check out my blog post Index 1,600,000,000 Keys with Automata and Rust for extensive background, examples and experiments.


Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/fst

The regex-automata crate provides implementations of the fst::Automaton trait when its transducer feature is enabled. This permits using DFAs compiled by regex-automata to search finite state transducers produced by this crate.

Installation

Simply add a corresponding entry to your Cargo.toml dependency list:

[dependencies]
fst = "0.4"

Example

This example demonstrates building a set in memory and executing a fuzzy query against it. You'll need fst = "0.4" with the levenshtein feature enabled in your Cargo.toml.

use fst::{IntoStreamer, Set};
use fst::automaton::Levenshtein;

fn main() -> Result<(), Box<dyn std::error::Error>> {
  // A convenient way to create sets in memory.
  let keys = vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"];
  let set = Set::from_iter(keys)?;

  // Build our fuzzy query.
  let lev = Levenshtein::new("foo", 1)?;

  // Apply our fuzzy query to the set we built.
  let stream = set.search(lev).into_stream();

  let keys = stream.into_strs()?;
  assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
  Ok(())
}

Check out the documentation for a lot more examples!

Cargo features

  • levenshtein - Disabled by default. This adds the Levenshtein automaton to the automaton sub-module. This includes an additional dependency on utf8-ranges.
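For reference, enabling the feature in Cargo.toml looks like this (assuming fst 0.4):

```toml
[dependencies]
fst = { version = "0.4", features = ["levenshtein"] }
```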

fst's People

Contributors

ahmedcharles, ajrpayne, antonha, atouchet, brson, burntsushi, dtolnay, fulmicoton, gereeter, kerollmops, kevinji, lcnr, manuel-woelker, matklad, nabijaczleweli, nemo157, nicolashahn, pombredanne, proycon, rreverser, upsuper, vorner


fst's Issues

document the FST format

I'm trying to understand this library better. Since every fst can be used with mmap, I know that the build process creates a serialized fst representation, and at read time, this representation is used in a zero-alloc way. After reading the code in the raw crate, I can kind of follow along, but I still don't understand the file format.

Is there an overview of the file format anywhere? Maybe just an example, to start? Thanks!

levenshtein automata not matching Japanese Characters correctly

Hi,
awesome library and documentation. The explanations are really nice to read.

I have an example that should probably match but doesn't. I guess it has something to do with the utf8-ranges matching. (Btw: this means sushi can't be burned :)

let keys = vec!["寿司は焦げられない"];
let set = try!(Set::from_iter(keys.iter()));

let lev = try!(Levenshtein::new("寿司は焦げられない", 2));
let stream = set.search(lev).into_stream();
let keys = try!(stream.into_strs());

Thread panics on read operations of FST set file

I'm seeing a very small percentage of "corrupt" FST set files that are triggering panics in Rust (leading to a Python interpreter segfault). The errors look like:

thread '<unnamed>' panicked at 'index out of bounds: the len is 17498006 but the index is 15336395951936096993', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17

thread '<unnamed>' panicked at 'index out of bounds: the len is 89225255 but the index is 15119944950614189002', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17

thread '<unnamed>' panicked at 'index out of bounds: the len is 16285338 but the index is 3532794445444415790', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17                      

This occurs on approximately 13 out of 4635 files, ranging in size from 20MB to > 100MB. I have not been able to narrow things down past this, but wanted to know what might cause this?

I'm shelling out to the fst-bin crate to combine multiple input files into larger files, then doing set operations on the merged output. The fst binary was built on Rust nightly; I'm not sure of the exact version at the moment.

Please provide a FromIterator impl

fst has all the necessary infrastructure for FromIterator to work; please consider implementing it, so that collect works, and in particular to make it easy to collect into Result<Fst, ...> or Result<Set, ...>.

Thank you!

1.0 release?

This library is four years old and seems stable. What's blocking a 1.0 release?

Fst made of fixed length byte strings: bounded hamming distance query

I have this problem: 3B k/v where the key are fixed length bytes strings either 32, 64 or 128 bits (in a given fst all keys have the same length though).

I need to find all values with a key that is within a bounded (small) bit hamming distance of existing keys.

I was thinking of a simple hamming distance query approach based on "Detecting Near-Duplicates for Web Crawling" at http://gurmeet.net/computer-science/research-papers/

So say I want a hamming distance of 3, and my keys are 32 bits. At query time, I could build a query Levenshtein fst that would allow any key with up to 3 bytes changed (not added/removed) to still match, i.e. it would match all the keys that are within hamming distance 3 or less.

This should work, but is this a proper use of the Levenshtein automaton? E.g. would there be a better construction for hamming distance than Levenshtein?

Since this would be raw bytes, this might not work well with the Unicode support of the Levenshtein automaton?

Would the fact that the first three bytes can be anything (with the last byte still matching) force a full sequential scan of the fst?

@fulmicoton ping too, as you seem to grok Levenshtein automata quite well

key positions in fst

Is it possible to:

  • return the key position from contains? That is instead of bool return Option<usize> where None means not found and Some(pos) is the position of the key in the lexicographic order of keys represented by the automaton
  • return a key by its position. That is implement ops::Index for set, such that myset[pos] returns the key at pos.

If the above is possible then more possibilities open up:

  • map from bytes to bytes: "set" fst + int -> byte map (possibly another "set" fst)
  • map from bytes to ints, where ints are monotonically increasing when bytes are in lexicographic order (this is very common when building search indices): "set" fst + succinct integer encoding

Fuzzy prefix searches?

Wow, what a slick library. I was just evaluating this for use in building custom indices, and I'm really impressed. I think I'm going to get a lot of mileage out of this. :-)

However, I ran into one use case that would be very handy: Fuzzy prefix matching. Given a string Masach, I can search for Levenshtein::new("Masach", 1) or Regex::new("Masach.*"). But what I'd really love to do is autocomplete to Massachusetts, with the extra s.

It seems like it should be possible to build something like:

// Match anything with a prefix that would be matched by A.
impl<A: Automaton> Automaton for Prefix<A> { ... }

But I'm not quite sure how to go about something like that. Is this basically trivial, or is it much harder than it looks?

Add mmap feature to fst-regex Cargo.toml and forward it to its fst dependency

Tantivy depends on both fst and fst-regex.

I am trying to compile tantivy to web assembly.
In order to remove the dependency to memmap, I added a mmap default feature.
The same strategy is used in the fst crate, so I forward the feature to the fst crate.

Unfortunately, fst-regex also depends on the fst crate and does not have any such feature, so tantivy
ends up depending on memmap either way.

Could we add a mmap default feature in the fst-regex crate too?
Is there another solution?

consider adding a tokenizer/scanning routine

@llogiq brought up this use case where one might have a really big set of tokens that one wants to use to scan some text. It turns out that this is pretty easy to support using existing public APIs, but it might be nice to actually make this a proper part of the API. Here is a candidate implementation:

use fst::raw::Fst;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fst = Fst::from_iter_map(vec![
        (" ", 0),
        ("a", 1),
        ("ab", 2),
        ("abc", 3),
        ("bc", 4),
        ("uvwxyz", 5),
    ]).unwrap();

    let mut input = "abc a uvwxyz uvwxyz ab ab a ab abc";
    while !input.is_empty() {
        let (end, id) = match scan(&fst, input.as_bytes()) {
            None => return Err(From::from("unrecognized token")),
            Some((end, id)) => (end, id),
        };
        // don't print whitespace
        if id != 0 {
            println!("{:?} :: {:?}", id, &input[..end]);
        }
        input = &input[end..];
    }

    Ok(())
}

fn scan<D: AsRef<[u8]>>(fst: &Fst<D>, input: &[u8]) -> Option<(usize, u64)> {
    let mut node = fst.root();
    let mut out = fst::raw::Output::zero();
    // If the empty string is a token, then `last_match` probably needs to
    // be initialized in a smarter way.
    let mut last_match: Option<(usize, u64)> = None;
    for (i, &b) in input.iter().enumerate() {
        node = match node.find_input(b) {
            None => return last_match,
            Some(trans_index) => {
                let t = node.transition(trans_index);
                out = out.cat(t.out);
                fst.node(t.addr)
            }
        };
        if node.is_final() {
            last_match = Some((i+1, out.value()));
        }
    }
    last_match
}

And the output:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/fst-longest`
3 :: "abc"
1 :: "a"
5 :: "uvwxyz"
5 :: "uvwxyz"
2 :: "ab"
2 :: "ab"
1 :: "a"
2 :: "ab"
3 :: "abc"

support for adding byte vector as value in map

It'll be very useful to store a byte array as a value.

I'm using fst in Pathivu for searching indexes.

The corresponding posting lists for the indexes are stored in RocksDB.

It'll be very helpful if the fst crate itself can store the posting lists.

I'm willing to take this issue if it is accepted. But, need some mentoring.

Limited Multi-Regex as the keys in the FST

Note: I know I can search with a regex, fst.search(Regex::new(...)), but it doesn't satisfy my usecase.

Note: This suggestion is very similar to multi-regex, but hopefully by only introducing a limited feature set from regex we can keep the state machine small, and therefore really fast. I think an FST with "complex states" would be more efficient than RegexSet; however, you're the expert, so I'd trust you if you said otherwise.

Summary: My usecase basically needs a RegexSet with a few million Regexes, I know it's crazy. But this is why I'd prefer the FST to RegexSet: I'm hoping for better memory efficiency (ideally, improved throughput too).

What are your thoughts around adding limited regex features into the keys of the FST?
For example: A system which parses log files for a 500+ route API

// The URLs in log files have unique ids, and so they need to be mapped back into their original "route pattern" for appropriate metrics aggregations.
// There are simpler ways to solve this problem, like logging the "routing pattern" along with the URL. However, this was the simplest example to describe my usecase.

// Used to map the values in the FST to other things.
let names = vec!["GET_FOO_BAR_ALL", "GET_FOO_BAR_ONE", "PUT_FOO_BAR_ONE"];

let mut build = MapBuilder::new(wtr)?;
build.insert("GET:/foo/.*/bar", 0).unwrap();
build.insert("GET:/foo/.*/bar/.*", 1).unwrap();
build.insert("POST:/foo/.*/bar/.*", 2).unwrap();
...

log_lines.map(line -> route_mapper.search(line))

The only real change would be to create states which can transition back to itself (i.e. the graph becomes kind-of cyclic), but "transitioning back to itself" is equivalent to "the state remains unchanged" (i.e. the graph remains acyclic).

From my minimal understanding, these are the features (this might not be exhaustive) which cause regexes to be significantly less performant than an FST:

  • Any Greedy matcher/state [Don't need]
  • Capture Groups [Don't need]
  • Possibly, alternations (but I really hope not) [Do need]

Edit: The big performance consideration I didn't take into account was that, because states can be shared, any state which can loop back to itself will basically get stuck on the traversal "frontier" and need to be re-checked for each character streaming through the machine. But I suppose this is alright, because it's "pay for what you use"?

Questions regarding non-string use

I was thinking about writing an FST based point-in-poly algorithm. This is partially for fun, partially to test the applicability (and potential performance) of FST solutions to other classic computing problems. I am aware that it's probably not ideal for what I have in mind, but I thought it'd be an interesting test anyway. In thinking about how I would set about doing this I came up with some questions I thought best to get the answers to before continuing. I understand that this library probably wasn't meant to be used for what I have in mind, I'm just trying to gauge how difficult the problem is and how much work I'm setting for myself before I get too deep.

Problem Domain

My problem initially concerns how to store the polygon's information. This might be as 2D coordinates, vectors, or included triangles. Preprocessing the polygon will probably be necessary. I intend to write a query automaton (based on fst::Levenshtein) to answer point-in-poly queries. I understand that I may also need to overload (or partially rewrite) some of the fst::raw::* modules in order to achieve this functionality or to better suit the data structure to storing what I have in mind.

Questions:

  1. Assume that we're storing numeric tuples, representing a coordinate or vector: what is the relative tradeoff between different encoding methods? In the primes example you use big-endian byte order to preserve lexicographic ordering. I could do the same for integer or floating-point values alone, but what about when stored together as a struct? A JSON representation obviates this issue and the lexicographic issue, but probably makes matters worse elsewhere.

  2. How important is lexicographic ordering? Would any other strict-weak-ordering work? How extensively is this assumption encoded in the library?

  3. How important is it that nodes only store 1 byte? Could this requirement be relaxed such that every node stores N-bytes instead? How extensively is this assumption encoded in the library? If it's not possible, would it be sensible for the automaton to implement a look-ahead or backtrack operation?

  4. Assume the coordinates for a polygon are preprocessed and stored such that the FST contains coordinate sequences for monotone, polygon chains. This would mean that a point is inside the polygon if two chains can be found which bound (a linear comparison) the point. Could an automaton be constructed to resolve this query?

  5. Assume the FST is storing JSON encoded coordinate pairs. How would one write an automaton that could be used to retrieve all the strings in the set which represent points within a given linear distance of a provided point?
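On question 1: for unsigned integers, big-endian encoding makes byte-wise order agree with numeric order, and this extends to tuples by concatenation (floats would additionally need sign-bit manipulation). A std-only sketch with a hypothetical encode_point:

```rust
// Big-endian encoding preserves lexicographic order for unsigned integers,
// so a (u32, u32) coordinate can be stored as an 8-byte key whose byte
// order matches the numeric order of (x, then y).
fn encode_point(x: u32, y: u32) -> [u8; 8] {
    let mut key = [0u8; 8];
    key[..4].copy_from_slice(&x.to_be_bytes());
    key[4..].copy_from_slice(&y.to_be_bytes());
    key
}

fn main() {
    // Byte-wise comparison agrees with numeric comparison on (x, y).
    assert!(encode_point(1, 9) < encode_point(2, 0));
    assert!(encode_point(1, 2) < encode_point(1, 3));
}
```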

experiment using fsts for Unicode tables

I'm interested to see how well FSTs would do for storing Unicode tables, particularly for use in the regex-syntax crate. It could make set operations especially efficient and reduce binary size.

In order for this to work, the fst crate itself can't depend on regex-syntax. I think the right answer here would be to create a new sub-crate, called fst-regex, which provides the Automaton impl. That would trim off the current regex-syntax and utf8-ranges deps, and the mmap feature could be disabled, which would leave only byteorder, which is fine.

@fulmicoton As my sole (public) user, would this pose any problems for you? (I don't think it would.)

Expose the number of bytes written so far by the MapBuilder

I would like to be able to know how much has been written to the underlying Writer of the Fst map builder.

My use case is as follows :

I want to read a dictionary that is both large and only accessible via a distributed file system
(think S3, HDFS, etc.).
The filesystem allows random reads, but each read has high latency (~100 ms). Bandwidth is OK (~10MB/s). Downloading the whole thing locally is not an option.

I would rather avoid hashing my dictionary first, because that would make sorted iteration impossible.
Changing the abstraction for the data is not a solution either, because I would still suffer from poor locality. I would like to add items to the dictionary, and when I have written 1MB, close the builder and open a new one...
The index leading to the correct K, V chunk will itself be an fst (or rather a tree of fsts).

Would you accept a pull request?
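Until such an API exists, one workaround is a user-side counting writer. This is a sketch using only std; note that any buffering inside the builder delays when bytes reach the counter:

```rust
use std::io::{self, Write};

// Sketch: wrap the underlying writer in a byte counter, so the caller can
// close the builder and start a new chunk once roughly 1MB has been written.
struct CountingWriter<W> {
    inner: W,
    count: u64,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        CountingWriter { inner, count: 0 }
    }
    fn bytes_written(&self) -> u64 {
        self.count
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.count += n as u64;
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

fn main() -> io::Result<()> {
    let mut wtr = CountingWriter::new(Vec::new());
    wtr.write_all(b"hello")?;
    assert_eq!(wtr.bytes_written(), 5);
    Ok(())
}
```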

Prefer using an Arc<[u8]> instead of an Arc<Vec<u8>>

In a previous comment it seemed you were probably ready to introduce breaking changes to the library.

I found out that an fst (and thus a set and a map) can be created from shared bytes, represented by an Arc of Vec<u8>. I was wondering why you chose this type instead of an Arc<[u8]>.

One of the advantages of an Arc<[u8]> is that it can be used everywhere AsRef<[u8]> is needed; an Arc<Vec<u8>> doesn't implement this trait.

The only advantage I see in this choice is a performance reason that will rarely be triggered: constructing an Arc<Vec<u8>> from an existing Vec is cheap because it does not copy the bytes, while constructing an Arc<[u8]> does.

I will stop bothering you and let you work on more interesting features now :)
I just wanted to make sure to break things while it's possible and clean up the library a little bit more, if that makes sense, of course.
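For illustration, the trait difference in question can be shown with std alone (first_byte is a hypothetical helper):

```rust
use std::sync::Arc;

// Arc<[u8]> satisfies AsRef<[u8]> (via the blanket AsRef<T> impl for
// Arc<T>), while Arc<Vec<u8>> only satisfies AsRef<Vec<u8>>.
fn first_byte<B: AsRef<[u8]>>(bytes: B) -> Option<u8> {
    bytes.as_ref().first().copied()
}

fn main() {
    // From<Vec<u8>> for Arc<[u8]> copies the bytes once.
    let shared: Arc<[u8]> = Arc::from(vec![1u8, 2, 3]);
    assert_eq!(first_byte(shared.clone()), Some(1));
    // A plain Vec works too, but an Arc<Vec<u8>> would not compile here.
    assert_eq!(first_byte(vec![9u8]), Some(9));
}
```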

Map a Streamer with a generic type

I am currently working on an improvement to the fst library and I am facing a problem. I read the Streamer trait explanation and the whole lifetime chapter of the nomicon to try to fix this lifetime problem.

struct StreamWithStateOutput<S>(S);

impl<'a, S, T> Streamer<'a> for StreamWithStateOutput<S>
where
    S: Streamer<'a, Item=(&'a [u8], u64, T)>,
{
    type Item = (&'a [u8], raw::Output, T);

    fn next(&'a mut self) -> Option<Self::Item> {
        self.0.next().map(|(k, v, s)| (k, raw::Output::new(v), s))
    }
}

This is the current mapping struct that I produced to map a Streamer into another Streamer that produces raw::Outputs rather than u64s (like the StreamOutput struct you already define), but I'm facing a lifetime problem (like the one you talk about in the Streamer trait documentation, I suppose).

If I follow the help message that rustc outputs and add a lifetime constraint T: 'a to the StreamWithStateOutput Streamer implementation, the problem comes back later: the OpWithStateBuilder then needs a T: 'static constraint.

So I didn't add this constraint, because it blocks everything above; the code is less flexible.
I just want to understand the problem here; it's hard to work with lifetimes 😢.

I also thought of using IntoStreamer to construct a StreamWithStateOutput, like you said in the Streamer explanations, but I fear it's not possible.

Note that I successfully defined a raw::OpWithStateBuilder without forcing a 'static bound on T by using the same constraints on the push method as the ones on OpWithStateBuilder (not raw); the only thing that differs is the StreamWithStateOutput mapping struct (used to convert a u64 into an Output).

Keys with a value of zero may return wrong output

It seems that keys associated to Output::zero() may return the wrong value; specifically, a zero-value key will emit the output of the longest non-zero-value key within its prefix.

Here is a test for a (nearly) minimal failure case which illustrates the problem:

#[test]
fn suffix_to_zero() {
    use std::collections::BTreeMap;
    let pairs = &[("a", Output::new(1)), ("ab", Output::new(0)), ("abc", Output::new(0))];
    let btree : BTreeMap<&str, Output> = pairs.iter().cloned().collect();
    let mut builder = raw::Builder::memory();
    builder.extend_iter(btree.clone()).unwrap();
    let fst = Fst::from_bytes(builder.into_inner().unwrap()).unwrap();

    let from_btree = btree.get("abc").cloned();  // Output(0), correct
    let from_fst = fst.get("abc");               // Output(1), wrong
    assert!(from_fst == from_btree);
}

Looking up either "ab" or "abc" in the raw FST will return 1 instead of 0, suggesting that their paths are polluted by the shared prefix "a". Presumably this if is the culprit.

I'll see if I can work out a fix later today.

Two lines of input don't work

The commandline

fst map two.csv two.fst

does work, but the following command:

fst range -o two.fst

prints only one single line and the command

fst dot two.fst | dot -Tpng > two.png

produces a graph with three nodes, which looks OK, except there is only a single final node, a wrong output (1) on label 'a', and no output on label 'b'.

Weighted automata

Hi,

I've been experimenting with weighted automata in https://github.com/frankier/fst-extra-aut . My current approach is to wrap weighted NFAs in a trait which acts like an fst crate Automaton. This approach seems to work fairly well apart from one thing: I need the accepting State in order to unwrap them and retrieve the weights. (Currently, in my proof of concept at https://github.com/frankier/finlearnspellcorrect, I rerun the automaton on the final string to get the state out.) My current idea is I could create a new version of search that would also return accepting Automaton states as well as strings and fst/index states, but at the very least this would probably mean needing to parameterise StreamBuilder and Stream.

I was wondering if you had any implementation advice on the best / lowest-impact / most-likely-to-be-accepted-as-a-PR way to achieve this.

Request: looser lifetime for loading from byte slice

Currently, the primary way to load an fst object (Map/Set/Raw) from a byte slice is to use from_static_slice. While that solves the use case of loading data that lives for the lifetime of the program, it appears to be overly-restrictive for other use cases.

For example, I have fst data that I've written into a network buffer:

  • I do not want to give ownership of that buffer to fst, so using from_bytes won't work.
  • The fst data is not 'static, so from_static_slice won't work.
  • The fst data is a subslice of a greater slice (the network buffer), so to make it work with from_shared_bytes I would need to change my internal APIs to use Arc<Vec<u8>> instead of &[u8] (which may add bugs and performance regressions).

While I could try to use the existing mmap API, I believe I would need to create a file-like object that wraps my byte slice, and then memory map it, which is unergonomic and error-prone.

My recommendation is to add a from_slice(bytes: &[u8]) constructor to Map, Set, and Raw. My understanding is that this is unfortunately not trivial, because the FstData type does not take a lifetime parameter, which would be required while it holds a reference to the byte slice.

Nevertheless, is there interest in adding this feature?

Thanks!

streamer range selection is inconsistent

Consider the following code; the second and third asserts do not give the expected results.

// cargo-deps: fst="*"
extern crate fst;

use fst::{IntoStreamer, Streamer, Map};

fn main() {
    let map = Map::from_iter(vec![
        ([0, 0, 1], 1),
        ([0, 1, 1], 2),
        ([1, 2, 1], 3),
        ([1, 4, 1], 4),
        ([2, 4, 1], 5),
    ]).unwrap();
    let mut count = 0;

    let mut stream = map.range().ge(&[1, 2]).le(&[1, 3]).into_stream();
    while let Some((_, _)) = stream.next() {
        count += 1;
    }
    assert_eq!(count, 1);

    count = 0;
    let mut stream = map.range().ge(&[1, 2]).le(&[1, 2]).into_stream();
    while let Some((_, _)) = stream.next() {
        count += 1;
    }
    assert_eq!(count, 1); // assert fails; gives count == 0; why??

    count = 0;
    let mut stream = map.range().ge(&[1]).le(&[1]).into_stream();
    while let Some((_, _)) = stream.next() {
        count += 1;
    }
    assert_eq!(count, 2); // assert fails; gives count == 0; why??
}

Optimization for sparse automatons

Hello

The fst can be queried with a specific key or with an automaton. The key is optimized by just „going forward“, while the automaton is „offered“ one byte at a time to modify the state.

Now, I have a sparse automaton ‒ one that doesn't describe one specific key only, but it has only few branching points and is just single possible next byte in most of the states.

I wonder if it's possible to somehow make the querying faster, given the above. Instead of offering it all the bytes that won't match for sure, just going forward on the straight parts. I could write a code specific to my use case, but is there something that could be done in general?

Maybe extending the automaton with a fn hint(&self) -> Option<&[u8]> (defaulting to None) that would not store the intermediate visited states and just go forward? Would it make it faster? Would it make sense to have something like that as part of fst itself? Or is there a better way? If so, I might try to implement it, but I want to ask before I invest the time.

Thank you

Collation of string sort?

Can you elaborate on what collation order is used by this library? I was loading data into a Set that was retrieved from PostgreSQL, ordered by the en_US collation (see https://www.postgresql.org/docs/current/static/collation.html), and got the OutOfOrder error. My database is configured to use UTF8 encoding and en_US collation.

For example:

example=# select * from (values ('aabbcc.exe'), ('{12345-345324}.sys')) foo order by column1 COLLATE "en_US";
      column1       
--------------------
 {12345-345324}.sys
 aabbcc.exe
(2 rows)

example=# select * from (values ('aabbcc.exe'), ('{12345-345324}.sys')) foo order by column1 COLLATE "C";
      column1       
--------------------
 aabbcc.exe
 {12345-345324}.sys
(2 rows)

It appears that this library uses C collation. Can it check the environment to see what order it should use, or will it always need to use C? If the latter, I can change my Postgres query to specify the collation to sort by, but it would be much preferable if this library could use the collation defined in the environment (i.e. LANG=en_US.UTF-8).
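As a sketch of what this means in practice: fst compares keys as raw bytes (equivalent to memcmp, i.e. the "C" collation) and cannot consult the environment's locale, so input must be sorted that way up front, either with COLLATE "C" in the query or by sorting in Rust, where &str ordering is already byte-wise (c_collate is a hypothetical helper):

```rust
// fst orders keys by raw byte comparison (memcmp), matching the "C"
// collation in the PostgreSQL example above: '{' (0x7B) sorts after
// 'a' (0x61), unlike under en_US.
fn c_collate(mut keys: Vec<&str>) -> Vec<&str> {
    keys.sort(); // &str comparison is byte-wise, the order fst expects
    keys
}

fn main() {
    let sorted = c_collate(vec!["{12345-345324}.sys", "aabbcc.exe"]);
    assert_eq!(sorted, vec!["aabbcc.exe", "{12345-345324}.sys"]);
}
```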

TooManyStates(10000)

I played with the example in the documentation and found that Levenshtein does not accept long strings. Here is an example:

let lev = Levenshtein::new("monomorphization", 3).unwrap();

This line causes a panic:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TooManyStates(10000)', src/libcore/result.rs:1051:5

Any sufficiently long string results in a panic.

Update the levenshtein construction algorithm

I implemented the method described in Fast String Correction with Levenshtein-Automata.
Currently the project is available here: https://github.com/tantivy-search/levenshtein-automata

Building a new automaton requires precomputing a data structure for each type of Levenshtein automaton considered. This precomputation is rather slow, but it is typically amortized after building one automaton. If this is a problem, it is also possible to serialize it and ship it with the library. That's the approach taken by Lucene.

The benefits compared with the current implementation of fst-levenshtein are as follows

  • building a DFA is much faster. For instance, for a distance of 2 and the string "Levenshtein",
    the current algorithm takes 2.5ms while the new implementation takes 0.09ms. Note that clearly 2.5ms is fast enough for most applications, and fst intersection will dominate anyway.

  • the produced DFA is smaller, so we can hope for fewer cache misses. (!?)
    Microbenchmarking this honestly is obviously very difficult... Going through the DFA above
    over the string Leventshtein went from 72 ns to 8 ns. The difference might also be due to
    the interface being easier for the compiler to deal with (states are u32s that I pass by value).
    For a Levenshtein distance of 2 without transposition, the number of states of the DFA (before utf8 conversion) is bounded by 30 × (query len + 1)

  • the DFA outputs the actual distance rather than a boolean. This is useful for scoring.

  • Optionally, transposition can be given a cost of 1 (Damerau-Levenshtein).

  • It addresses #38

The Cons as I see it are:

  • It is yet another crate
  • Code quality is not as good as yours, and bugs might be lurking somewhere.

Should we switch to this implementation, and how do we proceed?

If you decide to go with this implementation, the way I see it is: I should publish the project as a separate crate. It's unfortunate to have two repos for Levenshtein automata, but currently fst_levenshtein cannot be used without depending on fst, which is not ideal either.
fst_levenshtein could depend on this library and wrap its DFA.

An alternative solution, would be to integrate my code into the fst_levenshtein crate...

raw FST returns wrong value for key (with find_longest_prefix)

I use an FST to match word tokens (see #104 for the implementation). The fst in question should map "d" to 1040, "di" to 4487, "distance" to 3292, "distances" to 12103 and "distant" to 6802. However, it returns 3292 for "distant". I have added a number of dbg!() statements to follow the execution. I'll check if I can attach the serialized FST later.

[src/lib.rs:84] out.value() = 0
[src/lib.rs:87] b = 100 'd'
[src/lib.rs:89] fst.node(t.addr) = NODE@53353
  end_addr: 53245
  size: 109 bytes
  state: AnyTrans(StateAnyTrans(85))
  is_final: true
  final_output: Output(0)
  # transitions: 21
  transitions:
    (a, 1114) -> 44815
    (b, 15922) -> 0
    (c, 4847) -> 0
    (d, 19275) -> 0
    (e, 1099) -> 48105
    (f, 27854) -> 48108
    (h, 15439) -> 48133
    (i, 1066) -> 50727
    (j, 5480) -> 50735
    (l, 20429) -> 0
    (m, 21101) -> 50748
    (n, 5024) -> 5595
    (o, 1039) -> 51882
    (q, 24370) -> 0
    (r, 1812) -> 52580
    (s, 15193) -> 52588
    (t, 25678) -> 0
    (u, 1036) -> 53068
    (v, 3926) -> 53077
    (w, 10189) -> 53141
    (y, 4281) -> 53244

[src/lib.rs:91] out.cat(node.final_output()).value() = 0
[src/lib.rs:94] out.value() = 1040
[src/lib.rs:87] b = 105 'i'
[src/lib.rs:89] fst.node(t.addr) = NODE@50727
  end_addr: 50639
  size: 89 bytes
  state: AnyTrans(StateAnyTrans(81))
  is_final: true
  final_output: Output(2381)
  # transitions: 17
  transitions:
    (a, 4217) -> 48322
    (c, 3874) -> 48396
    d -> 48402
    (e, 245) -> 48454
    (f, 261) -> 48594
    (g, 1511) -> 48669
    (l, 12496) -> 48698
    (m, 6683) -> 48759
    (n, 2490) -> 48836
    (o, 3695) -> 48875
    (p, 5935) -> 48930
    (r, 366) -> 49049
    (s, 106) -> 50459
    (t, 11927) -> 12001
    (v, 301) -> 50622
    (x, 9251) -> 50634
    (z, 12743) -> 50638

[src/lib.rs:91] out.cat(node.final_output()).value() = 3421
[src/lib.rs:94] out.value() = 2106
[src/lib.rs:87] b = 115 's'
[src/lib.rs:89] fst.node(t.addr) = NODE@50459
  end_addr: 50368
  size: 92 bytes
  state: AnyTrans(StateAnyTrans(18))
  is_final: false
  final_output: Output(0)
  # transitions: 18
  transitions:
    (a, 3207) -> 49226
    (b, 6320) -> 49244
    (c, 1391) -> 49591
    (d, 22922) -> 17708
    (e, 2083) -> 49602
    (g, 10509) -> 49649
    (h, 7629) -> 49658
    (k, 7573) -> 49666
    (l, 16747) -> 49675
    (m, 5007) -> 49728
    (n, 4161) -> 49739
    (o, 6549) -> 49751
    (p, 2441) -> 49904
    (q, 11997) -> 49922
    (r, 18063) -> 49958
    (s, 6102) -> 50039
    t -> 50367
    (u, 24746) -> 10277

[src/lib.rs:94] out.value() = 2212
[src/lib.rs:87] b = 116 't'
[src/lib.rs:89] fst.node(t.addr) = NODE@50367
  end_addr: 50341
  size: 27 bytes
  state: AnyTrans(StateAnyTrans(5))
  is_final: false
  final_output: Output(0)
  # transitions: 5
  transitions:
    (a, 1080) -> 50063
    (i, 2970) -> 50149
    (o, 16900) -> 50163
    r -> 50308
    (u, 10279) -> 50340

[src/lib.rs:94] out.value() = 2212
[src/lib.rs:87] b = 97 'a'
[src/lib.rs:89] fst.node(t.addr) = NODE@50063
  end_addr: 50054
  size: 10 bytes
  state: AnyTrans(StateAnyTrans(2))
  is_final: false
  final_output: Output(0)
  # transitions: 2
  transitions:
    (l, 26041) -> 0
    n -> 50053

[src/lib.rs:94] out.value() = 3292
[src/lib.rs:87] b = 110 'n'
[src/lib.rs:89] fst.node(t.addr) = NODE@50053
  end_addr: 50044
  size: 10 bytes
  state: AnyTrans(StateAnyTrans(2))
  is_final: false
  final_output: Output(0)
  # transitions: 2
  transitions:
    c -> 50043
    (t, 3510) -> 0

[src/lib.rs:94] out.value() = 3292
[src/lib.rs:87] b = 116 't'
[src/lib.rs:89] fst.node(t.addr) = NODE@0
  end_addr: 0
  size: 0 bytes
  state: EmptyFinal
  is_final: true
  final_output: Output(0)
  # transitions: 0
  transitions:

[src/lib.rs:91] out.cat(node.final_output()).value() = 3292
[src/lib.rs:94] out.value() = 6802
[src/lib.rs:100] last_match = Some(
    (
        7,
        3292,
    ),
)

Set.difference is broken for some cases

It seems that the result of the difference operation on sets is broken for some inputs:

    #[test]
    fn difference_broken() {
        let v = fst_difference(vec![
            vec!["bar", "foo"],
            vec!["baz", "foo"],
        ]);
        assert_eq!(v, vec!["bar"]);
        // is actually: vec!["bar", "foo"]
    }
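For reference, the expected semantics can be checked against the standard library's `BTreeSet`: "foo" appears in both inputs, so it must not survive the difference. A minimal stdlib-only sketch:

```rust
use std::collections::BTreeSet;

// Expected set-difference semantics, shown with BTreeSet for comparison.
fn difference(a: &[&str], b: &[&str]) -> Vec<String> {
    let a: BTreeSet<&str> = a.iter().copied().collect();
    let b: BTreeSet<&str> = b.iter().copied().collect();
    a.difference(&b).map(|k| k.to_string()).collect()
}

fn main() {
    // "foo" is in both sets, so only "bar" remains.
    assert_eq!(difference(&["bar", "foo"], &["baz", "foo"]), vec!["bar"]);
    println!("ok");
}
```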

Dynamic add new elements to Set

Reading the docs, I see that a Set:

Once constructed, a Set can never be modified.

Is there any workaround for this? I'm learning Rust and writing a little search engine for fun; one of the requirements is to add new elements and/or update old ones.
If I have to completely rebuild the set every time just to add an element, it's going to get very slow.

Thanks!
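A common workaround (illustrative only, not part of fst's API) is to keep the large immutable structure alongside a small mutable overlay, and rebuild periodically so the cost is amortized over many inserts. A stdlib sketch, with a sorted `Vec` standing in for the FST:

```rust
use std::collections::BTreeSet;

// Hypothetical staged-updates scheme: lookups consult both the frozen
// structure and the overlay; compact() merges them back together.
struct StagedSet {
    frozen: Vec<String>,       // stands in for the immutable FST
    overlay: BTreeSet<String>, // recent additions
}

impl StagedSet {
    fn contains(&self, key: &str) -> bool {
        self.frozen.binary_search_by(|k| k.as_str().cmp(key)).is_ok()
            || self.overlay.contains(key)
    }

    fn insert(&mut self, key: &str) {
        self.overlay.insert(key.to_string());
    }

    // Merge the overlay into a new frozen structure. With fst this would be
    // rebuilding the Set from the merged, sorted key stream.
    fn compact(&mut self) {
        let mut all: BTreeSet<String> = self.frozen.drain(..).collect();
        all.extend(std::mem::take(&mut self.overlay));
        self.frozen = all.into_iter().collect();
    }
}

fn main() {
    let mut s = StagedSet { frozen: vec!["foo".into()], overlay: BTreeSet::new() };
    s.insert("bar");
    assert!(s.contains("foo") && s.contains("bar"));
    s.compact();
    assert_eq!(s.frozen, vec!["bar".to_string(), "foo".to_string()]);
    println!("ok");
}
```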

Methods to filter fst into new fst based on bytes

For some uses of an FST, before applying a large number of queries to the fst, the caller may be able to prune out states and transitions from the FST based on knowledge of valid keys. For instance, if using an FST to store a dictionary, and performing lookups in that dictionary for a game, the caller may know that only a subset of letters are available, and could start by pre-processing the FST to prune any states or transitions that use unavailable letters.

Please consider providing ways to efficiently process an FST and prune out states and transitions based on inputs. The simplest could be a filter on valid u8 inputs for transitions.
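As a point of comparison, the same effect can be had today at the key level: filter out keys containing disallowed bytes, then rebuild. A true in-place prune would drop transitions inside the FST instead, but the resulting key set is the same. A stdlib-only sketch:

```rust
use std::collections::HashSet;

// Keep only keys whose bytes are all in the allowed set; the survivors
// would then be fed to a fresh builder.
fn prune(keys: &[&str], allowed: &HashSet<u8>) -> Vec<String> {
    keys.iter()
        .filter(|k| k.bytes().all(|b| allowed.contains(&b)))
        .map(|k| k.to_string())
        .collect()
}

fn main() {
    // Only the letters 'f' and 'o' are available.
    let allowed: HashSet<u8> = b"fo".iter().copied().collect();
    let kept = prune(&["fo", "foo", "food", "fob"], &allowed);
    assert_eq!(kept, vec!["fo", "foo"]);
    println!("{:?}", kept);
}
```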

Return the automaton state with the `(key, value)`

In some cases it can be useful to know the automaton state as well when intersecting the fst with an automaton using .search(...).

For instance, a Levenshtein automaton's state holds the actual Levenshtein distance of the current key. That is very valuable information that could be used, for instance, to rank the results.

Use as compressed bitmap?

Can I use FST for efficiently storing bitmaps and doing set operations on them? Like roaring-bitmap?

If so, why and when would this be a good idea?

provide reverse mapping from value to key in some cases

It is sometimes useful to have a String <-> StringOrdinal mapping, where StringOrdinal is the string's position within the set, assuming the strings are sorted.

The fst crate provides a very good String -> StringOrdinal mapping.
Wouldn't it be possible to get the StringOrdinal -> String direction by binary searching for the correct transition at each node?

Querying a 63GB file took a long time

I'm using this library to query a 63GB FST of about 7 billion strings of length 9. With more than enough memory available to completely load/cache the FST, it still took 16 minutes to query only about 15 thousand strings. Then I copied the files to a tmpfs first before doing the exact same thing, and I was able to query more than 300 million strings in only 36 minutes.

Do you have any idea what might be the problem in the first case or how I would go about bisecting the issue?
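One mitigation worth trying before bisecting (an assumption about the cause, namely random page faults on the memory-mapped file, not a verified diagnosis): sort the query batch first. Lexicographically adjacent keys share prefixes, so their lookups walk overlapping regions of the file rather than random pages:

```rust
// Sort queries so consecutive lookups touch nearby parts of the FST.
fn sorted_batch(mut queries: Vec<String>) -> Vec<String> {
    queries.sort_unstable();
    queries
}

fn main() {
    let q = vec!["zzz".to_string(), "aaa".to_string(), "aab".to_string()];
    let batch = sorted_batch(q);
    assert_eq!(batch, ["aaa", "aab", "zzz"]);
    println!("ok");
}
```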

Feature Request: add union and concatenation for keys

It would be nice if this crate offered some way to build up more complex key values before adding them to the FST. My specific use case is mapping byte sequences with embedded utf8 codepoints and byte sequences with an escaped ascii representation of those same codepoints to the same value. This is perfectly possible with the current API.

use fst::Map;
let both_reps_to_same = Map::from_iter(vec![(r"\u0394", 1), ("Δ", 1)]); // keys must be in byte-lexicographic order

The problem is that the size of the set of keys you have to add will grow pretty fast. Handling 2 Deltas would be:

use fst::Map;
let both_reps_to_same = Map::from_iter(vec![(r"\u0394\u0394", 1), (r"\u0394Δ", 1), (r"Δ\u0394", 1), ("ΔΔ", 1)]);

It seems like, even though the final fst would be compact, the work to build it up would be O(2^n). It would be nice to have some API for composing keys in an fst that could avoid this. I'm thinking something like:

let mut m = MapBuilder::new(...)?;
let either_or = Key::alt(vec![b"Δ", b"\u0394"]);
let two_deltas = either_or.concat(either_or.clone());
m.insert_complex_key(two_deltas, 1);

I looked at the code a bit to try to figure out whether there is some internal API that just isn't exposed, but I didn't see an obvious place to hook into. Is this something that would be reasonable for the fst crate to provide? It seems like a RegexSet would probably also work for my use case, but it would be nice to guarantee that the automaton is compiled ahead of time and is deterministic.

I apologize if I missed a part of the existing API that already allows this.

This might be related to #58.
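The combinatorial blow-up described above can be made concrete. `alt`/`concat` are hypothetical helpers, not fst API; today the cross product has to be expanded into explicit keys before insertion:

```rust
// Expand the cross product of alternatives into explicit keys: each
// concat step multiplies the key count, so n positions with 2
// alternatives each yields 2^n keys.
fn concat(prefixes: &[String], alts: &[&str]) -> Vec<String> {
    let mut out = Vec::with_capacity(prefixes.len() * alts.len());
    for p in prefixes {
        for a in alts {
            out.push(format!("{}{}", p, a));
        }
    }
    out
}

fn main() {
    let delta = ["Δ", r"\u0394"];
    let mut keys = vec![String::new()];
    for _ in 0..2 {
        keys = concat(&keys, &delta); // 2 reps of Δ -> 4 keys
    }
    assert_eq!(keys.len(), 4);
    keys.sort(); // builders require byte-lexicographic key order
    println!("{:?}", keys);
}
```

An `insert_complex_key` as proposed would let the builder walk this product lazily instead of materializing it.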

Should fst::Automaton::start be changed to an associated constant?

The documentation claims: "This method should always return the same value for each implementation." That sounds like a prime case for an associated constant. So at some point in the future, when a new release of this crate drops support for pre-1.20 Rust, does it make sense to change it to a constant?

FST walker for forward/backward transitions

For some uses, it would make sense to statefully walk through the FST, stepping forward with input bytes, but also popping off the last input and returning to the previous state.

I'd love to have something with roughly the following interface:

impl FstWalker {
    fn new(fst: &Fst) -> FstWalker;
    fn push(&mut self, input: u8) -> bool; // false if no transition
    fn node(&self) -> Node;
    fn pop(&mut self);
    fn value(&self) -> &[u8];
}

The FstWalker would internally maintain the byte-string pushed so far, which it may want to reserve_exact on based on the longest key in the Fst to avoid needing to grow it. value() would return the current byte-string.

The node() function would allow calling is_final() and/or final_output().

I can build this easily enough based on the interfaces already provided by raw::Fst, but I'm wondering if this looks like something useful enough to provide in the fst crate itself. If I wrote a patch for this, would you be interested?

(I also don't know if this would beat the performance of just recursing on the fst, but I plan to test that.)
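To make the proposed interface concrete, here is a sketch over a toy byte trie (a `Vec` of `HashMap`s stands in for `raw::Fst`, with plain indices as node addresses); the method names follow the proposal above, and `push`/`pop` are just a stack of visited nodes:

```rust
use std::collections::HashMap;

// Toy trie: node -> (byte -> node), plus a finality flag per node.
struct Trie {
    trans: Vec<HashMap<u8, usize>>,
    is_final: Vec<bool>,
}

// Walker keeping the path of visited nodes and the bytes pushed so far.
struct Walker<'a> {
    trie: &'a Trie,
    path: Vec<usize>,
    input: Vec<u8>,
}

impl<'a> Walker<'a> {
    fn new(trie: &'a Trie) -> Walker<'a> {
        Walker { trie, path: vec![0], input: Vec::new() }
    }

    // Follow a transition; false if none exists for this byte.
    fn push(&mut self, b: u8) -> bool {
        let here = *self.path.last().unwrap();
        match self.trie.trans[here].get(&b) {
            Some(&next) => {
                self.path.push(next);
                self.input.push(b);
                true
            }
            None => false,
        }
    }

    // Rewind to the previous state (no-op at the root).
    fn pop(&mut self) {
        if self.path.len() > 1 {
            self.path.pop();
            self.input.pop();
        }
    }

    fn is_final(&self) -> bool {
        self.trie.is_final[*self.path.last().unwrap()]
    }

    fn value(&self) -> &[u8] {
        &self.input
    }
}

fn main() {
    // Trie for {"fo", "fa"}: 0 -f-> 1, 1 -o-> 2, 1 -a-> 3.
    let trie = Trie {
        trans: vec![
            HashMap::from([(b'f', 1)]),
            HashMap::from([(b'o', 2), (b'a', 3)]),
            HashMap::new(),
            HashMap::new(),
        ],
        is_final: vec![false, false, true, true],
    };
    let mut w = Walker::new(&trie);
    assert!(w.push(b'f') && w.push(b'o') && w.is_final());
    w.pop();
    assert!(w.push(b'a') && w.is_final());
    assert_eq!(w.value(), &b"fa"[..]);
    println!("ok");
}
```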

Serialization support?

Is there any plan to support serialization via serde for fst::Map? If not, what is the best (fastest) way to recreate a Map from persistent data?
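As I understand it, a Map is backed by a flat byte buffer, so persistence is plain I/O rather than serde: write the bytes out, then reopen them with `Map::new` or a memory map. A stdlib-only sketch of the round trip, with a stand-in buffer:

```rust
use std::fs;

// Byte-level round trip; with the real crate the buffer would come from
// map.as_fst().as_bytes() and the reload step would be Map::new(reloaded).
fn main() -> std::io::Result<()> {
    let bytes: Vec<u8> = vec![1, 2, 3, 42]; // stand-in for the FST's bytes
    let path = std::env::temp_dir().join("fst_demo.bin");
    fs::write(&path, &bytes)?;
    let reloaded = fs::read(&path)?;
    assert_eq!(reloaded, bytes);
    fs::remove_file(&path)?;
    println!("ok");
    Ok(())
}
```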
