Giter Club home page Giter Club logo

dawgsharp's Introduction

Build status [NuGet Package] [Get Commercial License]

DawgSharp, a clever string dictionary in C#

DAWG (Directed Acyclic Word Graph) is a data structure for storing and searching large word lists and dictionaries. It can be 40x more efficient than the .NET Dictionary class for certain types of data.

As an example, my website hosts a 2 million word dictionary which used to take up 56 meg on disk and took 7 seconds to load (when using Dictionary and BinarySerializer). After switching to DAWG, it now takes 1.4 meg on disk and 0.3 seconds to load.

How is this possible? Why is the standard Dictionary not as clever as DAWG? The thing is, DAWG works well with natural language strings and may not work as well for generated strings such as license keys (OIN1r4Be2su+UXSeOj0TaQ). Human language words tend to have lots of common letter sequences eg -ility in ability, possibility, agility etc and the algorithm takes advantage of that by finding those sequences and storing them only once for multiple words. DAWG has also proved useful in representing DNA data (sequences of genes). The history of DAWG dates back as far as 1985. For more backgroud, google DAWG or DAFSA (Deterministic Acyclic Finite State Automaton).

DawgSharp is an implementation of DAWG, one of many. What makes it special?

  • It is written in pure C#, compiles to MSIL (AnyCPU) and runs on .NET 3.5 and above.
  • It has no dependencies.
  • It introduces no limitations on characters in keys. Some competing implementations allow only 26 English letters. This implementation handles any Unicode characters.
  • The compaction algorithm visits every node only once which makes it really fast (5 seconds for my 2 million word list).
  • It offers out-of-the-box persistence: call Load/Save to write the data to disk and read it back.
  • It has unit tests (using the Visual Studio testing framework).
  • It has received several man-hours of performance profiling sessions so it's pretty much as fast as it can be. The next step of making it faster would be rewriting the relevant code in IL.
  • It's GC-friendly: the Garbage Collector only sees 3 large arrays of ints where a Dictionary would store millions of strings.

Usage

In this example we will simulate a usage scenario involving two programs, one to generate the dictionary and write it to disk and the other to load that file and use the read-only dictionary for lookups.

First get the code by cloning this repository or installing the NuGet package.

Create and populate a DawgBuilder object:

var words = new [] { "Aaron", "abacus", "abashed" };

var dawgBuilder = new DawgBuilder <bool> (); // <bool> is the value type.
                                             // Key type is always string.
foreach (string key in words)
{
    dawgBuilder.Insert (key, true);
}

(Alternatively, do var dawgBuilder = words.ToDawgBuilder(key => key, _ => true);)

Call BuildDawg on it to get the compressed version and save it to disk:

Dawg<bool> dawg = dawgBuilder.BuildDawg (); 
// Computer is working.  Please wait ...

using (var file = File.Create ("DAWG.bin")) 
    dawg.SaveTo (file);

Now read the file back in and check if a particular word is in the dictionary:

var dawg = Dawg <bool>.Load (File.Open ("DAWG.bin"));

if (dawg ["chihuahua"])
{
    Console.WriteLine ("Word is found.");
}

The Value Type, <TPayload>

The Dawg and DawgBuilder classes take a template parameter called <TPayload>. It can be any type you want. Just to be able to test if a word is in the dictionary, a bool is enough. You can also make it an int or a string or a custom class. But beware of one important limitation. DAWG works well only when the set of values that TPayload can take is relatively small. The smaller the better. Eg if you add a definition for each word, it will make each entry unique and your graph will become a tree (which may not be too bad!).

MatchPrefix()

One other attractive side of DAWG is its ability to efficiently retrieve all words starting with a particular substring:

dawg.MatchPrefix("awe")

The above query will return an IEnumerable<KeyValuePair> which might contain keys such as awe, aweful and awesome. The call dawg.MatchPrefix("") will return all items in the dictionary.

If you need to look up by suffix instead, there is no MatchSuffix method. But the desired effect can be achieved by adding the reversed keys and then using MatchPrefix() on the reversed keys:

dawgBuilder.Insert("ability".Reverse(), true);
...
dawg.MatchPrefix("ility".Reverse())

GetPrefixes()

GetPrefixes() returns all dictionary items whose keys are substrings of a given string. For example:

dawg.GetPrefixes("awesomenesses")

Might return keys such as awe, awesome, awesomeness and finally awesomenesses.

GetLongestCommonPrefixLength()

One other neat feature is the method int GetLongestCommonPrefixLength(IEnumerable<char> word). If word is found in the dictionary, it will return its length; if not, it will return the length of the longest word that is found in the dictionary and that is also the beginning of the given word. For example, if prepare is in the dictionary but preempt is not, then dawg.GetLongestCommonPrefixLength("preempt") will return 3 which is the length of "pre".

Thread Safety

The DawgBuilder class is not thread-safe and must be accessed by only one thread at any particular time.

The Dawg class is immutable and thus thread-safe.

MultiDawg

The MultiDawg class can store multiple values agaist a single string key in a very memory-efficient manner.

Future plans

More usage scenarios

The API was designed to fit a particular usage scenario (see above) and can be extended to support other scenarios eg being able to add new words to the dictionary after it's been compacted. I just didn't need this so it's not implemented. You won't get any exceptions. There is just no Insert method on the Dawg class.

Better API

Implement the IDictionary interface on both DawgBuilder and Dawg (#5).

Literature

Competing Implementations

License

DawgSharp is licensed under GPLv3 which means it can be used free of charge in open-sources projects. Read the full license

If you would like to use DawgSharp in a proprietary project, please purchase a commercial license at http://morpher.co.uk.

dawgsharp's People

Contributors

bamby084 avatar bzaar avatar feodorfitsner avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

dawgsharp's Issues

OutOfMemoryException

I am trying this simple piece of code, for benchmarking purposes, but I get an OutOfMemoryException when reaching around 350K elements inserted.

DawgBuilder dawgBuilder = new DawgBuilder();
for (int i = 0; i < 10000000; i++)
{
dawgBuilder.Insert(Guid.NewGuid().ToString(), true);
}

How can i get a random word ?

I want to pick a random word from dictionary, however there is no such method that can help with that. As you developed this amazing library so can you please guide how'd you approach such requirement?
Thanks

Optimize the compaction algorithm for speed

This will involve two passes. In pass 1, it computes the hashes for all nodes. A hash for a node cosists of the payload has plus the hashes of all children's chars plus the hashes of all child nodes, i.e. it is recursive.
In pass two it uses the hashes to compare and merge the nodes effectively.
Both oasses are bottom up, so each node is visited only twice.

Possible error with your sample code

I'm trying to get this to work in VS2008 so I've had to tweak the code a bit (to replace the => syntax with normal properties). Now it compiles; but when I try your simple Usage example from the GitHub project page, the Dawg.Save call fails:
"DawgSharp.Dawg does not contain a definition for Save..."

And indeed, the only Dawg.Save method is one with private accessibility and two parameters:
private void Save (Stream stream, Action <OldDawg, BinaryWriter> save)

Looks like I should use SaveTo instead (a public method) -- so does your sample code need fixing?

Dawg could use less memory

The Node class has a children field which is a Dictionary (a hash table essentially). This is not particularly memory efficient. It could do with a simple sorted array (Node []) and use BinarySearch for lookups.

Consider MIT or LGPL please?

Hello,

I'm thinking about recommending this to be used in an upcoming project, but I can't use GPL code. Could you release this as possibly LGPL? I'd be happy to maybe help contribute to the project if I find problems with the code (if this DAWG implementation gets selected), but the licensing thing is a showstopper.

Great library, by the way! Thanks for considering!

Dawg.MatchSuffix feature?

Hi there,

This amazing library already has MatchPrefix which covers half of the cases. Can you please either provide MatchSuffix
or advise how to implement this?

Thanks

Use shorts instead of ints if the number of nodes is < 32K

My main use case involved 2.5M words that were represented using 150K nodes. But other users might have < 32K nodes in which case they may benefit from a 2x memory footprint cut (as if we haven't done enough!) It might be worth taking this number up to 64K-1 as we only need one special value.

If we are doing this, we may also want to think of users who might have more than 2G nodes and add a ulong version. Unlikely, but future-ready. Add a long version of GetNodeCount ().

I.e. add another template parameter to YaleDawg<TPayload>, TIndex which can be byte, ushort, uint or ulong and instantiate the required specialization based on the total number of nodes.

EndOfStreamException: Unable to read beyond the end of the stream.

I am trying to store a List in the value field. I am using the SaveTo overload which excepts a custom action to serialize the type. The file is successfully written. When I try to read in the file using the overload of the Load method I get the exception:

System.IO.EndOfStreamException: 'Unable to read beyond the end of the stream.'

I know the List payload is successfully deserialized in the Func as I can break before returning and see the List containing the expected data.

I am able to reproduce the exception in Visual Studio 16.6.0 Preview 2 and LinqPad 6 on Windows 10 machine.

private static void Main(string[] args)
        {
            var dawgBuilder = new DawgBuilder<List<string>>(); 
            dawgBuilder.Insert("test", new List<string> { "test", "hello", "world" });
            var dawg = dawgBuilder.BuildDawg(); 

            using (var file = File.Create(@"E:\Data\Dawg\DAWG-Test.bin"))
                dawg.SaveTo(file, new Action<BinaryWriter, List<string>>((r, payload) =>
                {
                    byte[] bytes = null;
                    BinaryFormatter bf = new BinaryFormatter();
                    using (MemoryStream ms = new MemoryStream())
                    {
                        bf.Serialize(ms, payload);
                        bytes = ms.ToArray();
                    }
                    r.Write(bytes, 0, bytes.Length);
                }));

            dawg = Dawg<List<string>>.Load(File.Open(@"E:\Data\Dawg\DAWG-Test.bin", FileMode.Open), new Func<BinaryReader, List<string>>(r =>
            {
                List<string> result = null;
                using (var ms = new MemoryStream())
                {
                    r.BaseStream.CopyTo(ms);
                    BinaryFormatter bf = new BinaryFormatter();
                    ms.Seek(0, SeekOrigin.Begin);
                    result = (List<string>)bf.Deserialize(ms);
                }
                return result;
            }));
        }

Dawg.MatchSuffix with max Count

Hi there,

Currently MathSuffx returns all the words which maynot be ideal in some situation so please provide a maxCount optional argument which for example if i say 10 then if then i get only maximum 10 words.

Thanks

How do i use it or anagram solving

Hi there,

Great library, and im considering it for commercial user however there are few things i need to clear.
How can i use it for solving anagram ? Lets suppose i have random letters like a,c,x,y and what i want to see how many words can be formed with only these 4 letters what function i use?

Thanks

Unexpected behaviour of DawgBuilder.TryGetValue()

DawgBuilder.TryGetValue() might return true even the queried key has never been added. E.g.:

var builder = new DawgBuilder<string>();
builder.Insert("dates", "dates");
bool b = builder.TryGetValue("date", out var v);
Assert.Null(v);  // Okay
Assert.True(b);  // that's unexpected

I'd expect b to be false here because "date" was not added and no value were obtained by TryGetValue()

Make DawgBuilder.Insert thread-safe

... and verify that this actually speeds up things on a typical 4-core system. Excessive locking might negate all the benefits of parallelism. Try locking at Node and DawgBuilder levels and see what difference it makes.

Error when loading bin file

Hello,

I am receiving this error when trying to load a Dawg bin file
unable to read beyond the end of the stream

Here is how I am creating the dawg file:

string path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "/EN_words.txt";
var s = File.ReadAllLines(path);

var dawgBuilder = new DawgSharp.DawgBuilder<bool>();
foreach (var line in s)
{
   dawgBuilder.Insert(line, true);
}

var dawgpath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "/EN_dawg.bin";
var dawg = dawgBuilder.BuildDawg();
dawg.SaveTo(File.Create(dawgpath));

and here is how i am loading the dawg file

var fpath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "/EN_dawg.bin";
var file = File.Open(fpath, FileMode.Open);
dawgg = Dawg<bool>.Load(file);

The error occours in YaleDawg.cs at line 66.
Screenshot 2020-02-27 at 19 49 30

Thank you very much for the attention.

Поддерживается ли поиск с возможными заменами е<>ё?

Приветствую! Пишу по-русски, т.к. вижу, что вы русский, и вопрос касается русского языка.

В реализации DAWG на Python есть поиск с заменами, который позволяет находить слова с ё даже если записать их через е. Есть ли здесь такое?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.