Giter Club home page Giter Club logo

udf_levenshtein's Introduction

Mysql Levenshtein UDF

Implementation of levenshtein edit distance algorithm to be used in mysql as a UDF. Most of the implementations I found did not allow for using table data in queries as they only allocated memory once. This implementation has both a shared memory buffer and bounds checking if the input exceeds the initialized memory where it will allocate new memory for each row.

Using scons to build the "libudf_levenshtein.so" file, but also included a Ruby FFI version and a rakefile with specs in it for testing.

Move the libudf_levenshtein.so file to you mysql plugins directly and install with:

"CREATE FUNCTION levenshtein RETURNS INTEGER SONAME 'libudf_levenshtein.so';"

Usage

SQL:

"select * from table where levenshtein(column1, column2) < 10"

"select * from table where levenshtein(column1, column2, 10) < 10"

The 3 parameter signature adds a threshold to the comparison and can significantly increase the execution speed. (If it can exit upon exceeding the threshold or determine that it does not need to execute because of the length differences then we get the execution cycles back and everybody wins).

Performance

While there are certainly things that can be done to increase the speed, this implementation of levenshtein is quite fast and efficient. The space required is only 2*min(length(A), length(B)).

rough benchmark: A full table scan on the first traunch of the google N-gram data (which contains ~48million words) takes approx 56 sec on my machine. Adding a naive levenshtein distance calculation to each word only results in approx 10 sec change in execution speed. (ie - it is pretty fast)

udf_levenshtein's People

Contributors

abrandoned avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.