Giter Club home page Giter Club logo

stringcompare's Introduction

Python package codecov CodeFactor Lifecycle Maturing Release version Sponsors

StringCompare: Efficient String Comparison Functions

StringCompare is a Python package for efficient string similarity computation and approximate string matching. It is inspired by the excellent comparator and stringdist R packages, and from the equally excellent py_stringmatching, jellyfish, and textdistance Python packages.

The key feature of StringCompare is a focus on speed, extensibility and maintainability through its pybind11 C++ implementation. StringCompare is faster than most other Python libraries (see benchmark below) and much more memory efficient when dealing with long strings.

The complete API documentation is available on the project website olivierbinette.github.io/StringCompare.

Installation

Install the released version from github using the following commands:

    pip install git+https://github.com/OlivierBinette/StringCompare.git@release

Project Roadmap

StringCompare currently implements edit distances and similarity functions, such as the Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler distances. This is stage 1 of the following development roadmap:

Stage Goals Status
1 pybind11 framework and edit-based distances (Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler) ✔️
2 Token-based and hybrid distances (tf-idf similarity, LSH, Monge-Elkan, ...) ...
3 Vocabulary optimizations and metric trees ...
4 Embeddings and string distance learning ...

Examples

Comparison algorithms are instanciated as Comparator object, which provides the compare() method (equivalent to __call__()) for string comparison.

from stringcompare import Levenshtein, Jaro, JaroWinkler, DamerauLevenshtein, LCSDistance

lev = Levenshtein(normalize=True, similarity=False)

lev("Olivier", "Oliver") # same as lev.compare("Olivier", "Oliver")
0.14285714285714285

Comparator objects also provide the elementwise() function for elementwise comparison between lists

lev.elementwise(["Olivier", "Olivier"], ["Oliver", "Olivia"])
array([0.14285714, 0.26666667])

and the pairwise() function for pairwise comparisons.

lev.pairwise(["Olivier", "Oliver"], ["Olivier", "Olivia"])
array([[0.        , 0.26666667],
       [0.14285714, 0.28571429]])

Benchmark

Comparison of the Damerau-Levenshtein implementation speed for different Python packages, when comparing the strings "Olivier Binette" and "Oilvier Benet":

from timeit import timeit
from tabulate import tabulate

# Comparison functions
from stringcompare import DamerauLevenshtein
cmp = DamerauLevenshtein()
from jellyfish import damerau_levenshtein_distance
from textdistance import damerau_levenshtein

functions = {
    "StringCompare": cmp.compare,
    "jellyfish": damerau_levenshtein_distance,
    "textdistance": damerau_levenshtein,
}

table = [
    [name, timeit(lambda: fun("Olivier Binette", "Oilvier Benet"), number=1000000) * 1000]
    for name, fun in functions.items()
]
print(tabulate(table, headers=["Package", "avg runtime (ns)"]))
Package          avg runtime (ns)
-------------  ------------------
StringCompare             697.834
jellyfish                 974.363
textdistance             3982.73

Performance notes

The use of pybind11 comes with a small performance overhead. We could be faster if we directly interfaced with CPython.

However, the use of pybind11 allows the library to be easily extensible and maintainable. The C++ implementation has little to worry about Python, excepted for the use of a pybind11 numpy wrapper in some places. Pybind11 takes care of the details of exposing the C++ API to Python.

Known Bugs

pybind11 has compatibility issues with gcc 11 (e.g. on Ubuntu 21.10). If running Linux and gcc --version is 11, then use the following commands to configure your environment before (re)installing:

        sudo apt install g++-9 gcc-9
        export CC=gcc-9 CXX=g++-9

If this is unsuccessful, you might want to use StringCompare within a Docker container. I recommend using the python:3.7.9 base image. For example, after installing docker, you can launch an interactive bash session and install StringCompare as follows:

        sudo docker run -it python:3.7.9 bash
        pip install git+https://github.com/OlivierBinette/StringCompare.git
        python
        >>> import stringcompare

Please report installation issues here.

Contribute

StringCompare is currently in early development stage and contributions are welcome! See the contributing page for more information.

Acknowledgements

This project is made possible by the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the support of a G-Research grant.

I would also like to thank the support of my individual Github sponsors.

stringcompare's People

Contributors

garrett-allen avatar olivierbinette avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

stringcompare's Issues

Implement Jaccard similarity function in Python

Implement the Jaccard similarity function which, given two strings s and t, and given a Tokenizer instance tokens, returns the jaccard similarity between tokens(s) and tokens(t).

Here a tokenizer is a function which transforms a string into a set (or a multiset). The Jaccard similarity between two sets A and B is defined as |A and B|/|A or B|

Add weighting parameters to edit distances

Edit-based distances can have their different operations (transposition, substitution, deletion, or insertion) be weighted to allow more flexibility. It would be nice to provide this functionality to the package.

We need to keep in mind that users might want to be able to learn these weights from data. So tuning these parameters should bring as little computational overhead as possible.

It would be best to wait working on this issue until we have a working case study for which we want to implement string distance learning.

Checking for null case should be done at the token bag level for the Jacard Similarity

The check for null case should be done at the token bag level rather than the string level:

if len(s) + len(t) == 0:

I would recommend refactoring jaccard.py as follows:

  1. Have the jacard() function take two token sets as arguments and compute their jaccard similarity (overlap percentage). Checking for empty token bags should be done here.
  2. Have the compare() function deal with the tokenization and anything else (e.g. transforming the distance to a similarity).

Add user examples

Create an examples folder structured as follows:

examples/
├─ 1-getting-started.ipynb
├─ 2-interesting-use-case.ipynb
└─ ...

Each python notebook should be numbered and contain a working example of StringCompare's features and usage.

The first notebook, 1-getting-started.ipynb, should walk the user through the installation of the package, the logical structure of the package, and simple examples (kind of like the package README does, but in a bit more detail).

Following notebooks should introduce more advanced features and use cases. For instance, we could cover how to learn the parameters of string comparison functions from data, how to do blocking to speed-up finding close matches, or how to feed string comparisons to a machine learning algorithm for fuzzy string matching. Any working example would be suitable here.

Ideas for specific user examples can be proposed by opening separate Github Issues.

The goal of user examples

There are three main goals to providing user examples in this way:

  1. It provides users with a guide to working with the package and working code as a starting point.
  2. It provides us (the developers) with motivating use cases and a playground on which to experiment with new functionality.
  3. It provides us with a way to test end-to-end workflows involving the use of StringCompare. Formal tests are as simple as using pytest and the testbook package to automatically check that all notebooks run without error.

Add benchmark user example

Add a user example (see #2) which provides a fair and exhaustive benchmark of StringCompare against other string comparison libraries.

Create proper separation between Python and C++ code

Currently, the package uses the C++ code by default and leaves the use of the pure Python implementation to the discretion of the user.

It might be better to more clearly separate the Python and C++ implementation. Some ideas:

  • provide separate "strincompare" and "cstringcompare" packages,
  • at package load time, try using C++ implementation and fall back on the pure Python implementation if loading fails.

There are also some issues with having all C++ code in headers. Compilation is still very quick but I'm not sure how it will scale as the package grows. It would also be nice to be able to provide the C++ code as a C++ library which is independent of the Python implementation...

We'll see what refactoring is needed as the package grows.

Create Tokenizer class

Create a class to tokenize strings.

It should be able to tokenize strings into both sets and multisets.

Address matching case study

Create a user example (see #2) which shows how StringCompare can be used to match business names.

That is, suppose we have a long list L of business names. Given another business name provided by a user, we want to be able to find the name in L which most closely matches it.

We can address this problem in a few steps:

  1. Identify an open dataset to work with for the case study.
  2. Identify a string comparison function which works best to match similar business names.
  3. Implement the brute force solution which computes all distances and returns the closest match.
  4. Try to speed up computation using indexing/blocking.
  5. Try to speed up computation using B-trees.
  6. Try to find quick approximate solutions using locality sensitive hashing.

Steps 1-3 are the most important. Steps 4-6 can be explored if they seem interesting.

References

[BUG] Importing(using ".", ".."). Unable to access preprocessing directory

Describe the bug

When we try to import packages in Jaccard, we get "Import Error: attempted relative import with no known parent package". We specifically have issues with
"from .comparator import StringComparator
from ..preprocessing.tokenizer import Tokenizer, WhitespaceTokenizer"
and the dots before comparator and preprocessing.

When we remove the dots the code runs but with the dots there the code doesn't run.

Here is our code so far:

from .comparator import StringComparator
from ..preprocessing tokenizer import Tokenizer, WhitespaceTokenizer
`
def jaccard(s, t, token):
s_token = token(s)
t_token = token(t)

intersection = [value for value in s_token if value in t_token]
len_intersection = len(intersection)
return len_intersection / (len(s_token) + len(t_token) - len_intersection)

class Jaccard(StringComparator):
def init(self, similarity = True, tokenizer = WhitespaceTokenizer()):
self.similarity = similarity
self.tokenizer = tokenizer

def compare(self, s, t):
    if self.similarity:
        return jaccard(s, t, self.tokenizer)
    else:
        return 1 - jaccard(s, t, self.tokenizer)

'
We think that the issue has to do with relative paths somehow, as .comparator gives us the error above, but comparator gives us no errors. We could not get ..preprocessing to import unless we brought tokenizer into our local directory and imported that file directly, but obviously this doesn't solve the problem. We tried on both mac and linux systems using PyCharm and a cloud-based service called repl.it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.