Giter Club home page Giter Club logo

fuzzywuzzy's Introduction

This is an in-progress port of seatgeek's fuzzywuzzy Python library to C++. When done, this library will have the same interface and behavior.

The underlaying C-library (python-Levenshtein, mirrored here) has been stripped of its Python interfacing and been wrapped around some C++ code.

files in src/ Python/C-lib equivalent
fuzzywuzzy.{c,h}pp and string_matcher.{c,h}pp Line-by-line Python-to-C++ translations of the Python library and python-Levenshtein's StringMatcher.py.
wrapper.{c,h}pp (Python-interfaced-)C-to-C++ wrapper of ratio_py, get_opcodes_py, get_matching_blocks_py, etc. from python-Levenshtein.
utils.{c,h}pp Utility functions, translated from the Python library's utils.py.
levenshtein.{c,h} The underlaying C functions, copied verbatim.

Usage

#include <fuzzywuzzy>

Simple Ratio

fuzz::ratio("this is a test", "this is a test!"); // returns 97

Partial Ratio

fuzz::partial_ratio("this is a test", "this is a test!"); // return 100

Token Sort Ratio

fuzz::ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"); // returns 91

fuzz::token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"); // returns 100

Token Set Ratio

fuzz::token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"); // returns 83 (this should be 84)

fuzz::token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"); // returns 100

fuzzywuzzy's People

Contributors

judaew avatar marty1885 avatar neartox avatar nickdunning avatar tmplt avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

fuzzywuzzy's Issues

Add CI

Repo is picking up activity after being dormant for a while. I don't have a significant amount of time for this repo. A CI should be written, with tests, to lighten the burden.

write a CLI interface

Useful for scripts where any fuzzy string matching is wanted. Should #5 (rewrite into C) be resolved first?

Call the CLI fuzz?

Segmentation fault in dedupe on large records.

This routine works fine on small sample size (around couple of thousands).

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "fuzzywuzzy.hpp"
#include "process.hpp"

int main() {
  std::ifstream file("names.txt");
  std::string str;
  std::vector<std::string> v;

  while (std::getline(file, str))
    v.push_back(str);

  auto erg = fuzz::dedupe(v, 99);
  std::cout << "Number of records " << v.size() << "\nNumber of duplicates "
            << v.size() - erg.size() << std::endl;

  std::ofstream outFile("dedupe.txt");
  for (const auto &e : erg) outFile << e << "\n";
  std::cout << "file 'dedupe.txt' has been created.\n";
}

But, When trying to remove duplicates from a large list of names.

$ wc -l names.txt
45796 names.txt

After leaving it running for around 7 hours, it segfaulted.

Segmentation fault (core dumped)

real    413m42.439s
user    413m10.378s
sys     0m26.022s

I don't think it's lack of resources on my end. RAM consumption during runtime was below 10%.

Progress and Todo

Implementation Progress

Simple and advanced scoring functions:

  • ratio()
  • partial_ratio()
    • fix coredump
  • token_sort_ratio()
  • partial_token_sort_ratio()
  • token_set_ratio()
  • partial_token_set_ratio()

Combination API:

  • QRatio()
    • port the code over
    • test that it works
  • WRatio()
    • port the code over
    • fix segfault

Other:

  • Process
    • extractWithoutOrder()
    • extract()
    • extractBests()
    • extractOne()
    • dedupe()
  • Unicode equivalent of the simple/advanced scoring functions

TODO

  • Compile the C code with a C-compiler instead?
  • Rename diffutils to ̶w̶r̶a̶p̶ wrapper?
  • Add compiler requirements to README.
  • Copy test bench from the Python library.
  • Generalize fuzzywuzzy.hpp with templates (objects exposing a .c_str())?
  • Split headers into an include directory.
  • Use the ICU library for string operations?

Unicode/UTF-8 support

Hi, I want to contribute Unicode support for the library. Do you know how much/what have to be done for the feature?

Rewrite into C?

Doing something serious in C other than just the odd exercise certainly would help me in learning the language, which I've put on ice for a while.

partial_ratio incorrectly using substr causing ratio to be incorrect

Within the partial_ratio function there is a bug with how we're using substr function causing the ratio to be incorrect.

https://github.com/tmplt/fuzzywuzzy/blob/master/src/fuzzywuzzy.cpp#L50-L52

/*
     * Each block represents a string of matching characters
     * in a string of the form (idx_1, idx_2, len). The best
     * partial match will block align with at least one
     * of those blocks.
     * e.g. shorter = "abcd", longer "XXXbcdeEEE"
     * block = (1, 3, 3)
     * best score == ratio("abcd", "Xbcd")
     */
    vector<double> scores;
    for (const auto &block : blocks) {
        size_t long_start = utils::max(0, block.dpos - block.spos);
        size_t long_end   = long_start + shorter.length();  // BUG -> this should just be shorter.length(), not long_start + shorter.length()

        auto long_substr = longer.substr(long_start, long_end);
        auto m2 = string_matcher(shorter, long_substr);
        double r = m2.ratio();

        if (r > 0.995)
            return 100;
        else
            scores.push_back(r);
    }

If you compare the functionality between python's library and this Cpp library by using print statements you can see the ratios being calculated are different.

Example between the two.

  • GIVEN: I'm calling fuzzy::partial_token_set_ratio("Move", "Label name ConstraintNoRackMoveIn")

  • WHEN: the python library version

  • THEN: partial_token_set_ratio will return a ratio => 100

  • GIVEN: I'm calling fuzzy::partial_token_set_ratio("Move", "Label name ConstraintNoRackMoveIn")

  • WHEN: the Cpp library version

  • THEN: partial_token_set_ratio will return a ratio => 50

The Expectation:

  • The two ratios returned are different when both should return 100.

The Issue:

  • When partial_ratio is called, it's called with the following request
    • partial_ratio("move", "constraintnorackmovein label name")
  • When it gets to the blocks section. There are two possible blocks.
    • block: (0, 16, 3) => 0 is char 'm' in "move", 16 is char 'm' in "constraintnorackmovein" and 3 is the length of matching chars
    • block: (3, 32, 1) => 3 is char 'e' in "move", 32 is char 'e' in "name" and 1 is the length of matching chars
  • The first block is the one that matters here since that's going to be the biggest match in terms of length.
  • Now within the loop of the block section, we would get the following results
    • long_start: 16 => max(0, 16-0 = 16) = 16
      • size_t long_start = utils::max(0, block.dpos - block.spos);
    • long_end: 20 => 16+4=20
      • size_t long_end = long_start + shorter.length();
    • long_substr: "movein label name" => longer.substr(16, 20) => string.substr(position, position+length)
      • auto long_substr = longer.substr(long_start, long_end);
  • The issue is that the second value passed into substr shouldn't be "20", it should be "4" because "move" is the match and it's a total length of 4. So it should be...
    • long_start: 16 => max(0, 16-0 = 16) = 16
      • size_t long_start = utils::max(0, block.dpos - block.spos);
    • long_end: 4 => "move".length() => 4
      • size_t long_end = shorter.length();
    • long_substr: "move" => longer.substr(16, 4) => string.substr(position, position+length)
      • auto long_substr = longer.substr(long_start, long_end);
  • By changing long_end to just be shorter.length() only this will correct the ratio being returned in C++ so that it correctly returns 100 as the ratio.

C++ substr:
The substring function takes two values pos and len as an argument and returns a newly constructed string object with its value initialized to a copy of a sub-string of this object. Copying of string starts from pos and is done till pos+len means [pos, pos+len).

The fix:
https://github.com/tmplt/fuzzywuzzy/blob/master/src/fuzzywuzzy.cpp#L50-L52

/*
     * Each block represents a string of matching characters
     * in a string of the form (idx_1, idx_2, len). The best
     * partial match will block align with at least one
     * of those blocks.
     * e.g. shorter = "abcd", longer "XXXbcdeEEE"
     * block = (1, 3, 3)
     * best score == ratio("abcd", "Xbcd")
     */
    vector<double> scores;
    for (const auto &block : blocks) {
        size_t long_start = utils::max(0, block.dpos - block.spos);
        size_t long_end   = shorter.length();  // FIX

        auto long_substr = longer.substr(long_start, long_end);
        auto m2 = string_matcher(shorter, long_substr);
        double r = m2.ratio();

        if (r > 0.995)
            return 100;
        else
            scores.push_back(r);
    }

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.