jakebayer / fuzzysharp

C# .NET fuzzy string matching implementation of SeatGeek's well-known Python FuzzyWuzzy algorithm.

License: MIT License

C# 100.00%

fuzzysharp's Issues

How does performance compare to Fastenshtein?

Any idea on performance of FuzzySharp vs. Fastenshtein?

[Feature] Is it possible to improve the algorithm to a GC-free version?

I tried to write a search application demo (WPF UI) with this library, but found that the UI lagged on input.

Here is my code. I have dispatched the search logic to a non-UI thread, and the algorithm itself is fast enough not to block the UI.

            Observable
                .FromEvent<string>(
                    handler => SearchTextChanged += handler,
                    handler => SearchTextChanged -= handler)
                .ObserveOn(NewThreadScheduler.Default)
                .Throttle(TimeSpan.FromMilliseconds(200))
                .Select(text => text.Replace(" ", string.Empty).ToLowerInvariant())
                .Select(text => _fullCollection
                    .Select(item => (
                        entity: item,
                        radio: new[]
                        {
                            Fuzz.PartialRatio(text, item.Name, FuzzySharp.PreProcess.PreprocessMode.Full),
                            Fuzz.PartialRatio(text, item.Alias),
                            Fuzz.Ratio(text, item.Id),
                            //item.Name.Contains(text) ? 100 : 0,
                            //item.Alias.Contains(text) ? 100 : 0,
                            //item.Id.Contains(text) ? 100 : 0,
                        }.Max()))
                    .Where(item => item.radio > 50)
                    .OrderByDescending(item => item.radio == 100 ? item.entity.Frequency : item.radio)
                    .Take(10))
                .ObserveOnDispatcher(DispatcherPriority.Background)
                .Subscribe(searchResult =>
                {
                    SearchResult.CanNotify = false;
                    SearchResult.Clear();
                    foreach ((BrandCodeEntity entity, int radio) item in searchResult)
                    {
                        item.entity.Radio = item.radio;
                        SearchResult.Add(item.entity);
                    }

                    SearchResult.CanNotify = true;
                });

So I profiled it and found that running the algorithm spends about 1 second in GC.
If I use string.Contains() instead, I don't have this problem.
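
To illustrate what a lower-allocation approach could look like, here is a minimal sketch of a two-row Levenshtein distance that rents its working buffers from ArrayPool instead of allocating them per call. This is not FuzzySharp's implementation and it does not reproduce FuzzySharp's scores; the class and method names are hypothetical.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, not part of FuzzySharp.
public static class LowAllocLevenshtein
{
    // Plain Levenshtein distance (insert, delete, substitute all cost 1).
    // Only two rows of the DP table are kept, and both are rented from a shared pool.
    public static int Distance(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] prev = ArrayPool<int>.Shared.Rent(b.Length + 1);
        int[] curr = ArrayPool<int>.Shared.Rent(b.Length + 1);
        try
        {
            for (int j = 0; j <= b.Length; j++) prev[j] = j;

            for (int i = 1; i <= a.Length; i++)
            {
                curr[0] = i;
                for (int j = 1; j <= b.Length; j++)
                {
                    int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                    curr[j] = Math.Min(
                        Math.Min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
                }
                // Swap rows so that prev always holds the last completed row.
                (prev, curr) = (curr, prev);
            }
            return prev[b.Length];
        }
        finally
        {
            ArrayPool<int>.Shared.Return(prev);
            ArrayPool<int>.Shared.Return(curr);
        }
    }
}
```

Because the parameters are spans, callers can pass strings directly or slices of them without creating substrings.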

TokenAbbreviationScorerBase doesn't handle empty strings

I'm not sure whether it's really logical, but the other implementations of IRatioScorer can score blank and empty strings without throwing an exception. The TokenAbbreviationScorerBase classes throw an exception.
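
Until the scorer handles this itself, a caller-side guard is one possible workaround. The sketch below wraps any two-string scoring function and short-circuits on blank input; the wrapper name is hypothetical, and returning 0 for blank input is an assumption rather than documented library behaviour.

```csharp
using System;

// Hypothetical wrapper, not part of FuzzySharp.
public static class SafeScoring
{
    // Returns 0 (an assumed convention) when either input is null, empty, or whitespace;
    // otherwise delegates to the supplied scorer.
    public static int ScoreOrZero(Func<string, string, int> scorer, string a, string b)
    {
        if (string.IsNullOrWhiteSpace(a) || string.IsNullOrWhiteSpace(b))
        {
            return 0;
        }
        return scorer(a, b);
    }
}
```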

FuzzySharp for non-Latin languages

We would like to use FuzzySharp for string comparison in some non-Latin languages, including Greek and Russian.
Could you please confirm that the best way to use FuzzySharp for this purpose is the Extract methods in the Process class, with the processor parameter (s) => s?
Many thanks

matched index

Hi,
could I get a list of indices for the match result?
For example,
Fuzz.TokenInitialismRatio("NASA", "National Aeronautics and Space Administration");
could return
result(89, [0, 10, 26, 32])

The first number is the score; the array is the list of indices of the characters N, A, S, A in National Aeronautics and Space Administration.
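
As a rough illustration of the requested output, the following sketch finds the zero-based positions of the word-initial characters that match an initialism, in order. It is independent of FuzzySharp and computes no score; the class and method names are hypothetical. With zero-based indexing, the example above yields 0, 9, 25, 31.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of FuzzySharp.
public static class InitialismIndices
{
    // Walks the text once and records the index of each word start whose first
    // character matches the next unmatched letter of the initialism (case-insensitive).
    public static List<int> Find(string initialism, string text)
    {
        var indices = new List<int>();
        int next = 0;
        for (int i = 0; i < text.Length && next < initialism.Length; i++)
        {
            bool wordStart = !char.IsWhiteSpace(text[i])
                && (i == 0 || char.IsWhiteSpace(text[i - 1]));
            if (wordStart
                && char.ToUpperInvariant(text[i]) == char.ToUpperInvariant(initialism[next]))
            {
                indices.Add(i);
                next++;
            }
        }
        return indices;
    }
}
```

Words that do not match the next expected letter (such as "and" here) are simply skipped.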

Ignore value or some kind of a key?

Is there a way, when searching, to attach a value that is ignored by the matching but serves as a key for the result?

In my case I have some entities; I extract some data from them and run a search. When I get the results, I don't know which entity the matched data belongs to.

Extract method with `(string query, IEnumerable<T> choices)` signature

Currently the Process.Extract... methods have 2 signatures:

1: string query, IEnumerable<string> choices:

  public static IEnumerable<ExtractedResult<string>> ExtractAll(
      string query, 
      IEnumerable<string> choices, 
      Func<string, string> processor = null, 
      IRatioScorer scorer = null,
      int cutoff = 0)

and 2: T query, IEnumerable<T> choices:

  public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
      T query, 
      IEnumerable<T> choices,
      Func<T, string> processor,
      IRatioScorer scorer = null,
      int cutoff = 0)

In my case the user enters a string to filter a List<T> of objects.

I can use 1 if I convert to string first, collect the results to HashSet<string>, and use that to filter the original List<T>:

  public static IEnumerable<Dto> Example1(string query, IEnumerable<Dto> list)
  {
      var set = Process.ExtractAll(query, list.Select(x => x.Name))
          .Select(result => result.Value)
          .ToImmutableHashSet();
      return list.Where(dto => set.Contains(dto.Name));
  }

Or 2 if I create a dummy T query object from the string entered by the user:

  public static IEnumerable<Dto> Example2(string query, IEnumerable<Dto> list)
  {
      var dummy = new Dto(query);
      return Process.ExtractAll(dummy, list, dto => dto.Name)
          .Select(result => result.Value);
  }

The 2nd one isn't that bad... but tbh I struggle to think of a case where you would have a T query? Especially since the Func<T, string> processor is required for this overload.

So I think a signature like this would be useful:

  public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
      string query, 
      IEnumerable<T> choices,
      Func<T, string> processor,
      IRatioScorer scorer = null,
      int cutoff = 0)

To be used like:

  public static IEnumerable<Dto> Example3(string query, IEnumerable<Dto> list)
  {
      return Process.ExtractAll(query, list, dto => dto.Name)
          .Select(result => result.Value);
  }
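
For anyone who needs this shape today, the proposed overload can be approximated outside the library. The sketch below is a hypothetical helper, not FuzzySharp code: it takes the scorer as a plain delegate (so any two-string scoring function can be plugged in) and returns value/score tuples rather than ExtractedResult<T>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper mirroring the proposed signature; not part of FuzzySharp.
public static class ExtractSketch
{
    // Scores each choice's projected string against the query and keeps
    // the original object alongside its score.
    public static IEnumerable<(T Value, int Score)> ExtractAll<T>(
        string query,
        IEnumerable<T> choices,
        Func<T, string> processor,
        Func<string, string, int> scorer,
        int cutoff = 0)
    {
        return choices
            .Select(choice => (Value: choice, Score: scorer(query, processor(choice))))
            .Where(result => result.Score >= cutoff);
    }
}
```

Because the original object travels with its score, there is no need for a dummy query object or a second lookup by name.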

[Feature] Provide a fuzzy search via dotnet CLI tool

I think the functionality is perfect candidate to be used as a CLI tool, specifically as a dotnet CLI tool https://docs.microsoft.com/en-us/dotnet/core/tools/global-tools

What do you think?

There is a concerning TODO comment in Levenshtein.cs

It reads: TODO: Is this right?

Very scary. Please remove or fix.

Unexpected results with WeightedRatioScorer

Using the WeightedRatioScorer for two different strings, one of them gives an unexpected result.

Input

+30.0% Damage to Close Enemies [30.01%
+14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%

Choices

"+#% Damage",
"+#% Damage to Crowd Controlled Enemies",
"+#% Damage to Close Enemies",
"+#% Damage to Chilled Enemies",
"+#% Damage to Poisoned Enemies",
"#% Block Chance#% Blocked Damage Reduction",
"#% Damage Reduction from Bleeding Enemies",
"#% Damage Reduction",
"+#% Cold Damage"

Results

Scorer: WeightedRatioScorer
Input 1: +30.0% Damage to Close Enemies [30.01%

Main: (string: +#% Damage, score: 90, index: 0)
Main: (string: +#% Damage to Close Enemies, score: 90, index: 2)
Main: (string: +#% Damage to Chilled Enemies, score: 77, index: 3)
Main: (string: +#% Damage to Poisoned Enemies, score: 75, index: 4)
Main: (string: +#% Damage to Crowd Controlled Enemies, score: 67, index: 1)
Main: (string: +#% Cold Damage, score: 61, index: 8)
Main: (string: #% Damage Reduction from Bleeding Enemies, score: 59, index: 6)
Main: (string: #% Damage Reduction, score: 50, index: 7)
Main: (string: #% Block Chance#% Blocked Damage Reduction, score: 48, index: 5)
Elapsed time: 39
---
Scorer: WeightedRatioScorer
Input 2: +14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%

Main: (string: +#% Damage to Crowd Controlled Enemies, score: 90, index: 1)
Main: (string: +#% Damage to Close Enemies, score: 73, index: 2)
Main: (string: +#% Damage to Chilled Enemies, score: 69, index: 3)
Main: (string: +#% Damage to Poisoned Enemies, score: 68, index: 4)
Main: (string: +#% Cold Damage, score: 61, index: 8)
Main: (string: +#% Damage, score: 60, index: 0)
Main: (string: #% Damage Reduction from Bleeding Enemies, score: 56, index: 6)
Main: (string: #% Damage Reduction, score: 50, index: 7)
Main: (string: #% Block Chance#% Blocked Damage Reduction, score: 40, index: 5)
Elapsed time: 0
---

For some reason, Input 1 gives +#% Damage a score of 90, while for Input 2 it works as expected and +#% Damage gets a score of 60.

Source

Here is the source to reproduce the issue.

using System;
using System.Collections.Generic;
using System.Reflection;
using FuzzySharp;
using FuzzySharp.SimilarityRatio;
using FuzzySharp.SimilarityRatio.Scorer.Composite;

namespace FuzzySharpTest
{
    internal class Program
    {
        static void Main(string[] args)
        {
            string input1 = "+30.0% Damage to Close Enemies [30.01%";
            string input2 = "+14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%";

            List<string> choices = new List<string>()
            {
                "+#% Damage",
                "+#% Damage to Crowd Controlled Enemies",
                "+#% Damage to Close Enemies",
                "+#% Damage to Chilled Enemies",
                "+#% Damage to Poisoned Enemies",
                "#% Block Chance#% Blocked Damage Reduction",
                "#% Damage Reduction from Bleeding Enemies",
                "#% Damage Reduction",
                "+#% Cold Damage"
            };

            // WeightedRatioScorer - input1
            var watch = System.Diagnostics.Stopwatch.StartNew();
            Console.WriteLine("Scorer: WeightedRatioScorer");
            Console.WriteLine($"Input 1: {input1}");
            Console.WriteLine(string.Empty);

            var results = Process.ExtractTop(input1, choices, scorer: ScorerCache.Get<WeightedRatioScorer>(), limit: 9);
            foreach (var r in results)
            {
                Console.WriteLine($"{MethodBase.GetCurrentMethod()?.Name}: {r}");
            }

            watch.Stop();
            var elapsedMs = watch.ElapsedMilliseconds;
            Console.WriteLine($"Elapsed time: {elapsedMs}");
            Console.WriteLine("---");


            // WeightedRatioScorer - input2
            watch = System.Diagnostics.Stopwatch.StartNew();
            Console.WriteLine("Scorer: WeightedRatioScorer");
            Console.WriteLine($"Input 2: {input2}");
            Console.WriteLine(string.Empty);

            results = Process.ExtractTop(input2, choices, scorer: ScorerCache.Get<WeightedRatioScorer>(), limit: 9);
            foreach (var r in results)
            {
                Console.WriteLine($"{MethodBase.GetCurrentMethod()?.Name}: {r}");
            }

            watch.Stop();
            elapsedMs = watch.ElapsedMilliseconds;
            Console.WriteLine($"Elapsed time: {elapsedMs}");
            Console.WriteLine("---");
        }
    }
}
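
One possible explanation for the 90 versus 60 difference is the length-based scaling used by Python FuzzyWuzzy's WRatio: when one processed string is at least 1.5 times longer than the other, partial scores are used and multiplied by 0.9, and when it is more than 8 times longer the multiplier drops to 0.6. A perfect partial match would then score 90 or 60 depending on which side of that threshold the length ratio falls. The sketch below only models that rule; that FuzzySharp applies the same thresholds is an assumption, and the names are hypothetical.

```csharp
using System;

// Models the length-ratio rule from Python FuzzyWuzzy's WRatio.
// Assumption: FuzzySharp's WeightedRatioScorer follows the same thresholds.
public static class WeightedScaleSketch
{
    // Returns the multiplier applied to partial scores, or null when the
    // strings are close enough in length that partial scoring is not used.
    public static double? PartialScale(int length1, int length2)
    {
        if (length1 == 0 || length2 == 0) return null;

        double lenRatio = (double)Math.Max(length1, length2) / Math.Min(length1, length2);
        if (lenRatio < 1.5) return null;

        return lenRatio > 8 ? 0.6 : 0.9;
    }
}
```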

[Query] Can we get score in a case insensitive manner?

First of all, thanks for the awesome library! 💯
I have a couple of questions. It would be a great help if you could answer them.

How is the tokenization done? As far as I can tell from browsing the code, it is based on white space. Is there a way to direct the scorer to split camel-case tokens? For example, the string MyDocuments would be tokenized to ["My", "Documents"].
I do not see any parameter to direct the scorer to score in a case-insensitive manner. Is this not possible, or am I missing something?
Below are the scores for two pairs of strings, (mysmilarstring, MyawfullySimilarStirng) and (mysmilarstring, myawfullysimilarstirng). The scores differ between the pairs even though the strings differ only in letter casing.

-------------------------------FuzzySharp-------------------------------------------------
mysmilarstring ||MyawfullySimilarStirng || Ratio = 56
mysmilarstring ||MyawfullySimilarStirng || PartialRatio = 71
mysmilarstring ||MyawfullySimilarStirng || TokenSortRatio = 56
mysmilarstring ||MyawfullySimilarStirng || PartialTokenSortRatio = 71
mysmilarstring ||MyawfullySimilarStirng || TokenSetRatio = 56
mysmilarstring ||MyawfullySimilarStirng || PartialTokenSetRatio = 71
mysmilarstring ||MyawfullySimilarStirng || TokenInitialismRatio = 0
mysmilarstring ||MyawfullySimilarStirng || PartialTokenInitialismRatio = 0
mysmilarstring ||MyawfullySimilarStirng || WeightedRatio = 64

-------------------------------FuzzySharp-------------------------------------------------
mysmilarstring ||myawfullysimilarstirng || Ratio = 72
mysmilarstring ||myawfullysimilarstirng || PartialRatio = 86
mysmilarstring ||myawfullysimilarstirng || TokenSortRatio = 72
mysmilarstring ||myawfullysimilarstirng || PartialTokenSortRatio = 86
mysmilarstring ||myawfullysimilarstirng || TokenSetRatio = 72
mysmilarstring ||myawfullysimilarstirng || PartialTokenSetRatio = 86
mysmilarstring ||myawfullysimilarstirng || TokenInitialismRatio = 0
mysmilarstring ||myawfullysimilarstirng || PartialTokenInitialismRatio = 0
mysmilarstring ||myawfullysimilarstirng || WeightedRatio = 77
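
One caller-side option is to normalize both strings before scoring. The sketch below splits camel-case boundaries with a regular expression and lower-cases the result, so that a whitespace-based tokenizer would see separate tokens; it is a hypothetical helper and not part of FuzzySharp.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of FuzzySharp.
public static class Normalizer
{
    // Zero-width match between a lowercase letter or digit and an uppercase letter.
    private static readonly Regex CamelBoundary =
        new Regex(@"(?<=[a-z0-9])(?=[A-Z])", RegexOptions.Compiled);

    // "MyDocuments" becomes "My Documents".
    public static string SplitCamelCase(string s) => CamelBoundary.Replace(s, " ");

    // Splits camel case, then lower-cases with the invariant culture.
    public static string Normalize(string s) => SplitCamelCase(s).ToLowerInvariant();
}
```

Passing Normalize(a) and Normalize(b) to the scorer removes casing as a source of difference, at the cost of one extra string allocation per input.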

partialtokensetratio is wrong

Hi Jake
Thanks for this good stuff, but I'd like to point out that PartialTokenSetRatio is wrong: it always returns 100. Could you check it?

Release 1.0.2 on NuGet

I noticed the latest commit bumped the version to 1.0.2 and added compatibility with netstandard & co, however the latest version on NuGet is still 1.0.1.

Could you push 1.0.2 to NuGet? In the meantime I've compiled and packaged 1.0.2 myself and checked it into my project, but having it come from NuGet directly would be much nicer 🙂

Index of out range in TokenInitialismRatio

Hello, I was using your library without any issue until I ran into this case:

I got an exception when calling "TokenInitialismRatio" with these two strings: "lusiki plaza share block " (with a trailing space) and "jmft".

Here is the stack trace:
Exception thrown: 'System.IndexOutOfRangeException' in FuzzySharp.dll
System.Transactions Critical: 0 : Blablabal
Index was outside the bounds of the array.
at FuzzySharp.SimilarityRatio.Scorer.StrategySensitive.TokenInitialismScorerBase.<>c.<Score>b__0_0(String s)
at System.Linq.Enumerable.WhereSelectArrayIterator`2.MoveNext()
at System.String.Join[T](String separator, IEnumerable`1 values)
at FuzzySharp.SimilarityRatio.Scorer.StrategySensitive.TokenInitialismScorerBase.Score(String input1, String input2)
Then my code =)

Yours sincerely.
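
The trace points at a lambda that indexes into each token, which would fail on an empty token such as the one a trailing space can produce when splitting. Whether that is the actual cause is an assumption; the sketch below only shows one way to build an initialism that tolerates extra spaces. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of FuzzySharp.
public static class SafeInitialism
{
    // Builds the initialism from the first character of each non-empty token.
    // RemoveEmptyEntries drops the empty tokens caused by leading, trailing,
    // or repeated spaces, so t[0] is always valid.
    public static string Build(string input)
    {
        if (string.IsNullOrWhiteSpace(input)) return string.Empty;

        string[] tokens = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Concat(tokens.Select(t => t[0]));
    }
}
```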

DotNet 4.6 Compatibility

I found your NuGet package and it looks great for my application, but I can't run it with a .NET 4.6+ project. I got this error from the package installer:

Severity Code Description Project File Line Suppression State Error Could not install package 'FuzzySharp 1.0.0'. You are trying to install this package into a project that targets '.NETFramework,Version=v4.6', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author.

Any way to make it compatible?

Thanks

Sandy

PartialRatio not working as expected

The lib does not work as expected if there are multiple matches. I expect the method to return the match with the highest ratio.

string dePN = "Partnernummer";
int ratio = Fuzz.PartialRatio(dePN, text);

Examples of texts that do not behave as expected:
text = "Partne\nrnum\nmerASDFPartnernummerASDF"; => Ratio == 85 (should be 100)
text = "PartnerrrrnummerASDFPartnernummerASDF"; => Ratio == 77 (should be 100)
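
The behaviour expected here amounts to an exhaustive search: compare the shorter string against every same-length window of the longer one and keep the best score. The sketch below shows that idea with a pluggable ratio function; it is not how FuzzySharp computes PartialRatio, and PositionalRatio is only a simple stand-in scorer for the demonstration. All names are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of FuzzySharp.
public static class WindowSearch
{
    // Slides a window of the shorter string's length across the longer string
    // and returns the highest score any window achieves.
    public static int BestWindowScore(string s1, string s2, Func<string, string, int> ratio)
    {
        string shorter = s1.Length <= s2.Length ? s1 : s2;
        string longer = s1.Length <= s2.Length ? s2 : s1;
        if (shorter.Length == 0) return 0;

        int best = 0;
        for (int start = 0; start + shorter.Length <= longer.Length; start++)
        {
            int score = ratio(shorter, longer.Substring(start, shorter.Length));
            if (score > best) best = score;
            if (best == 100) break; // cannot improve on a perfect match
        }
        return best;
    }

    // Stand-in scorer: percentage of positions holding the same character.
    // Assumes both strings have the same length.
    public static int PositionalRatio(string a, string b)
    {
        int same = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] == b[i]) same++;
        }
        return same * 100 / a.Length;
    }
}
```

This costs one ratio call per window, so it trades speed for the guarantee that an exact occurrence anywhere in the longer string is found.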

another partialtokensetratio problem

Jake, using PartialTokenSetRatio, str1="Vadeli Tl Bakiyesi" and str2="vadeli tl bakiyesi" are supposed to return 100, but it gives me 83. In the Python version I do get 100. Can you check it out, please?

Upgrade libraries to include support for .net core 3 and .net standard 2.1

Upgrade libraries to include declared support for .net core 3, .net core 3.1 and .net standard 2.1.

Multithreaded C# application and FuzzySharp - has anyone tried?

Hello,

Is there any reason why FuzzySharp should not be used in a multithreaded C# application? I'm curious if there are any "lessons learned" or problems with using this package with several Background Workers, each running in its own thread.

Thanks in advance!
Scott

Referenced assembly 'FuzzySharp, Version=1.0.4.0, Culture=neutral, PublicKeyToken=null' does not have a strong name

Currently the NuGet package is not strong named, causing incompatibilities with other public NuGet packages that are strong named.
This also affects me, since all of our company's projects are strong named.

Warning CS8002 Referenced assembly 'FuzzySharp, Version=1.0.4.0, Culture=neutral, PublicKeyToken=null' does not have a strong name. Foo C:\s\repo\Src\Foo\CSC 1 Active

Microsoft Guidance suggests that public packages be strong named:
https://docs.microsoft.com/en-us/dotnet/standard/library-guidance/strong-naming#create-strong-named-net-libraries

For more information, this repository had pretty much the same issue:
cloudevents/sdk-csharp#24