
fuzzysharp's People

Contributors

bobld, chrisrrowland, jakebayer


fuzzysharp's Issues

[Feature] Is it possible to improve the algorithm to a GC-free version?

I tried to write a search application demo (WPF UI) with this library, but found that the UI lagged while typing.

Here is my code. The search logic is dispatched to a non-UI thread and, in fact, the algorithm itself is fast enough not to block the UI.

            Observable
                .FromEvent<string>(
                    handler => SearchTextChanged += handler,
                    handler => SearchTextChanged -= handler)
                .ObserveOn(NewThreadScheduler.Default)
                .Throttle(TimeSpan.FromMilliseconds(200))
                .Select(text => text.Replace(" ", string.Empty).ToLowerInvariant())
                .Select(text => _fullCollection
                    .Select(item => (
                        entity: item,
                        radio: new[]
                        {
                            Fuzz.PartialRatio(text, item.Name, FuzzySharp.PreProcess.PreprocessMode.Full),
                            Fuzz.PartialRatio(text, item.Alias),
                            Fuzz.Ratio(text, item.Id),
                            //item.Name.Contains(text) ? 100 : 0,
                            //item.Alias.Contains(text) ? 100 : 0,
                            //item.Id.Contains(text) ? 100 : 0,
                        }.Max()))
                    .Where(item => item.radio > 50)
                    .OrderByDescending(item => item.radio == 100 ? item.entity.Frequency : item.radio)
                    .Take(10))
                .ObserveOnDispatcher(DispatcherPriority.Background)
                .Subscribe(searchResult =>
                {
                    SearchResult.CanNotify = false;
                    SearchResult.Clear();
                    foreach ((BrandCodeEntity entity, int radio) item in searchResult)
                    {
                        item.entity.Radio = item.radio;
                        SearchResult.Add(item.entity);
                    }

                    SearchResult.CanNotify = true;
                });

So I did a performance analysis and found that running the algorithm causes about 1 second to be spent in GC.
If I use string.Contains() instead, I don't have this problem.
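
For reference, here is a minimal sketch of what an allocation-free edit-distance kernel can look like in C#. It is plain Levenshtein distance over spans, not FuzzySharp's scoring code; the class name `AllocationFreeDistance` and the 256-element stack limit are assumptions chosen for illustration, and it needs a runtime with `Span<T>` support.

```csharp
using System;
using System.Buffers;

public static class AllocationFreeDistance
{
    // Single-row Levenshtein distance. Short inputs use stack memory and
    // longer ones borrow a pooled array, so a call produces no garbage.
    public static int Levenshtein(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        // Keep b as the shorter input so the work row stays small.
        if (a.Length < b.Length)
        {
            ReadOnlySpan<char> tmp = a;
            a = b;
            b = tmp;
        }
        if (b.Length == 0) return a.Length;

        int size = b.Length + 1;
        int[] rented = null;
        Span<int> row = size <= 256
            ? stackalloc int[256]
            : (rented = ArrayPool<int>.Shared.Rent(size));
        row = row.Slice(0, size);

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j < size; j++) row[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            int diagonal = row[0]; // value of the previous row at column j - 1
            row[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int above = row[j]; // value of the previous row at column j
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                row[j] = Math.Min(Math.Min(above + 1, row[j - 1] + 1), diagonal + cost);
                diagonal = above;
            }
        }

        int result = row[b.Length];
        if (rented != null) ArrayPool<int>.Shared.Return(rented);
        return result;
    }
}
```

Called as `AllocationFreeDistance.Levenshtein("kitten", "sitting")`, this returns 3. Whether the library's ratio functions could be restructured this way is a separate question.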


FuzzySharp for non-Latin languages

We would like to use FuzzySharp for string comparison in some non-Latin languages, including Greek and Russian.
Can you please confirm that the best way to use FuzzySharp for this purpose is the Extract methods in the Process class, with the processor parameter (s) => s?
Many thanks
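
A minimal sketch of the call being asked about, under the assumption that passing a custom processor replaces the default string clean-up so that Greek or Cyrillic letters reach the scorer unchanged. The class and method names are hypothetical; only `Process.ExtractOne` and its `processor` argument come from the library.

```csharp
using System;
using FuzzySharp;

public static class NonLatinExample
{
    // Returns the choice that best matches the query, or null if none was returned.
    public static string BestMatch(string query, string[] choices)
    {
        // Assumption: the processor is applied to the query and to each choice.
        // Lowercasing here keeps the comparison case-insensitive while leaving
        // non-Latin letters intact; a bare (s) => s would skip case folding too.
        var best = Process.ExtractOne(query, choices, s => s.ToLowerInvariant());
        return best?.Value;
    }
}
```

For example, `NonLatinExample.BestMatch("москва", new[] { "Москва", "Минск" })` would be expected to pick "Москва".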

matched index

Hi,
Could I get a list of indices for the match result?
For example,
Fuzz.TokenInitialismRatio("NASA", "National Aeronautics and Space Administration");
could return
result(89, [0, 10, 26, 32])

The first number is the score; the array lists the index of each of the characters N, A, S, A in "National Aeronautics and Space Administration".
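
The index part of this request can be sketched independently of the library. The helper below is hypothetical (it is not a FuzzySharp API): it walks the text once and greedily matches each letter of the initialism against the first letter of the following words.

```csharp
using System;
using System.Collections.Generic;

public static class InitialismIndex
{
    // Returns the zero-based offset of each word whose first letter matches
    // the next letter of the initialism, or null when a letter is not found.
    public static List<int> Find(string initialism, string text)
    {
        var indices = new List<int>();
        int next = 0; // position of the initialism letter still to be matched
        for (int i = 0; i < text.Length && next < initialism.Length; i++)
        {
            bool wordStart = !char.IsWhiteSpace(text[i])
                && (i == 0 || char.IsWhiteSpace(text[i - 1]));
            if (wordStart
                && char.ToUpperInvariant(text[i]) == char.ToUpperInvariant(initialism[next]))
            {
                indices.Add(i);
                next++;
            }
        }
        return next == initialism.Length ? indices : null;
    }
}
```

For "NASA" and "National Aeronautics and Space Administration" this yields the zero-based offsets 0, 9, 25, 31, skipping the word "and". These differ slightly from the illustrative numbers in the request.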

Ignore value or some kind of a key?

When searching, is there a way to attach a value that is ignored by the matching but serves as a key for the result?

In my case I have some entities; I extract some data from them and run a search. When I get the result, I don't know which entity the matched data belongs to.
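
One possible approach, sketched under the assumption that the `Index` reported on each result is the position of the matched string in the choices sequence (the printed results further down this page show such an index). The `Entity` type and the `EntitySearch` helper are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using FuzzySharp;

public class Entity
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class EntitySearch
{
    // Searches by name only, then maps each hit back to the entity it came from.
    public static List<(Entity Entity, int Score)> Search(
        string query, IReadOnlyList<Entity> entities, int limit = 3)
    {
        return Process.ExtractTop(query, entities.Select(e => e.Name), limit: limit)
            // Assumption: result.Index points into the choices in their original order.
            .Select(result => (entities[result.Index], result.Score))
            .ToList();
    }
}
```

The key (here `Id`) never takes part in the scoring; it is recovered from the entity after the search.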

Extract method with `(string query, IEnumerable<T> choices)` signature

Currently the Process.Extract... methods have 2 signatures:

1: string query, IEnumerable<string> choices:

  public static IEnumerable<ExtractedResult<string>> ExtractAll(
      string query, 
      IEnumerable<string> choices, 
      Func<string, string> processor = null, 
      IRatioScorer scorer = null,
      int cutoff = 0)

and 2: T query, IEnumerable<T> choices:

  public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
      T query, 
      IEnumerable<T> choices,
      Func<T, string> processor,
      IRatioScorer scorer = null,
      int cutoff = 0)

In my case the user enters a string to filter a List<T> of objects.

I can use 1 if I convert to string first, collect the results to HashSet<string>, and use that to filter the original List<T>:

  public static IEnumerable<Dto> Example1(string query, IEnumerable<Dto> list)
  {
      var set = Process.ExtractAll(query, list.Select(x => x.Name))
          .Select(result => result.Value)
          .ToImmutableHashSet();
      return list.Where(dto => set.Contains(dto.Name));
  }

Or 2 if I create a dummy T query object from the string entered by the user:

  public static IEnumerable<Dto> Example2(string query, IEnumerable<Dto> list)
  {
      var dummy = new Dto(query);
      return Process.ExtractAll(dummy, list, dto => dto.Name)
          .Select(result => result.Value);
  }

The 2nd one isn't that bad... but tbh I struggle to think of a case where you would have a T query? Especially since the Func<T, string> processor is required for this overload.

So I think a signature like this would be useful:

  public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
      string query, 
      IEnumerable<T> choices,
      Func<T, string> processor,
      IRatioScorer scorer = null,
      int cutoff = 0)

To be used like:

public static IEnumerable<Dto> Example3(string query, IEnumerable<Dto> list)
{
    return Process.ExtractAll(query, list, dto => dto.Name)
        .Select(result => result.Value);
}
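
Until such an overload exists, a similar effect can be sketched without `Process` at all by scoring each projected string directly. The helper name `ExtractAllBy` is hypothetical, and it returns tuples rather than `ExtractedResult<T>`; it also assumes `Fuzz.WeightedRatio` is an acceptable stand-in for the default scorer and applies no preprocessing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using FuzzySharp;

public static class QueryExtraction
{
    // Hypothetical helper, not part of FuzzySharp: scores a plain string query
    // against a projected string of each choice and keeps the original objects.
    public static IEnumerable<(T Value, int Score)> ExtractAllBy<T>(
        string query,
        IEnumerable<T> choices,
        Func<T, string> selector,
        int cutoff = 0)
    {
        return choices
            .Select(choice => (Value: choice, Score: Fuzz.WeightedRatio(query, selector(choice))))
            .Where(result => result.Score >= cutoff);
    }
}
```

Usage would look like `QueryExtraction.ExtractAllBy(query, list, dto => dto.Name).Select(r => r.Value)`.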

Unexpected results with WeightedRatioScorer

Using the WeightedRatioScorer for two different strings, one of them gives an unexpected result.

Input

  1. +30.0% Damage to Close Enemies [30.01%
  2. +14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%

Choices

  • "+#% Damage",
  • "+#% Damage to Crowd Controlled Enemies",
  • "+#% Damage to Close Enemies",
  • "+#% Damage to Chilled Enemies",
  • "+#% Damage to Poisoned Enemies",
  • "#% Block Chance#% Blocked Damage Reduction",
  • "#% Damage Reduction from Bleeding Enemies",
  • "#% Damage Reduction",
  • "+#% Cold Damage"

Results

Scorer: WeightedRatioScorer
Input 1: +30.0% Damage to Close Enemies [30.01%

Main: (string: +#% Damage, score: 90, index: 0)
Main: (string: +#% Damage to Close Enemies, score: 90, index: 2)
Main: (string: +#% Damage to Chilled Enemies, score: 77, index: 3)
Main: (string: +#% Damage to Poisoned Enemies, score: 75, index: 4)
Main: (string: +#% Damage to Crowd Controlled Enemies, score: 67, index: 1)
Main: (string: +#% Cold Damage, score: 61, index: 8)
Main: (string: #% Damage Reduction from Bleeding Enemies, score: 59, index: 6)
Main: (string: #% Damage Reduction, score: 50, index: 7)
Main: (string: #% Block Chance#% Blocked Damage Reduction, score: 48, index: 5)
Elapsed time: 39
---
Scorer: WeightedRatioScorer
Input 2: +14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%

Main: (string: +#% Damage to Crowd Controlled Enemies, score: 90, index: 1)
Main: (string: +#% Damage to Close Enemies, score: 73, index: 2)
Main: (string: +#% Damage to Chilled Enemies, score: 69, index: 3)
Main: (string: +#% Damage to Poisoned Enemies, score: 68, index: 4)
Main: (string: +#% Cold Damage, score: 61, index: 8)
Main: (string: +#% Damage, score: 60, index: 0)
Main: (string: #% Damage Reduction from Bleeding Enemies, score: 56, index: 6)
Main: (string: #% Damage Reduction, score: 50, index: 7)
Main: (string: #% Block Chance#% Blocked Damage Reduction, score: 40, index: 5)
Elapsed time: 0
---

For some reason, Input 1 gives +#% Damage a score of 90, while for Input 2 it works as expected and +#% Damage gets a score of 60.
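
A possible explanation, offered as a sketch under two assumptions that are not confirmed by this page: that the weighted scorer follows the fuzzywuzzy WRatio heuristic (partial matches scaled by 0.9, or by 0.6 once one string is more than 8 times longer than the other), and that the default clean-up turns non-alphanumeric characters into spaces before trimming. Both rules are re-implemented here for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WeightedRatioSketch
{
    // Assumed clean-up: non-alphanumerics become spaces, then lowercase and trim.
    public static string Normalize(string s) =>
        Regex.Replace(s, "[^ a-zA-Z0-9]", " ").ToLowerInvariant().Trim();

    // Assumed WRatio rule for the scale applied to partial-match scores.
    // (The same heuristic skips partial matching when the ratio is below 1.5;
    // that case is not modelled here.)
    public static double PartialScale(string a, string b)
    {
        int len1 = Normalize(a).Length;
        int len2 = Normalize(b).Length;
        double lengthRatio = (double)Math.Max(len1, len2) / Math.Min(len1, len2);
        return lengthRatio > 8 ? 0.6 : 0.9;
    }
}
```

Under these assumptions "+#% Damage" normalizes to "damage" (6 characters), Input 1 to 36 characters and Input 2 to 52. The length ratios are 6.0 and about 8.7, so a perfect partial match of 100 would be scaled to 90 for Input 1 and to 60 for Input 2, which matches the reported scores.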

Source

Here is the source to reproduce the issue.

using System;
using System.Collections.Generic;
using FuzzySharp;
using FuzzySharp.SimilarityRatio;
using FuzzySharp.SimilarityRatio.Scorer.Composite;
using System.Reflection;

namespace FuzzySharpTest
{
    internal class Program
    {
        static void Main(string[] args)
        {
            string input1 = "+30.0% Damage to Close Enemies [30.01%";
            string input2 = "+14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%";

            List<string> choices = new List<string>()
            {
                "+#% Damage",
                "+#% Damage to Crowd Controlled Enemies",
                "+#% Damage to Close Enemies",
                "+#% Damage to Chilled Enemies",
                "+#% Damage to Poisoned Enemies",
                "#% Block Chance#% Blocked Damage Reduction",
                "#% Damage Reduction from Bleeding Enemies",
                "#% Damage Reduction",
                "+#% Cold Damage"
            };

            // WeightedRatioScorer - input1
            var watch = System.Diagnostics.Stopwatch.StartNew();
            Console.WriteLine("Scorer: WeightedRatioScorer");
            Console.WriteLine($"Input 1: {input1}");
            Console.WriteLine(string.Empty);

            var results = Process.ExtractTop(input1, choices, scorer: ScorerCache.Get<WeightedRatioScorer>(), limit: 9);
            foreach (var r in results)
            {
                Console.WriteLine($"{MethodBase.GetCurrentMethod()?.Name}: {r}");
            }

            watch.Stop();
            var elapsedMs = watch.ElapsedMilliseconds;
            Console.WriteLine($"Elapsed time: {elapsedMs}");
            Console.WriteLine("---");


            // WeightedRatioScorer - input2
            watch = System.Diagnostics.Stopwatch.StartNew();
            Console.WriteLine("Scorer: WeightedRatioScorer");
            Console.WriteLine($"Input 2: {input2}");
            Console.WriteLine(string.Empty);

            results = Process.ExtractTop(input2, choices, scorer: ScorerCache.Get<WeightedRatioScorer>(), limit: 9);
            foreach (var r in results)
            {
                Console.WriteLine($"{MethodBase.GetCurrentMethod()?.Name}: {r}");
            }

            watch.Stop();
            elapsedMs = watch.ElapsedMilliseconds;
            Console.WriteLine($"Elapsed time: {elapsedMs}");
            Console.WriteLine("---");
        }
    }
}

[Query] Can we get score in a case insensitive manner?

First of all, thanks for the awesome library! 💯
I have a couple of questions. It would be a great help if you could answer them.

  1. How is the tokenization done? As far as I can tell from browsing the code, it is based on white space. Is there a way to direct the scorer to split camel-case tokens? For example, the string MyDocuments would be tokenized to ["My", "Documents"].

  2. I do not see any parameter to direct the scorer to score in a case-insensitive manner. Is it not possible, or am I missing something?
    Below are the scores for two pairs of strings, (mysmilarstring, MyawfullySimilarStirng) and (mysmilarstring, myawfullysimilarstirng). The scores differ between the pairs even though the pairs differ only in letter casing.

    -------------------------------FuzzySharp-------------------------------------------------
    mysmilarstring ||MyawfullySimilarStirng || Ratio = 56
    mysmilarstring ||MyawfullySimilarStirng || PartialRatio = 71
    mysmilarstring ||MyawfullySimilarStirng || TokenSortRatio = 56
    mysmilarstring ||MyawfullySimilarStirng || PartialTokenSortRatio = 71
    mysmilarstring ||MyawfullySimilarStirng || TokenSetRatio = 56
    mysmilarstring ||MyawfullySimilarStirng || PartialTokenSetRatio = 71
    mysmilarstring ||MyawfullySimilarStirng || TokenInitialismRatio = 0
    mysmilarstring ||MyawfullySimilarStirng || PartialTokenInitialismRatio = 0
    mysmilarstring ||MyawfullySimilarStirng || WeightedRatio = 64

    -------------------------------FuzzySharp-------------------------------------------------
    mysmilarstring ||myawfullysimilarstirng || Ratio = 72
    mysmilarstring ||myawfullysimilarstirng || PartialRatio = 86
    mysmilarstring ||myawfullysimilarstirng || TokenSortRatio = 72
    mysmilarstring ||myawfullysimilarstirng || PartialTokenSortRatio = 86
    mysmilarstring ||myawfullysimilarstirng || TokenSetRatio = 72
    mysmilarstring ||myawfullysimilarstirng || PartialTokenSetRatio = 86
    mysmilarstring ||myawfullysimilarstirng || TokenInitialismRatio = 0
    mysmilarstring ||myawfullysimilarstirng || PartialTokenInitialismRatio = 0
    mysmilarstring ||myawfullysimilarstirng || WeightedRatio = 77
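
Both questions can be approached by normalizing the strings before scoring. The helper below is a hypothetical sketch, not a FuzzySharp feature: it inserts a space at each lower-to-upper boundary and lowercases the result, so that a white-space tokenizer sees separate, case-folded tokens.

```csharp
using System.Text.RegularExpressions;

public static class TokenPrep
{
    // Inserts a space wherever a lowercase letter or digit is followed by an
    // uppercase letter, then lowercases: "MyDocuments" becomes "my documents".
    public static string SplitCamelCase(string s) =>
        Regex.Replace(s, "(?<=[a-z0-9])(?=[A-Z])", " ").ToLowerInvariant();
}
```

The normalized strings can then be passed to any of the ratio methods. For case folding alone, the overloads that take a PreprocessMode (used with Fuzz.PartialRatio in the first issue on this page) may already be enough.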

partialtokensetratio is wrong

Hi Jake,
Thanks for this good stuff, but I'd like to point out that PartialTokenSetRatio is wrong: it always returns 100. Could you check it out?

Release 1.0.2 on NuGet

I noticed the latest commit bumped the version to 1.0.2 and added compatibility with netstandard & co, however the latest version on NuGet is still 1.0.1.

Could you push 1.0.2 to NuGet? In the meantime I've compiled and packaged 1.0.2 myself and checked it into my project, but having it come from NuGet directly would be much nicer 🙂

Index out of range in TokenInitialismRatio

Hello, I was using your library without any issue until I ran into this case:

I got an exception when using "TokenInitialismRatio" with these two strings: "lusiki plaza share block " (with a trailing space) and "jmft".

Here is the stack trace:
Exception thrown: 'System.IndexOutOfRangeException' in FuzzySharp.dll
System.Transactions Critical: 0 : Blablabal
Index was outside the bounds of the array.
at FuzzySharp.SimilarityRatio.Scorer.StrategySensitive.TokenInitialismScorerBase.<>c.<Score>b__0_0(String s)
at System.Linq.Enumerable.WhereSelectArrayIterator`2.MoveNext()
at System.String.Join[T](String separator, IEnumerable`1 values)
at FuzzySharp.SimilarityRatio.Scorer.StrategySensitive.TokenInitialismScorerBase.Score(String input1, String input2)
Then my code =)

Yours sincerely.
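
The trace points at a lambda inside Score that indexes into a string. A likely cause, stated here as an assumption rather than a reading of the library source, is that splitting on a single space leaves an empty token after the trailing space, and taking `s[0]` of that empty token throws. The sketch below shows the defensive variant of building initials.

```csharp
using System;
using System.Linq;

public static class InitialsSketch
{
    // Builds the initials of a phrase. RemoveEmptyEntries drops the empty token
    // that a trailing (or doubled) space produces; indexing such a token with
    // s[0] would throw IndexOutOfRangeException.
    public static string Initials(string input) =>
        string.Concat(input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s[0]));
}
```

With this guard, `InitialsSketch.Initials("lusiki plaza share block ")` returns "lpsb" instead of throwing. Trimming the input before calling the scorer is a workaround on the caller's side.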

DotNet 4.6 Compatibility

Found your NuGet package and it looks great for my application, but I can't use it with a .NET 4.6+ project. I got this error from the package installer:

Severity Code Description Project File Line Suppression State Error Could not install package 'FuzzySharp 1.0.0'. You are trying to install this package into a project that targets '.NETFramework,Version=v4.6', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author.

Any way to make it compatible?

Thanks

Sandy

PartialRatio not working as expected

The lib does not work as expected if there are multiple matches. I expect the method to return the match with the highest ratio.

string dePN = "Partnernummer";
int ratio = Fuzz.PartialRatio(dePN, text);

Examples of texts that do not behave as expected:
text = "Partne\nrnum\nmerASDFPartnernummerASDF"; => Ratio == 85 (should be 100)
text = "PartnerrrrnummerASDFPartnernummerASDF"; => Ratio == 77 (should be 100)
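
As a workaround sketch, the best-scoring window can be found by brute force: slide a window of the shorter string's length over the longer one and keep the highest `Fuzz.Ratio`. This is a hypothetical helper, not how the library's PartialRatio is implemented, and it costs one ratio computation per window position.

```csharp
using System;
using FuzzySharp;

public static class WindowedPartialRatio
{
    // Scores needle against every equally long window of haystack and
    // returns the best score, stopping early on a perfect match.
    public static int Score(string needle, string haystack)
    {
        if (needle.Length == 0 || needle.Length >= haystack.Length)
            return Fuzz.Ratio(needle, haystack);

        int best = 0;
        for (int start = 0; start + needle.Length <= haystack.Length && best < 100; start++)
        {
            string window = haystack.Substring(start, needle.Length);
            best = Math.Max(best, Fuzz.Ratio(needle, window));
        }
        return best;
    }
}
```

Both example texts contain "Partnernummer" verbatim, so this helper reaches the window where the two strings are identical and returns 100 for them.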

another partialtokensetratio problem

Jake, using PartialTokenSetRatio, str1="Vadeli Tl Bakiyesi" and str2="vadeli tl bakiyesi" are supposed to return 100, but it gives me 83. In the Python version I do get 100. Can you check it out please?

Multithreaded C# application and FuzzySharp - has anyone tried?

Hello,

Is there any reason why FuzzySharp should not be used in a multithreaded C# application? I'm curious whether there are any "lessons learned" or problems with using this package with several Background Workers, each running in its own thread.

Thanks in advance!
Scott

Referenced assembly 'FuzzySharp, Version=1.0.4.0, Culture=neutral, PublicKeyToken=null' does not have a strong name

Currently the NuGet package is not strong named, which causes incompatibilities with other public NuGet packages that are strong named.
This also affects me, since all of our company's projects are strong named.

Warning CS8002 Referenced assembly 'FuzzySharp, Version=1.0.4.0, Culture=neutral, PublicKeyToken=null' does not have a strong name. Foo C:\s\repo\Src\Foo\CSC 1 Active

Microsoft Guidance suggests that public packages be strong named:
https://docs.microsoft.com/en-us/dotnet/standard/library-guidance/strong-naming#create-strong-named-net-libraries

For more information, this repository had pretty much the same issue:
cloudevents/sdk-csharp#24
