jakebayer / fuzzysharp
C# .NET fuzzy string matching implementation of SeatGeek's well-known Python FuzzyWuzzy algorithm.
License: MIT License
Any idea on performance of FuzzySharp vs. Fastenshtein?
I tried to write a search application demo (WPF UI) with this library, but found that the UI lagged on input.
Here is my code: I have dispatched the search part of the logic to a non-UI thread and, in fact, the algorithm is fast enough not to block the UI.
Observable
    .FromEvent<string>(
        handler => SearchTextChanged += handler,
        handler => SearchTextChanged -= handler)
    .ObserveOn(NewThreadScheduler.Default)
    .Throttle(TimeSpan.FromMilliseconds(200))
    .Select(text => text.Replace(" ", string.Empty).ToLowerInvariant())
    .Select(text => _fullCollection
        .Select(item => (
            entity: item,
            radio: new[]
            {
                Fuzz.PartialRatio(text, item.Name, FuzzySharp.PreProcess.PreprocessMode.Full),
                Fuzz.PartialRatio(text, item.Alias),
                Fuzz.Ratio(text, item.Id),
                //item.Name.Contains(text) ? 100 : 0,
                //item.Alias.Contains(text) ? 100 : 0,
                //item.Id.Contains(text) ? 100 : 0,
            }.Max()))
        .Where(item => item.radio > 50)
        .OrderByDescending(item => item.radio == 100 ? item.entity.Frequency : item.radio)
        .Take(10))
    .ObserveOnDispatcher(DispatcherPriority.Background)
    .Subscribe(searchResult =>
    {
        SearchResult.CanNotify = false;
        SearchResult.Clear();
        foreach ((BrandCodeEntity entity, int radio) item in searchResult)
        {
            item.entity.Radio = item.radio;
            SearchResult.Add(item.entity);
        }
        SearchResult.CanNotify = true;
    });
So I ran a performance analysis and found that executing the algorithm causes 1 second to be spent in GC.
If I use string.Contains() instead, I don't have this problem.
I'm not sure whether it's really logical, but the other implementations of IRatioScorer can score blank and empty strings without throwing an exception, whereas the TokenAbbreviationScorerBase classes throw one.
We would like to use FuzzySharp for string comparison in some non-Latin languages, including Greek and Russian.
Please can you confirm that the best way to use FuzzySharp for this purpose is the Extract methods in the Process class, using the parameter (s) => s?
Many thanks
Hi,
Could I get a list of indices for the match result?
For example,
Fuzz.TokenInitialismRatio("NASA", "National Aeronautics and Space Administration");
could return
result(89, [0, 10, 26, 32])
where the first number is the score and the array lists the indices of the characters N, A, S, A in "National Aeronautics and Space Administration".
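A minimal sketch of how such indices could be computed outside the library. InitialIndexSketch and MatchInitials are hypothetical names, not part of FuzzySharp, and the greedy in-order matching is an assumption about what the request wants. It reports 0-based offsets and skips tokens, such as "and", whose initial does not match the next letter of the initialism.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of FuzzySharp.
public static class InitialIndexSketch
{
    // For each letter of the initialism, in order, records the 0-based
    // offset of the next whitespace-separated token in the phrase whose
    // first character matches that letter (case-insensitive).
    public static List<int> MatchInitials(string initialism, string phrase)
    {
        var result = new List<int>();
        int next = 0; // position of the next unmatched letter in the initialism
        for (int i = 0; i < phrase.Length && next < initialism.Length; i++)
        {
            bool startsToken = !char.IsWhiteSpace(phrase[i])
                && (i == 0 || char.IsWhiteSpace(phrase[i - 1]));
            if (startsToken
                && char.ToUpperInvariant(phrase[i]) == char.ToUpperInvariant(initialism[next]))
            {
                result.Add(i);
                next++;
            }
        }
        return result;
    }
}
```

For "NASA" and the phrase above, this sketch yields the 0-based offsets 0, 9, 25, 31. It returns fewer indices than letters when not every letter finds a matching token.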
Is there a way, when searching, to attach a value that is ignored by the matching but serves as a key for the result?
In my case I have some entities from which I extract data and run a search. When I get the result, I don't know which entity the matched data belongs to.
Currently the Process.Extract... methods have 2 signatures:
1: string query, IEnumerable<string> choices:
public static IEnumerable<ExtractedResult<string>> ExtractAll(
    string query,
    IEnumerable<string> choices,
    Func<string, string> processor = null,
    IRatioScorer scorer = null,
    int cutoff = 0)
and 2: T query, IEnumerable<T> choices:
public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
    T query,
    IEnumerable<T> choices,
    Func<T, string> processor,
    IRatioScorer scorer = null,
    int cutoff = 0)
In my case the user enters a string to filter a List<T> of objects.
I can use 1 if I convert to string first, collect the results to HashSet<string>, and use that to filter the original List<T>:
public static IEnumerable<Dto> Example1(string query, IEnumerable<Dto> list)
{
    var set = Process.ExtractAll(query, list.Select(x => x.Name))
        .Select(result => result.Value)
        .ToImmutableHashSet();
    return list.Where(dto => set.Contains(dto.Name));
}
Or 2 if I create a dummy T query object from the string entered by the user:
public static IEnumerable<Dto> Example2(string query, IEnumerable<Dto> list)
{
    var dummy = new Dto(query);
    return Process.ExtractAll(dummy, list, dto => dto.Name)
        .Select(result => result.Value);
}
The 2nd one isn't that bad... but tbh I struggle to think of a case where you would have a T query? Especially since the Func<T, string> processor is required for this overload.
So I think a signature like this would be useful:
public static IEnumerable<ExtractedResult<T>> ExtractAll<T>(
    string query,
    IEnumerable<T> choices,
    Func<T, string> processor,
    IRatioScorer scorer = null,
    int cutoff = 0)
To be used like:
public static IEnumerable<Dto> Example3(string query, IEnumerable<Dto> list)
{
    return Process.ExtractAll(query, list, dto => dto.Name)
        .Select(result => result.Value);
}
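Until such an overload exists, the same shape can be approximated in user code. The sketch below is an assumption-laden illustration, not the library's implementation: ExtractSketch is a hypothetical class, it returns plain tuples instead of ExtractedResult<T>, and the scorer is passed in as a delegate so the block has no dependency on FuzzySharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of FuzzySharp.
public static class ExtractSketch
{
    // Scores a plain string query against objects of type T.
    // The processor extracts the text to compare from each choice;
    // the original object travels with its score, so it acts as the key.
    public static IEnumerable<(T Value, int Score)> ExtractAll<T>(
        string query,
        IEnumerable<T> choices,
        Func<T, string> processor,
        Func<string, string, int> scorer,
        int cutoff = 0)
    {
        return choices
            .Select(choice => (Value: choice, Score: scorer(query, processor(choice))))
            .Where(result => result.Score >= cutoff);
    }
}
```

With FuzzySharp referenced, the scorer argument could be a lambda such as (a, b) => Fuzz.Ratio(a, b). Because each result carries the original object, the caller always knows which entity a match belongs to.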
I think the functionality is a perfect candidate to be used as a CLI tool, specifically as a dotnet CLI tool: https://docs.microsoft.com/en-us/dotnet/core/tools/global-tools
What do you think?
It reads: TODO: Is this right?
Very scary. Please remove or fix.
Using the WeightedRatioScorer for two different strings, one of them gives an unexpected result.
+30.0% Damage to Close Enemies [30.01%
+14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%
Scorer: WeightedRatioScorer
Input 1: +30.0% Damage to Close Enemies [30.01%
Main: (string: +#% Damage, score: 90, index: 0)
Main: (string: +#% Damage to Close Enemies, score: 90, index: 2)
Main: (string: +#% Damage to Chilled Enemies, score: 77, index: 3)
Main: (string: +#% Damage to Poisoned Enemies, score: 75, index: 4)
Main: (string: +#% Damage to Crowd Controlled Enemies, score: 67, index: 1)
Main: (string: +#% Cold Damage, score: 61, index: 8)
Main: (string: #% Damage Reduction from Bleeding Enemies, score: 59, index: 6)
Main: (string: #% Damage Reduction, score: 50, index: 7)
Main: (string: #% Block Chance#% Blocked Damage Reduction, score: 48, index: 5)
Elapsed time: 39
---
Scorer: WeightedRatioScorer
Input 2: +14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%
Main: (string: +#% Damage to Crowd Controlled Enemies, score: 90, index: 1)
Main: (string: +#% Damage to Close Enemies, score: 73, index: 2)
Main: (string: +#% Damage to Chilled Enemies, score: 69, index: 3)
Main: (string: +#% Damage to Poisoned Enemies, score: 68, index: 4)
Main: (string: +#% Cold Damage, score: 61, index: 8)
Main: (string: +#% Damage, score: 60, index: 0)
Main: (string: #% Damage Reduction from Bleeding Enemies, score: 56, index: 6)
Main: (string: #% Damage Reduction, score: 50, index: 7)
Main: (string: #% Block Chance#% Blocked Damage Reduction, score: 40, index: 5)
Elapsed time: 0
---
For some reason, Input 1 gives +#% Damage a score of 90, while for Input 2 it works as expected and +#% Damage gets a score of 60.
Here is the source to reproduce the issue.
using System;
using System.Collections.Generic;
using System.Reflection;
using FuzzySharp;
using FuzzySharp.SimilarityRatio;
using FuzzySharp.SimilarityRatio.Scorer.Composite;

namespace FuzzySharpTest
{
    internal class Program
    {
        static void Main(string[] args)
        {
            string input1 = "+30.0% Damage to Close Enemies [30.01%";
            string input2 = "+14.3% Damage to Crowd Controlled Enemies [7.5 - 18.0]%";
            List<string> choices = new List<string>()
            {
                "+#% Damage",
                "+#% Damage to Crowd Controlled Enemies",
                "+#% Damage to Close Enemies",
                "+#% Damage to Chilled Enemies",
                "+#% Damage to Poisoned Enemies",
                "#% Block Chance#% Blocked Damage Reduction",
                "#% Damage Reduction from Bleeding Enemies",
                "#% Damage Reduction",
                "+#% Cold Damage"
            };

            // WeightedRatioScorer - input1
            var watch = System.Diagnostics.Stopwatch.StartNew();
            Console.WriteLine("Scorer: WeightedRatioScorer");
            Console.WriteLine($"Input 1: {input1}");
            Console.WriteLine(string.Empty);
            var results = Process.ExtractTop(input1, choices, scorer: ScorerCache.Get<WeightedRatioScorer>(), limit: 9);
            foreach (var r in results)
            {
                Console.WriteLine($"{MethodBase.GetCurrentMethod()?.Name}: {r}");
            }
            watch.Stop();
            var elapsedMs = watch.ElapsedMilliseconds;
            Console.WriteLine($"Elapsed time: {elapsedMs}");
            Console.WriteLine("---");

            // WeightedRatioScorer - input2
            watch = System.Diagnostics.Stopwatch.StartNew();
            Console.WriteLine("Scorer: WeightedRatioScorer");
            Console.WriteLine($"Input 2: {input2}");
            Console.WriteLine(string.Empty);
            results = Process.ExtractTop(input2, choices, scorer: ScorerCache.Get<WeightedRatioScorer>(), limit: 9);
            foreach (var r in results)
            {
                Console.WriteLine($"{MethodBase.GetCurrentMethod()?.Name}: {r}");
            }
            watch.Stop();
            elapsedMs = watch.ElapsedMilliseconds;
            Console.WriteLine($"Elapsed time: {elapsedMs}");
            Console.WriteLine("---");
        }
    }
}
First of all, thanks for the awesome library! 💯
I have a couple of questions. It would be a great help if you could answer them.
How is the tokenization done? Based on whitespace, as far as I can tell from browsing the code. Is there a way to direct the scorer to split camel case tokens? For example, the string MyDocuments would be tokenized to ["My", "Documents"].
I do not see any parameter to direct the scorer to score in a case-insensitive manner. Is it not possible, or am I missing something?
Below are the scores for a couple of string pairs, (mysmilarstring, MyawfullySimilarStirng) and (mysmilarstring, myawfullysimilarstirng). The scores differ between the pairs even though the pairs differ only in letter casing.
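One way to get both behaviors without changing the scorer is to normalize the strings before scoring. The following is a rough sketch under that assumption. TokenNormalizer and its methods are hypothetical names, not part of FuzzySharp, and the regex only handles simple lower-to-upper boundaries (it does not split runs of capitals such as "XMLParser").

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of FuzzySharp.
public static class TokenNormalizer
{
    // Zero-width position between a lowercase letter or digit and an uppercase letter.
    private static readonly Regex CamelBoundary = new Regex("(?<=[a-z0-9])(?=[A-Z])");

    // "MyDocuments" -> "My Documents", so whitespace tokenization sees two tokens.
    public static string SplitCamelCase(string input) =>
        CamelBoundary.Replace(input, " ");

    // Splits camel case first, then lowercases, so casing no longer affects the score.
    public static string Normalize(string input) =>
        SplitCamelCase(input).ToLowerInvariant();
}
```

Both inputs would be passed through Normalize before calling a scorer. The split has to happen before lowercasing, because the casing is what marks the token boundaries.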
Hi Jake,
thanks for this good stuff, but I'd like to point out that PartialTokenSetRatio is wrong: it always returns 100%. You'd better check it out.
I noticed the latest commit bumped the version to 1.0.2 and added compatibility with netstandard & co, however the latest version on NuGet is still 1.0.1.
Could you push 1.0.2 to NuGet? In the meantime I've compiled and packaged 1.0.2 myself and checked it into my project, but having it come from NuGet directly would be much nicer.
Hello, I was using your library without any issue until I ran into this case:
I got an exception when using "TokenInitialismRatio" with these two strings: "lusiki plaza share block " (with a trailing space) and "jmft".
Here is the StackTrace:
Exception thrown: 'System.IndexOutOfRangeException' in FuzzySharp.dll
System.Transactions Critical: 0 : Blablabal
Index was outside the bounds of the array.
   at FuzzySharp.SimilarityRatio.Scorer.StrategySensitive.TokenInitialismScorerBase.<>c.<Score>b__0_0(String s)
   at System.Linq.Enumerable.WhereSelectArrayIterator`2.MoveNext()
   at System.String.Join[T](String separator, IEnumerable`1 values)
   at FuzzySharp.SimilarityRatio.Scorer.StrategySensitive.TokenInitialismScorerBase.Score(String input1, String input2)
Then my code =)
Yours sincerely.
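The trace points at a per-token lambda, so a plausible cause (an assumption, not confirmed against the library source) is that splitting on a single space leaves an empty token after the trailing space, and indexing its first character throws. A small sketch of that failure mode and a guard against it, using a hypothetical InitialismSketch class that is not part of FuzzySharp:

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of FuzzySharp.
public static class InitialismSketch
{
    // Builds the initials of a phrase. RemoveEmptyEntries drops the empty
    // token that a leading, trailing, or doubled space would otherwise
    // produce, so indexing t[0] is always safe.
    public static string Initials(string input)
    {
        var tokens = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Concat(tokens.Select(t => t[0]));
    }
}
```

By contrast, "lusiki plaza share block ".Split(' ') ends with an empty string, and reading index 0 of an empty string throws IndexOutOfRangeException. Trimming the input before calling the scorer is a caller-side workaround along the same lines.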
Found your NuGet package and it looks great for my application, but I can't use it with a .NET Framework 4.6+ project. I got this error from the package installer:
Severity Code Description Project File Line Suppression State Error Could not install package 'FuzzySharp 1.0.0'. You are trying to install this package into a project that targets '.NETFramework,Version=v4.6', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author.
Any way to make it compatible?
Thanks
Sandy
The lib does not work as expected if there are multiple matches. I expect the method to return the match with the highest ratio.
string dePN = "Partnernummer";
int ratio = Fuzz.PartialRatio(dePN, text);
Examples of texts that do not behave as expected:
text = "Partne\nrnum\nmerASDFPartnernummerASDF"; => Ratio == 85 (should be 100)
text = "PartnerrrrnummerASDFPartnernummerASDF"; => Ratio == 77 (should be 100)
Jake, using PartialTokenSetRatio, str1="Vadeli Tl Bakiyesi" and str2="vadeli tl bakiyesi" are supposed to return 100, but it gives me 83. In the Python version I do get 100. Can you check it out, please?
Upgrade libraries to include declared support for .NET Core 3, .NET Core 3.1 and .NET Standard 2.1.
Hello,
Is there any reason why FuzzySharp should not be used in a multithreaded C# application? I'm curious if there are any "lessons learned" or problems with using this package with several Background Workers, each running in its own thread.
Thanks in advance!
Scott
Currently the NuGet package is not strong-named, which causes incompatibilities with other public NuGet packages that are strong-named.
This also affects me, since all of our company's projects are strong-named.
Warning CS8002 Referenced assembly 'FuzzySharp, Version=1.0.4.0, Culture=neutral, PublicKeyToken=null' does not have a strong name. Foo C:\s\repo\Src\Foo\CSC 1 Active
Microsoft Guidance suggests that public packages be strong named:
https://docs.microsoft.com/en-us/dotnet/standard/library-guidance/strong-naming#create-strong-named-net-libraries
For more information, this repository had pretty much the same issue:
cloudevents/sdk-csharp#24