Comments (7)

GoogleCodeExporter commented on August 30, 2024
You are quite right. I started on the PersonNameComparator, but never actually 
finished it, because I lost faith in it. It seemed to become a mass of special 
cases with no real justification or overarching theory. In practice, it was 
superseded by JaroWinkler and Levenshtein, which I think do a better job.

I'm tempted to just delete the whole class, and instead focus on 
industry-standard comparators.

What do you think?

Original comment by [email protected] on 28 Oct 2011 at 6:40

from duke.

GoogleCodeExporter commented on August 30, 2024
 I am hitting a few cases in name comparisons where the only difference is in the first names, those names are short, and I need to be stricter there. I agree: I liked JaroWinklerTokenized most, and it worked pretty well with full names. But I am trying to make my matching stricter, so I have switched to comparing first and last names separately, and on short first names it is too optimistic, often giving me a positive answer where I need a no-match, e.g. Dave vs. Dale. Levenshtein does not work that well for those short names either, as the distance is only 1 or 2. What I liked about the PersonNameComparator is that it still used Levenshtein but adjusted the metric for short terms, which seems to address my challenge.

 Do you have any better suggestions? I don't want to switch to exact matching, but I would like to make either Levenshtein or JaroWinkler stricter on short terms.

Thanks

Original comment by [email protected] on 28 Oct 2011 at 7:14

GoogleCodeExporter commented on August 30, 2024
Actually I'll give you a few examples:

Levenshtein:

Alex vs. Alexey  - 0.55 (out of 0.6), I like this
Paul vs. Mr. Paul - no-match, I'd like a match here but it is ok, I probably 
need to clean the Mr.
Joseph vs. John D. - no-match, I would not mind a match here, but it is good
Jessica vs. Jessica Redmond - no-match, I would not mind a match, but it is good
Sam vs. Samuel - no-match, I would prefer a match here
Dale vs. Dawn - 0.55 (out of 0.6), almost an exact match and I don't like this!
Solman vs. Lonnette - no match, here it is doing what I want
coakley iii vs. coakley - no match, I would prefer some match

JaroWinklerTokenized:

Alex vs. Alexey  - 0.59, I like this but too high
Paul vs. Mr. Paul - 0.6, I like it but too high (non-tokenized gives no-match)
Joseph vs. John D. - 0.57, I like this but too high on the last names of these 
people (Cassata vs. Cannon) 0.59 out of 0.65
Jessica vs. Jessica Redmond - 0.6, I guess due to tokenizing
Sam vs. Samuel - 0.59, I like this
James vs. Sue - 0.55, seems too high! Levenshtein was better - no match
judy vs. Jim - 0.56, too high for my needs! Levenshtein was better - no match
Dale vs. Dawn - 0.57, too high for my needs! Levenshtein had the same issue!
Solman vs. Lonnette - 0.55, I'd like no-match here. Levenshtein was better - no 
match
wayne vs. claude - 0.56, seems too high! Levenshtein was better - no match
john vs. jake - 0.55. maybe ok?
brooks vs. b - 0.61 (out of 0.65) I kind of liked that (b - initial) but this 
may not be good on many other cases

 So what I see is that JaroWinkler is too optimistic for me. I like Levenshtein better, but I'd like it to be more pessimistic on short names. Basically, if the length of the term could be weighed in, that could help.

 Then I either need to clean the names to remove the various Mr., iii, ... or have a TokenizedLevenshtein which would automatically take care of the uneven number of tokens, even for Jessica vs. Jessica Redmond (where Redmond is a middle name).

 Any ideas on how I could achieve that? Maybe I should just use the adjustments that you put in PersonNameComparator.
I also liked that you tried to match startsWith firstName and initials there. 
All makes sense to me.

Thanks

Original comment by [email protected] on 28 Oct 2011 at 7:53

GoogleCodeExporter commented on August 30, 2024
From the Levenshtein point of view:

> Sam vs. Samuel - no-match, I would prefer a match here
> Dale vs. Dawn - 0.55 (out of 0.6), almost an exact match and I don't like this

the former is a 3 out of 3 difference, whereas the latter is a 2 out of 4 
difference. No wonder that Levenshtein prefers the latter.
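
To make the arithmetic concrete, here is a minimal sketch, assuming the similarity is computed as 1 - distance / (length of the shorter string), which is consistent with the "3 out of 3" and "2 out of 4" figures above. The class and method names are made up for illustration and are not Duke's API.

```java
final class EditDistanceDemo {
    // Classic dynamic-programming edit distance:
    // insert, delete and substitute each cost 1.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: edits relative to the shorter string, capped at 1.
    static double similarity(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return 0.0;
        }
        int dist = Math.min(distance(a, b), shorter);
        return 1.0 - (double) dist / shorter;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Sam", "Samuel")); // 3 edits over 3 chars -> 0.0
        System.out.println(similarity("Dale", "Dawn"));  // 2 edits over 4 chars -> 0.5
    }
}
```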

What you can do is to teach the PersonNameComparator that Sam is a common 
contraction for Samuel. We could try to build up a list of such common 
correspondences.
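
As a rough illustration of such a list, a lookup could be sketched as below. The table contents, class name and method name are all hypothetical; nothing like this is confirmed to exist in PersonNameComparator, and a real list would need to be much larger.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class NameContractions {
    // Hypothetical, deliberately tiny table: full name -> known short forms.
    private static final Map<String, Set<String>> SHORT_FORMS = Map.of(
        "samuel", Set.of("sam"),
        "joseph", Set.of("joe"),
        "david", Set.of("dave"));

    // True if one name is a listed short form of the other (case-insensitive).
    static boolean isContraction(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        return SHORT_FORMS.getOrDefault(x, Set.of()).contains(y)
            || SHORT_FORMS.getOrDefault(y, Set.of()).contains(x);
    }
}
```

A comparator could consult this before falling back to edit distance, so that Sam vs. Samuel is accepted while Dale vs. Dawn is still judged on its edits.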

Another thing you can do is to modify Levenshtein so that it does become more 
demanding on short strings. Shouldn't be hard to either put in a hard limit or 
modify the formula. Just make your own subclass and do it. The actual edit 
distance comparison is a static method you can call from your own comparator.
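
One way the "hard limit" variant might look is sketched here. It takes an already computed edit distance as input, so it does not depend on any particular library call; the threshold of 4 and all names are assumptions chosen for illustration, not values from Duke.

```java
final class ShortStringPenalty {
    // Hypothetical cut-off: strings this short must match exactly.
    static final int SHORT_LIMIT = 4;

    // editDistance is assumed to come from an existing Levenshtein routine.
    static double similarity(int editDistance, int len1, int len2) {
        if (editDistance == 0) {
            return 1.0;
        }
        int shorter = Math.min(len1, len2);
        if (shorter <= SHORT_LIMIT) {
            return 0.0; // hard limit: no tolerance for edits on short strings
        }
        int dist = Math.min(editDistance, shorter);
        return 1.0 - (double) dist / shorter;
    }
}
```

With this cut-off, Dale vs. Dawn (distance 2, length 4) drops to 0.0, but so does Alex vs. Alexey, so a prefix check would have to be layered on top if that pair should still match.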

> Then I either need to clean the names 

You definitely need to clean the names. Handling this kind of thing in the 
comparators is both wrong and slow.
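
A minimal sketch of such a cleaning step, run before any comparator sees the value, could look like this. The list of tokens to drop is a hypothetical starting point based on the examples in this thread (Mr., iii), and the class is standalone rather than tied to any Duke interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NameCleaner {
    // Hypothetical noise tokens: honorifics and generational suffixes.
    private static final Set<String> NOISE =
        Set.of("mr", "mrs", "ms", "dr", "jr", "sr", "ii", "iii", "iv");

    // Lowercases, splits on whitespace, dots and commas, and drops noise tokens.
    static String clean(String raw) {
        List<String> kept = new ArrayList<>();
        for (String token : raw.toLowerCase(Locale.ROOT).split("[\\s.,]+")) {
            if (!token.isEmpty() && !NOISE.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```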

I hope this helps.

Original comment by [email protected] on 31 Oct 2011 at 9:43

GoogleCodeExporter commented on August 30, 2024
 I created my LevenshteinTokenized comparator (tokenized like in JaroWinklerTokenized) with adjustments for short terms and term1 startsWith term2 (like in your PersonNameComparator). Works like a charm :)
I don't have to do cleaning really because the tokenization takes care of that.
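
The comparator itself is not shown in the thread. The sketch below is only a guess at the general shape: each token of the name with fewer tokens is paired with its best-scoring token on the other side, and the scores are averaged. The token scorer is pluggable; `prefixOrExact` is a hypothetical stand-in for the adjusted Levenshtein, and its 0.9 prefix score is an arbitrary illustrative value.

```java
import java.util.Locale;
import java.util.function.ToDoubleBiFunction;

final class TokenizedComparatorSketch {
    // Averages, over the tokens of the shorter name, the best score
    // each token achieves against any token of the longer name.
    static double compare(String a, String b,
                          ToDoubleBiFunction<String, String> tokenSim) {
        if (a.trim().isEmpty() || b.trim().isEmpty()) {
            return 0.0;
        }
        String[] ta = a.trim().toLowerCase(Locale.ROOT).split("\\s+");
        String[] tb = b.trim().toLowerCase(Locale.ROOT).split("\\s+");
        String[] shorter = ta.length <= tb.length ? ta : tb;
        String[] longer = ta.length <= tb.length ? tb : ta;
        double total = 0.0;
        for (String s : shorter) {
            double best = 0.0;
            for (String l : longer) {
                best = Math.max(best, tokenSim.applyAsDouble(s, l));
            }
            total += best;
        }
        return total / shorter.length;
    }

    // Hypothetical token scorer: exact match, then prefix match, else nothing.
    static double prefixOrExact(String x, String y) {
        if (x.equals(y)) {
            return 1.0;
        }
        if (x.startsWith(y) || y.startsWith(x)) {
            return 0.9;
        }
        return 0.0;
    }
}
```

Because extra tokens on the longer side are simply ignored, "Mr. Paul" vs. "Paul" and "Jessica" vs. "Jessica Redmond" score as full matches without a separate cleaning pass, which matches the behaviour described above.
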
Thanks

Original comment by [email protected] on 31 Oct 2011 at 3:57

GoogleCodeExporter commented on August 30, 2024
Very good. I assume this means the issue is solved.

Original comment by [email protected] on 4 Nov 2011 at 10:16

  • Changed state: Fixed
  • Added labels: Component-Comparators

GoogleCodeExporter commented on August 30, 2024
Yes, thank you.

Original comment by [email protected] on 4 Nov 2011 at 4:16
