Comments (5)

provinzkraut avatar provinzkraut commented on June 9, 2024

The issue you're facing here is unfortunately a limitation of the algorithm and a somewhat common pitfall.

Taking your example of Hardtstraße and Hardstraße: d and t get encoded as 2 if not followed by an s. If followed by an s, they get encoded as 8. In Hardtstraße the d is followed by a t, while in Hardstraße it is followed directly by the s, so the two names end up with different codes, resulting in an "incorrect" phonetic encoding. My understanding of the original author's reasoning is that this rule gave more stable results overall and that outliers like this were an acceptable compromise. It may be worth noting that the Kölner Phonetik (KPH) was originally intended for comparing German names specifically, which might explain some of its weaknesses when applied to the German language in general.

Currently, I don't have any plans for this package to provide anything beyond a basic implementation of the "Kölner Phonetik" algorithm.
If this algorithm doesn't provide enough accuracy for your use case, I suggest a scoring-based approach with multiple algorithms, such as Metaphone and Soundex in addition to KPH, while giving KPH a higher weight for inputs that are likely German.
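
A minimal sketch of what such a scoring-based approach could look like. Everything here is an assumption for illustration: `weighted_match_score` is a hypothetical helper, the `soundex` function is a simplified stand-in for a real library implementation, and `str.casefold` merely plays the role of a second "encoder". In practice, a wrapper around this package's encoder would be added to the list with a larger weight.

```python
from typing import Callable, Iterable, Tuple

Encoder = Callable[[str], str]


def soundex(word: str) -> str:
    """Simplified American Soundex; a stand-in for a real implementation."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (result + "000")[:4]


def weighted_match_score(a: str, b: str,
                         encoders: Iterable[Tuple[Encoder, float]]) -> float:
    """Share of the total weight held by encoders whose codes for a and b agree."""
    total = 0.0
    agreed = 0.0
    for encode, weight in encoders:
        total += weight
        if encode(a) == encode(b):
            agreed += weight
    return agreed / total if total else 0.0


# Hypothetical weights: the exact-match "encoder" counts twice as much here.
encoders = [(soundex, 1.0), (str.casefold, 2.0)]
score = weighted_match_score("Schmidt", "Schmitt", encoders)
```

The weights and the acceptance threshold applied to the resulting score would have to be tuned on real data.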

from cologne_phonetics.

provinzkraut avatar provinzkraut commented on June 9, 2024

On second thought, it might be viable to add some sort of functionality to enable or disable certain encoding rules, essentially letting you create your own lookup table. You'd just have to be aware of the implications this has for other cases.
Would something like that work for you?
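
To make the idea concrete, here is a toy sketch of a switchable rule table. It is not this package's API or its actual rule set: `NAMED_RULES`, `build_rules` and `toy_encode` are hypothetical names, and the handful of rules only mimics the d/t behaviour described above.

```python
import re
from typing import Dict, Iterable, Tuple

# Hypothetical named rules (pattern, replacement), applied in insertion order.
# Toy rules for illustration only, NOT the rule table used by the package.
NAMED_RULES: Dict[str, Tuple[str, str]] = {
    "dt_before_csz": (r"[dt](?=[csz])", "8"),
    "dt_default": (r"[dt]", "2"),
    "s_z": (r"[sz]", "8"),
    "r": (r"r", "7"),
}

_REPEATED_DIGIT = re.compile(r"(\d)\1+")


def build_rules(disabled: Iterable[str] = ()):
    """Compile all named rules except the ones listed in `disabled`."""
    skip = set(disabled)
    return [(re.compile(pattern), repl)
            for name, (pattern, repl) in NAMED_RULES.items()
            if name not in skip]


def toy_encode(text: str, rules) -> str:
    """Apply the rules in order, then collapse runs of identical digits."""
    text = text.lower()
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return _REPEATED_DIGIT.sub(r"\1", text)


default_rules = build_rules()
relaxed_rules = build_rules(disabled=["dt_before_csz"])

# With the special case active, the two spellings get different codes;
# with it disabled, they collapse to the same code.
differs = toy_encode("Hardtstraße", default_rules) != toy_encode("Hardstraße", default_rules)
same = toy_encode("Hardtstraße", relaxed_rules) == toy_encode("Hardstraße", relaxed_rules)
```

Disabling a rule changes the codes of every other word that rule used to affect, which is the implication mentioned above.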

do-me avatar do-me commented on June 9, 2024

Certainly, it would. One could probably already do so by adding or removing some of your regex rules in RGX_RULES, but that might break parts of your logic, wouldn't it?

My overall aim is simply to be able to tolerate many spelling variants of street names, particularly those containing names with many different versions, like:

meier = ["meier", "maier", "mair", "mayer", "mayr", "meyer", "meyr"]
erhart = ["erhardt", "erhard", "erhart"]
schmitt = ["schmidt", "schmitt", "schmit", "schmid"]

...

In fact, for difficult ones like meyr, phonetic matching might not be sufficient, but for schmitt and erhart I would really like to make this work.

do-me avatar do-me commented on June 9, 2024

Reading through your comment and the Wikipedia articles again, I agree with your first comment that it's probably an acceptable flaw in exchange for more stable overall results.
Maybe I would need to split the compound words first and then run the algorithm on the parts to get better results.
However, that would be pointless (e.g. with this package) if the right noun cannot be looked up due to spelling mistakes or missing variants...
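
For street names specifically, a rough sketch of splitting without any dictionary lookup: strip a known street-type suffix and encode only the remaining stem. The suffix list and the `split_street` helper are assumptions for illustration, not part of any package.

```python
from typing import Optional, Tuple

# Assumed, non-exhaustive list of street-type suffixes.
STREET_SUFFIXES = ("straße", "strasse", "gasse", "allee", "platz", "weg", "ring")


def split_street(name: str) -> Tuple[str, Optional[str]]:
    """Split a street name into (stem, suffix); suffix is None if none matched."""
    lowered = name.strip().lower()
    for suffix in STREET_SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            stem = lowered[: -len(suffix)].rstrip(" -")
            return stem, suffix
    return lowered, None


stem_a, suffix_a = split_street("Hardtstraße")
stem_b, suffix_b = split_street("Hardstraße")
```

After splitting, the t in hardt is no longer followed by the s of straße, so the special case discussed earlier should not apply to the stem. Stems that are compounds themselves would still need a real splitter.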

I'd definitely prefer an easier solution based on the Kölner Phonetik with just a few tweaks. Not sure whether anyone has ever done a modification like this before.

Otherwise, I could also go radical and globally replace e.g. all dt, tt and d with t, and do the same with th and other edge cases I might encounter.
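
A quick sketch of that radical variant as a normalisation step that would run before any phonetic encoding. `normalize_t_sounds` is a hypothetical helper, and the pattern covers only the replacements named above.

```python
import re

# dt, tt, th and single d are all collapsed to t (longer alternatives first).
_T_VARIANTS = re.compile(r"dt|tt|th|d")


def normalize_t_sounds(text: str) -> str:
    """Lowercase the text and replace dt, tt, th and d with t."""
    return _T_VARIANTS.sub("t", text.lower())


schmitt = ["schmidt", "schmitt", "schmit", "schmid"]
erhart = ["erhardt", "erhard", "erhart"]

schmitt_keys = {normalize_t_sounds(name) for name in schmitt}
erhart_keys = {normalize_t_sounds(name) for name in erhart}
```

The price is that genuinely different words get merged as well, for example Rad and Rat.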

do-me avatar do-me commented on June 9, 2024

After more tests, I realized that in my case Kölner Phonetik is just not fit for purpose. It might work great for searching, where false hits are not a big deal, but it doesn't work for actual matching. Just a few examples of why:

Based on this command

import cologne_phonetics
cologne_phonetics.encode(s, concat=True)

where s is the string.

Herrenmattstraße == Herrenstraße
Gerenotstraße == Kornstraße
Zellerstraße == Schillerstraße

The phonetic similarity of Z and Sch as well as of e and i makes sense to me, so Zellerstraße equalling Schillerstraße is legitimate. However, Herrenmattstraße equalling Herrenstraße is acceptable only for searching, not for matching.
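
One way to picture the difference between searching and matching is a two-stage check: the phonetic code only selects candidates, and a stricter string comparison has to confirm them. The sketch below is an assumption-laden illustration, not something this package offers: `confirm_matches` and the 0.9 threshold are hypothetical, and `phonetic_key` is any function returning a code for a string (a deliberately coarse stand-in is used here).

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


def confirm_matches(query: str, candidates: Iterable[str],
                    phonetic_key: Callable[[str], str],
                    min_ratio: float = 0.9) -> List[str]:
    """Keep candidates that share the query's phonetic key AND are similar in spelling."""
    key = phonetic_key(query)
    hits = []
    for candidate in candidates:
        if phonetic_key(candidate) != key:
            continue  # not even a search hit
        ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if ratio >= min_ratio:
            hits.append(candidate)
    return hits


def coarse_key(text: str) -> str:
    """Stand-in key (first letter only), so every H-street is a search hit."""
    return text[:1].lower()


close = confirm_matches("Hardtstraße", ["Hardstraße", "Herrenstraße"], coarse_key)
far = confirm_matches("Herrenmattstraße", ["Herrenstraße"], coarse_key)
```

With this threshold, Hardstraße is confirmed for Hardtstraße, while Herrenstraße is rejected for Herrenmattstraße even though both pass the first stage.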

Closing this issue as my matching premise is simply out of scope for Kölner Phonetik.
