Great package! I like the strict default that [...]

Customization possible for loose matching? (cologne_phonetics, closed)

do-me commented on June 9, 2024

Customization possible for loose matching?


Comments (5)

provinzkraut commented on June 9, 2024

The issue you're facing here is unfortunately a limitation of the algorithm and a somewhat common pitfall.

Taking your example of Hardtstraße and Hardstraße: d and t are each encoded as 2 unless they are followed by c, s, or z, in which case they are encoded as 8. The d in Hardtstraße is followed by t and becomes 2, while the d in Hardstraße is followed by s and becomes 8, so the two names get different codes, resulting in an "incorrect" phonetic encoding. My understanding of the original author's reasoning is that this provided a more stable result overall, and that having outliers like this was a good compromise. It may be worth noting that the KPH was originally intended to compare German names specifically, which might explain some of its weaknesses when applied to the German language in general.
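As a rough illustration of the rule described above, the following sketch labels each d or t in a word with the digit it would receive. It is a stand-alone sketch and not the package's code: the helper name `dt_codes` is made up for this example, and it covers only the D/T rule of the Kölner Phonetik as commonly published (8 before c, s, or z, otherwise 2).

```python
def dt_codes(word: str) -> list[tuple[str, str]]:
    """Return (letter, digit) pairs for every d/t in `word`.

    Hypothetical helper: only the D/T rule is modelled here.
    """
    text = word.lower()
    result = []
    for i, ch in enumerate(text):
        if ch not in ("d", "t"):
            continue
        # Look at the following letter, if there is one.
        nxt = text[i + 1] if i + 1 < len(text) else None
        digit = "8" if nxt in ("c", "s", "z") else "2"
        result.append((ch, digit))
    return result


print(dt_codes("Hardtstraße"))  # [('d', '2'), ('t', '8'), ('t', '2')]
print(dt_codes("Hardstraße"))   # [('d', '8'), ('t', '2')]
```

The extra 2 contributed by the d in Hardtstraße is what keeps the two spellings apart.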

Currently I don't have any plans for this package to provide anything beyond a basic implementation of the "Kölner Phonetik" algorithm.
If this algorithm doesn't provide enough accuracy for your use case, I suggest a scoring-based approach with multiple algorithms, such as Metaphone and Soundex in addition to KPH, while giving the KPH a higher score for likely German inputs.
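One way such a scoring approach could look is sketched below. Everything in it is an assumption made for illustration: the function names, the weights, and the simplified Soundex are not part of cologne_phonetics. The second encoder, `casefold_key`, is only a stand-in for the slot where a KPH wrapper (returning a single code string) would go with the higher weight.

```python
def soundex(word: str) -> str:
    """Simplified American Soundex (letter + three digits)."""
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = []
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (letters[0].upper() + "".join(out) + "000")[:4]


def casefold_key(word: str) -> str:
    """Stand-in encoder; a KPH wrapper would take this slot."""
    return word.casefold()


def match_score(a: str, b: str, encoders) -> float:
    """Weighted share of encoders whose codes agree for a and b."""
    total = sum(weight for _, weight in encoders)
    agreed = sum(weight for encode, weight in encoders
                 if encode(a) == encode(b))
    return agreed / total


# Hypothetical weights: the second slot counts double.
ENCODERS = [(soundex, 1.0), (casefold_key, 2.0)]
print(match_score("Schmidt", "Schmitt", ENCODERS))
```

A threshold on the resulting score then decides whether two inputs count as a match; where to set it depends on how costly false hits are.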


provinzkraut commented on June 9, 2024

On second thought, it might be viable to add some sort of functionality to disable/enable certain encoding rules, essentially creating your own lookup table. You'd just have to be aware of the implications this has for other cases.
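To make the idea concrete, here is a small sketch of what switchable rules could look like. It is not the package's API: the `RULES` table, the rule names, and `apply_rules` are hypothetical, and only the two D/T rules are modelled, so every other letter is left untouched.

```python
import re

# Hypothetical rule table: (name, pattern, replacement), applied in order.
RULES = [
    ("dt_before_csz", re.compile(r"[dt](?=[csz])"), "8"),
    ("dt_default", re.compile(r"[dt]"), "2"),
]


def apply_rules(word: str, disabled=frozenset()) -> str:
    """Apply every rule whose name is not in `disabled`."""
    text = word.lower()
    for name, pattern, replacement in RULES:
        if name not in disabled:
            text = pattern.sub(replacement, text)
    # Collapse runs of the same digit, as the algorithm does for codes.
    return re.sub(r"(\d)\1+", r"\1", text)


# Default rules keep the two spellings apart.
print(apply_rules("Hardtstraße"))  # har28s2raße
print(apply_rules("Hardstraße"))   # har8s2raße

# With the "before c, s, z" rule switched off they coincide.
off = {"dt_before_csz"}
print(apply_rules("Hardtstraße", off))  # har2s2raße
print(apply_rules("Hardstraße", off))   # har2s2raße
```

The implication for other cases is visible right away: with the rule disabled, every d or t before c, s, or z is treated like any other d or t, for all inputs and not just this pair.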
Would something like that work for you?


do-me commented on June 9, 2024

Certainly, it would. One could probably already do so by adding or removing some of your regex rules in RGX_RULES, but that might break parts of your logic, wouldn't it?

My overall aim is simply to be able to buffer the many spelling variants of street names, particularly those containing names with many different versions, like:

meier = ["meier", "maier", "mair", "mayer", "mayr", "meyer", "meyr"]
erhart = ["erhardt", "erhard", "erhart"]
schmitt = ["schmidt", "schmitt", "schmit", "schmid"]

...

In fact, for difficult ones like meyr, phonetic matching might not be sufficient, but for schmitt and erhart I really would like to make this work.


do-me commented on June 9, 2024

Reading through your comment and the Wikipedia articles again, I agree with your first comment that it's probably an acceptable flaw for more stable overall results.
Maybe I would need to split the compound words first and then run the algorithm to get better results.
However, that would be pointless (e.g. with this package) if the right noun cannot be looked up due to spelling mistakes or missing variants...
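For street names specifically, a crude form of the splitting mentioned above is to cut off a known street-type ending before encoding. The sketch below assumes a small, hand-picked suffix list; both the list and the `split_street` helper are hypothetical and no substitute for a real compound splitter.

```python
# Hypothetical, incomplete list of street-type endings.
SUFFIXES = ("straße", "strasse", "weg", "platz", "allee", "gasse")


def split_street(name: str) -> tuple[str, str]:
    """Split a street name into (stem, suffix); suffix is '' if none matches."""
    lowered = name.lower().strip()
    # Try longer suffixes first so the most specific ending wins.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return lowered[: -len(suffix)].rstrip(" -"), suffix
    return lowered, ""


print(split_street("Herrenmattstraße"))  # ('herrenmatt', 'straße')
print(split_street("Herrenstraße"))      # ('herren', 'straße')
```

The stems can then be encoded and compared on their own, so the shared ending no longer dominates the comparison.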

I'd definitely prefer an easier solution based on the Kölner Phonetik with just a few tweaks. I'm not sure whether anyone has ever made a modification like this before.

Otherwise, I could go radical and globally replace, e.g., all dt, tt, and d with t, and do the same with th and other edge cases I might encounter.
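The radical variant could be as small as one regular expression applied before any phonetic encoding. This is a sketch of that idea only; the `normalize` helper and the exact set of replacements are assumptions taken from the examples in this thread.

```python
import re

# Longer alternatives come first so "dt" is not consumed as a lone "d".
_VARIANTS = re.compile(r"dt|tt|th|d")


def normalize(name: str) -> str:
    """Lowercase `name` and fold dt, tt, th, and d into a single t."""
    return _VARIANTS.sub("t", name.lower())


erhart = ["erhardt", "erhard", "erhart"]
schmitt = ["schmidt", "schmitt", "schmit", "schmid"]

print({normalize(s) for s in erhart})   # {'erhart'}
print({normalize(s) for s in schmitt})  # {'schmit'}
print(normalize("Hardtstraße") == normalize("Hardstraße"))  # True
```

The normalized string can then be passed to `cologne_phonetics.encode(s, concat=True)` as before. The price is that genuinely different words such as Rad and Rat are folded together as well.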


do-me commented on June 9, 2024

After more tests, I realized that in my case Kölner Phonetik is just not fit for purpose. It might work great for searching, where false hits are not a big deal, but it doesn't work for actual matching. Just a few examples why:

Based on this command

import cologne_phonetics
cologne_phonetics.encode(s, concat=True)

where s is the string.

Herrenmattstraße == Herrenstraße
Gerenotstraße == Kornstraße
Zellerstraße == Schillerstraße

The phonetic similarity of Z and Sch, as well as of e and i, makes sense to me, so treating Zellerstraße and Schillerstraße as equal is legit. However, Herrenmattstraße cannot equal Herrenstraße for matching, only for searching.

Closing this issue as my matching premise is simply out of scope for Kölner Phonetik.
