Comments (5)
The issue you're facing here is unfortunately a limitation of the algorithm and a somewhat common pitfall.
Taking your example of Hardtstraße and Hardstraße, a d or t gets encoded as 2 unless it is followed by an s, in which case it gets encoded as 8, therefore resulting in an "incorrect" phonetic encoding. My understanding of the original author's reasoning is that this provided a more stable result overall and that having outliers like this was a good compromise. It may be worth noting that the KPH was originally intended to compare German names specifically, which might explain some of its weaknesses when applied to the German language in general.
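To make the rule concrete, here is a minimal sketch that applies only the d/t rule to both street names. The name dt_codes is made up for illustration and is not part of cologne_phonetics; the c and z cases are an assumption taken from the published Kölner Phonetik rule table rather than from the explanation above.

```python
# Letters that turn a preceding d/t into code 8 (assumption: taken from the
# published Kölner Phonetik rule table, not from this package's source).
SIBILANT_FOLLOWERS = {"c", "s", "z"}


def dt_codes(word: str) -> list:
    """Return (letter, code) pairs for every d or t in word.

    Only the d/t rule is applied; this is not a full encoder.
    """
    letters = word.lower()
    pairs = []
    for index, letter in enumerate(letters):
        if letter not in ("d", "t"):
            continue
        follower = letters[index + 1 : index + 2]  # "" at the end of the word
        code = "8" if follower in SIBILANT_FOLLOWERS else "2"
        pairs.append((letter, code))
    return pairs


print(dt_codes("Hardtstraße"))  # [('d', '2'), ('t', '8'), ('t', '2')]
print(dt_codes("Hardstraße"))   # [('d', '8'), ('t', '2')]
```

In Hardtstraße the d is followed by a t and so keeps its 2, while in Hardstraße the d sits directly in front of the s and becomes 8. That extra 2 is what makes the two encodings diverge.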
Currently I don't have any plans for this package to provide anything beyond a basic implementation of the "Kölner Phonetik" algorithm.
If this algorithm doesn't provide enough accuracy for your use case, I suggest a scoring-based approach with multiple algorithms, such as Metaphone and Soundex in addition to KPH, while giving the KPH a higher score for likely German inputs.
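A rough sketch of what such a scoring approach could look like. Everything here is hypothetical: weighted_agreement and the two stand-in encoders are invented for illustration, and in practice the stand-ins would be replaced by real KPH, Metaphone and Soundex implementations, with the KPH carrying the larger weight.

```python
from typing import Callable, Dict, Tuple

Encoder = Callable[[str], str]


def weighted_agreement(
    first: str, second: str, encoders: Dict[str, Tuple[Encoder, float]]
) -> float:
    """Return the share of total weight held by encoders that agree on both inputs."""
    total = sum(weight for _, weight in encoders.values())
    if total <= 0:
        return 0.0
    agreed = sum(
        weight
        for encode, weight in encoders.values()
        if encode(first) == encode(second)
    )
    return agreed / total


# Stand-in encoders, for illustration only. They are NOT phonetic algorithms.
def first_letter(word: str) -> str:
    return word[:1].lower()


def consonant_skeleton(word: str) -> str:
    return "".join(ch for ch in word.lower() if ch not in "aeiouyäöü")


ENCODERS = {
    "skeleton": (consonant_skeleton, 2.0),  # plays the role of the higher-weighted KPH
    "first_letter": (first_letter, 1.0),
}

print(weighted_agreement("Meier", "Mayer", ENCODERS))      # 1.0, both encoders agree
print(weighted_agreement("Schmidt", "Schmitt", ENCODERS))  # about 0.33, only one agrees
```

A threshold on the returned score then decides whether two inputs count as a match; where to put that threshold depends on how costly false hits are.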
from cologne_phonetics.
On second thought, it might be viable to add some sort of functionality to disable/enable certain encoding rules, essentially creating your own lookup table. You'd just have to be aware of the implications this has for other cases.
Would something like that work for you?
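To illustrate the idea, a minimal sketch with two named rules that can be switched off individually. This is not the package's API: partial_encode, the rule names and the disabled parameter are all hypothetical, and only the d/t rules are modelled, so the output is a partially encoded string rather than a real phonetic code.

```python
import re

# Two illustrative rules as (name, pattern, code). Order matters: the
# lookahead rule has to run before the catch-all d/t rule.
RULES = [
    ("dt_before_csz", re.compile(r"[dt](?=[csz])"), "8"),
    ("dt_default", re.compile(r"[dt]"), "2"),
]

# Collapses runs of the same digit, mirroring the duplicate-removal step.
_REPEATED_DIGIT = re.compile(r"(\d)\1+")


def partial_encode(word: str, disabled: frozenset = frozenset()) -> str:
    """Apply every rule whose name is not in disabled, then collapse repeats."""
    text = word.lower()
    for name, pattern, code in RULES:
        if name not in disabled:
            text = pattern.sub(code, text)
    return _REPEATED_DIGIT.sub(r"\1", text)


print(partial_encode("Hardtstraße"))  # har28s2raße
print(partial_encode("Hardstraße"))   # har8s2raße

off = frozenset({"dt_before_csz"})
print(partial_encode("Hardtstraße", off))  # har2s2raße
print(partial_encode("Hardstraße", off))   # har2s2raße
```

With the lookahead rule disabled, both street names collapse to the same string, but so would every other word that relies on d/t turning into 8 in front of c, s or z.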
from cologne_phonetics.
Certainly, it would. One could probably already do so by adding or removing some of your regex functions in RGX_RULES, but that might break parts of your logic, wouldn't it?
My overall aim is simply to tolerate the many spelling variants of street names, particularly those that include names with many different versions, like:
meier = ["meier", "maier", "mair", "mayer", "mayr", "meyer", "meyr"]
erhart = ["erhardt","erhard","erhart"]
schmitt = ["schmidt","schmitt","schmit","schmid"]
...
In fact, for difficult ones like meyr, phonetic matching might not be sufficient, but for schmitt and erhart I would really like to make this work.
from cologne_phonetics.
Reading through your comment and the Wikipedia articles again, I agree with your first comment that it's probably an acceptable flaw in exchange for more stable overall results.
Maybe I would need to split the compound words first and then run the algorithm to get better results.
However, that would be pointless (e.g. with this package) if the right noun cannot be looked up due to spelling mistakes or missing variants...
I'd definitely prefer an easier solution based on the Kölner Phonetik with just a few tweaks. I'm not sure whether anyone has ever made a modification like this before.
Otherwise, I could also go radical and globally replace e.g. all dt, tt, d with t, and do the same with th and other edge cases I might encounter.
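A minimal sketch of that radical replacement, assuming it runs as a preprocessing step before any phonetic encoding. The name normalize_t_sounds is made up for illustration; the variant lists are the ones from above.

```python
import re

# Multi-letter clusters are listed before the single d so that "dt" is
# replaced as one unit instead of turning into "tt".
_T_LIKE = re.compile(r"dt|tt|th|d")


def normalize_t_sounds(word: str) -> str:
    """Lower-case the word and map dt, tt, th and d to a single t."""
    return _T_LIKE.sub("t", word.lower())


schmitt = ["schmidt", "schmitt", "schmit", "schmid"]
erhart = ["erhardt", "erhard", "erhart"]

print({normalize_t_sounds(name) for name in schmitt})  # {'schmit'}
print({normalize_t_sounds(name) for name in erhart})   # {'erhart'}
print(normalize_t_sounds("Hardtstraße") == normalize_t_sounds("Hardstraße"))  # True
```

The price is that every d becomes a t and every th is collapsed, including a th that only arises at a compound boundary, so names that differ only in those letters can no longer be told apart.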
from cologne_phonetics.
After more tests, I realized that in my case Kölner Phonetik is just not fit for purpose. It might work great for searching, where false hits are not a big deal, but it doesn't work for actual matching. Just a few examples why:
Based on this command, where s is the string:
import cologne_phonetics
cologne_phonetics.encode(s, concat=True)
Herrenmattstraße == Herrenstraße
Gerenotstraße == Kornstraße
Zellerstraße == Schillerstraße
The phonetic similarity of Z and Sch as well as e and i makes sense to me, so treating Zellerstraße and Schillerstraße as equal is legitimate. However, Herrenmattstraße cannot equal Herrenstraße for matching, only for searching.
Closing this issue as my matching premise is simply out of scope for Kölner Phonetik.
from cologne_phonetics.