Comments (9)
Hi @bgeisberger, greetings to Karlsruhe! :-) Thank you for using my library and for this feature request.
Actually, I was already thinking about implementing some kind of confidence measure for the detection results. I may add it as early as the next version. I will most certainly make it an optional feature, so that users can decide whether they want to risk getting false positives or simply have UNKNOWN returned.
I hope you nevertheless enjoy using my library. Are you still experimenting with it, or are you already using it in production? I'm curious, so feel free to tell me if you like. Maybe I can derive other useful additions from your use cases. Thanks!
from lingua.
Hi, thanks for the answer, greetings back :)
That sounds good.
Currently I'm just experimenting, but the results are really great for most languages and I'm planning to replace the current implementation using optimaize langdetect with your library. Our current use-case is to detect the language of emails and to create specific white- or blacklist filters afterwards. At the moment I still have to wait for Japanese support, as we have some filters for that specific language.
Thanks for actively developing this library!
Wow, that sounds cool. This is the first feedback I have received that someone plans to use my library in production. Great news! :-) Indeed, detecting the language of emails is a perfect use case, as they usually tend to be quite short.
Japanese is actually the next on my todo list, so please stay tuned.
Thanks again for your support. I appreciate this a lot. 👍
I tested your library, and I also would like to use it in our music ingestion pipeline. We do need, however, this confidence feature implemented before we can deploy it.
For example, a recording with title "Prologue" is detected as French, although this word is exactly the same in English. I would rather keep it "Unknown", so we can later process it from the album title or other sources.
In this example the probabilities of French and English should be the same. I believe that returning "French" in such a situation is not a desired outcome for any user.
Hi @volgin, I thank you as well for using my library.
Ok I see, the confidence feature seems to be quite important for users. So I will focus on implementing this one next. Thanks for letting me know.
@bgeisberger @volgin I have just implemented this feature with commit fe1dee9. You can define the minimum relative distance as follows:
val detector = LanguageDetectorBuilder.fromAllBuiltInLanguages().withMinimumRelativeDistance(0.52).build()
The distance must lie between 0.0 and 0.99. The default value is 0.0, which means that a language is always returned, at the risk of false positives. This is the same as the previous behavior. The higher the distance value, the pickier the detector gets.
Be aware that the distance between the summed log probabilities of the candidate languages depends on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short phrases, do not set the minimum relative distance too high. Otherwise most results will be returned as UNKNOWN.
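To make the effect of the threshold more concrete, here is a minimal, self-contained sketch of how such a check could work. It is not the code from the commit mentioned above: the `Lang` enum, the `pickLanguage` function, and the choice to normalize the summed log probabilities before comparing the top two candidates are all assumptions made for illustration.

```kotlin
import kotlin.math.exp

// Hypothetical stand-in for the library's language enum.
enum class Lang { ENGLISH, FRENCH, UNKNOWN }

/**
 * Returns the most likely language, or UNKNOWN if the two best candidates
 * are closer together than minimumRelativeDistance.
 *
 * Assumption: the summed log probabilities are normalized so that they
 * add up to 1, and the distance is the gap between the top two shares.
 */
fun pickLanguage(logProbs: Map<Lang, Double>, minimumRelativeDistance: Double): Lang {
    require(minimumRelativeDistance in 0.0..0.99) {
        "minimum relative distance must lie between 0.0 and 0.99"
    }
    if (logProbs.isEmpty()) return Lang.UNKNOWN

    // Rank candidates from most to least likely (the sort is stable on ties).
    val ranked = logProbs.entries.sortedByDescending { it.value }
    if (ranked.size == 1) return ranked[0].key

    // Subtract the best score before exponentiating to avoid underflow.
    val best = ranked[0].value
    val weights = ranked.map { exp(it.value - best) }
    val total = weights.sum()

    // Gap between the normalized shares of the two best candidates.
    val distance = (weights[0] - weights[1]) / total

    return if (distance < minimumRelativeDistance) Lang.UNKNOWN else ranked[0].key
}
```

Under these assumptions, a tie such as `mapOf(Lang.ENGLISH to -12.0, Lang.FRENCH to -12.0)` still yields a language with the default of 0.0, but yields `Lang.UNKNOWN` with any positive threshold. Scores that lie close together, as is typical for short inputs, fall below a threshold of 0.52, while the larger gaps that build up over longer inputs pass it.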
Before I release version 0.5.0, I will add a few more languages and unit tests. However, feel free to test this feature right away and please let me know what you think about it. Thanks.
I'm gonna close this issue now as the requested feature has been implemented. If you find a bug or think that the minimum relative distance calculation does not work as expected, then please open a new issue. Thanks.
Just to cheer on @pemistahl for the good work: we've already been using it in production :) It mostly works very well (unfortunately, we occasionally have to deal with unexpected Japanese texts).
Anyway, thank you for your library. This is the only open-source library that worked well for us with short messages.
Thank you very much @abdolence for your kind words. I'm happy that my library is obviously useful for some people. (-: