
Comments (9)

pemistahl commented on May 18, 2024

Hi @bgeisberger, greetings to Karlsruhe! :-) Thank you for using my library and for this feature request.

Actually, I was already thinking about implementing some kind of confidence measure for the detection results. Maybe I will do it in the next version. In any case, I will most certainly implement it as an optional feature so that users can decide whether they want to risk getting false positives or simply get UNKNOWN returned.

I hope you nevertheless enjoy using my library. Are you experimenting with it first or are you already using it in production? I'm curious, so feel free to tell me if you like. Maybe I can derive other useful additions from your use cases. Thanks!

from lingua.

bgeisberger commented on May 18, 2024

Hi, thanks for the answer, greetings back :)
That sounds good.
Currently I'm just experimenting, but the results are really great for most languages and I'm planning to replace the current implementation using optimaize langdetect with your library. Our current use-case is to detect the language of emails and to create specific white- or blacklist filters afterwards. At the moment I still have to wait for Japanese support, as we have some filters for that specific language.

Thanks for actively developing this library!


pemistahl commented on May 18, 2024

Wow, that sounds cool. This is the first feedback I have received that my library is planned for production use. Great news! :-) Indeed, detecting the language of emails is a perfect use case as they usually tend to be quite short.

Japanese is actually the next on my todo list, so please stay tuned.

Thanks again for your support. I appreciate this a lot. 👍


volgin commented on May 18, 2024

I tested your library, and I also would like to use it in our music ingestion pipeline. We do need, however, this confidence feature implemented before we can deploy it.

For example, a recording with title "Prologue" is detected as French, although this word is exactly the same in English. I would rather keep it "Unknown", so we can later process it from the album title or other sources.

In this example the probabilities of French and English should be the same. I believe that returning "French" in such a situation is not a desired outcome for any user.


pemistahl commented on May 18, 2024

Hi @volgin, I thank you as well for using my library.

Ok I see, the confidence feature seems to be quite important for users. So I will focus on implementing this one next. Thanks for letting me know.


pemistahl commented on May 18, 2024

@bgeisberger @volgin I have just implemented this feature with commit fe1dee9. You can define the minimum relative distance as follows:

val detector = LanguageDetectorBuilder.withMinimumRelativeDistance(0.52).build()

The distance must lie between 0.0 and 0.99. The default value is 0.0, which means that a language is always returned, risking false positives. This is the same as the previous behavior. The higher the distance value, the pickier the detector gets.

Be aware that the distance between the summed log-probabilities of the candidate languages depends on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short phrases, do not set the minimum relative distance too high. Otherwise most results will be returned as UNKNOWN.
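To illustrate the length effect, here is a minimal, self-contained sketch. It is not Lingua's actual implementation: the function name pickLanguage, the plain difference between the two best scores, and all numbers are assumptions made up for illustration. It only shows why the same threshold can reject a short text and accept a longer one.

```kotlin
// Hypothetical sketch, NOT Lingua's real code or formula.
// `scores` maps a language name to its summed log-probability (higher is more likely).
fun pickLanguage(scores: Map<String, Double>, minimumDistance: Double): String {
    val ranked = scores.entries.sortedByDescending { it.value }
    if (ranked.isEmpty()) return "UNKNOWN"
    if (ranked.size == 1) return ranked[0].key

    // Assumption: the distance is the plain gap between the two best scores.
    val distance = ranked[0].value - ranked[1].value

    // With a threshold of 0.0 this never rejects, so a language is always returned.
    return if (distance < minimumDistance) "UNKNOWN" else ranked[0].key
}

fun main() {
    // Invented scores: a short text yields a small gap between the languages,
    // a text ten times longer yields a gap roughly ten times larger.
    val shortText = mapOf("FRENCH" to -12.0, "ENGLISH" to -12.3)
    val longText = mapOf("FRENCH" to -120.0, "ENGLISH" to -123.0)

    println(pickLanguage(shortText, 0.5)) // gap of about 0.3 is below the threshold
    println(pickLanguage(longText, 0.5))  // gap of 3.0 is above the threshold
    println(pickLanguage(shortText, 0.0)) // threshold 0.0 always returns a language
}
```

In this sketch, a one-word title that scores identically in two languages has a gap of 0.0, so any threshold above 0.0 turns it into UNKNOWN.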

Before I release version 0.5.0, I will add a few more languages and unit tests. However, feel free to test this feature now and please let me know what you think about it. Thanks.


pemistahl commented on May 18, 2024

I'm gonna close this issue now as the requested feature has been implemented. If you find a bug or think that the minimum relative distance calculation does not work as expected, then please open a new issue. Thanks.


abdolence commented on May 18, 2024

Just to cheer on @pemistahl for the good work: we've already been using it in production :) It mostly works very well (unfortunately, we occasionally have unexpected Japanese texts to deal with).
Anyway, thank you for your library. This is the only open-source library that worked well for us with short messages.


pemistahl commented on May 18, 2024

Thank you very much @abdolence for your kind words. I'm happy that my library is obviously useful for some people. (-:

