whatlanggo's Introduction

Whatlanggo

Natural language detection for Go.

Features

  • Supports 84 languages
  • 100% written in Go
  • No external dependencies
  • Fast
  • Recognizes not only the language but also the script (Latin, Cyrillic, etc.)

Getting started

Installation:

    go get -u github.com/abadojack/whatlanggo

Simple usage example:

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	info := whatlanggo.Detect("Foje funkcias kaj foje ne funkcias")
	fmt.Println("Language:", info.Lang.String(), " Script:", whatlanggo.Scripts[info.Script], " Confidence: ", info.Confidence)
}

Blacklisting and whitelisting

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	//Blacklist
	options := whatlanggo.Options{
		Blacklist: map[whatlanggo.Lang]bool{
			whatlanggo.Ydd: true,
		},
	}

	info := whatlanggo.DetectWithOptions("האקדמיה ללשון העברית", options)

	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])

	//Whitelist
	options1 := whatlanggo.Options{
		Whitelist: map[whatlanggo.Lang]bool{
			whatlanggo.Epo: true,
			whatlanggo.Ukr: true,
		},
	}

	info = whatlanggo.DetectWithOptions("Mi ne scias", options1)
	fmt.Println("Language:", info.Lang.String(), " Script:", whatlanggo.Scripts[info.Script])
}

For more details, please check the documentation.

Requirements

Go 1.8 or higher

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, a particular case of n-gram models. To understand the idea, please check the original paper: Cavnar and Trenkle '94, "N-Gram-Based Text Categorization".
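As a rough illustration of the trigram idea, the sketch below builds a ranked trigram profile for a text and compares two profiles with the rank-based "out-of-place" distance described by Cavnar and Trenkle. It is not whatlanggo's actual code: the function names, the space padding, and the penalty value are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramCounts slides a three-rune window over the lowercased text
// (padded with one space on each side, an assumption of this sketch)
// and counts how often each trigram occurs.
func trigramCounts(text string) map[string]int {
	runes := []rune(" " + strings.ToLower(text) + " ")
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// rankedProfile orders trigrams by descending count. Ties are broken
// alphabetically so the result does not depend on map iteration order.
func rankedProfile(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	return keys
}

// outOfPlace sums, for each trigram in the text profile, how far its rank
// is from its rank in the language profile. Trigrams missing from the
// language profile receive a fixed penalty (hypothetical value).
func outOfPlace(textProfile, langProfile []string, maxPenalty int) int {
	rank := make(map[string]int, len(langProfile))
	for i, t := range langProfile {
		rank[t] = i
	}
	dist := 0
	for i, t := range textProfile {
		r, ok := rank[t]
		if !ok {
			dist += maxPenalty
			continue
		}
		d := i - r
		if d < 0 {
			d = -d
		}
		dist += d
	}
	return dist
}

func main() {
	profile := rankedProfile(trigramCounts("banana"))
	fmt.Println(profile)
	// A profile compared with itself has distance 0.
	fmt.Println(outOfPlace(profile, profile, 300))
}
```

The language whose stored profile has the smallest distance to the text profile would be the best guess under this scheme.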

How is IsReliable calculated?

It is based on the following factors:

  • How many unique trigrams the given text contains
  • How large the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

These two factors can therefore be presented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola and looks like the following:

[Figure: Language recognition whatlang rust]

For more details, please check a blog article Introduction to Rust Whatlang Library and Natural Language Identification Algorithms.
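To make the hyperbolic threshold concrete, here is a minimal sketch of how a reliability check along those lines could look. The constants, the exact definition of rate, and all function names are assumptions chosen for illustration; the library's real values and formula may differ.

```go
package main

import "fmt"

// Illustrative constants only (assumptions, not the library's values).
const (
	hyperbolaScale = 2.0  // how strongly short texts raise the required rate
	baseThreshold  = 0.05 // required rate approached for very long texts
)

// rate is the relative gap between the best and second-best scores
// (assumed here: higher score means better match).
func rate(best, second float64) float64 {
	if best == 0 {
		return 0
	}
	return (best - second) / best
}

// threshold is a hyperbola in the number of unique trigrams:
// the fewer trigrams, the larger the gap that is required.
func threshold(uniqueTrigrams int) float64 {
	return hyperbolaScale/float64(uniqueTrigrams) + baseThreshold
}

// isReliable reports whether the point (uniqueTrigrams, rate) lies
// above the threshold curve.
func isReliable(best, second float64, uniqueTrigrams int) bool {
	if uniqueTrigrams <= 0 {
		return false
	}
	return rate(best, second) > threshold(uniqueTrigrams)
}

func main() {
	fmt.Println(isReliable(100, 50, 20)) // prints "true": long text, clear gap
	fmt.Println(isReliable(100, 50, 3))  // prints "false": same gap, very short text
}
```

The same score gap can be reliable for a long text and unreliable for a short one, which is the behaviour the hyperbola is meant to capture.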

License

MIT

Derivation

whatlanggo is a derivative of Franc (JavaScript, MIT) by Titus Wormer.

Acknowledgements

Thanks to greyblake (Potapov Sergey) for creating whatlang-rs, from which I took the idea and algorithms.

whatlanggo's Issues

Non-deterministic output for certain language queries with white list options

Here is an example:

package main

import (
    "fmt"
    "github.com/abadojack/whatlanggo"
)

// whitelistOptions restricts detection to the listed languages.
var whitelistOptions = whatlanggo.Options{
    Whitelist: map[whatlanggo.Lang]bool{
        whatlanggo.Eng: true,
        whatlanggo.Jpn: true,
        whatlanggo.Kor: true,
        whatlanggo.Vie: true,
        whatlanggo.Ind: true,
        whatlanggo.Cmn: true,
    },
}

func main() {
    info := whatlanggo.DetectWithOptions("appel", whitelistOptions)
    fmt.Println("Language:", whatlanggo.LangToString(info.Lang), "Script:", whatlanggo.Scripts[info.Script])
}

func GetListOfLangsBaseOnScript

Thanks for a great lib.
I've added a func to get a list of Lang values by script.
Maybe it will be useful for you too.

// GetListOfLangsBaseOnScript returns the languages that can be written
// in the given script, or nil if the script is not supported.
func GetListOfLangsBaseOnScript(script *unicode.RangeTable) []Lang {
	var res []Lang
	switch script {
	case unicode.Latin:
		for k := range latinLangs {
			res = append(res, k)
		}
		return res
	case unicode.Cyrillic:
		for k := range cyrillicLangs {
			res = append(res, k)
		}
		return res
	case unicode.Devanagari:
		for k := range devanagariLangs {
			res = append(res, k)
		}
		return res
	case unicode.Hebrew:
		for k := range hebrewLangs {
			res = append(res, k)
		}
		return res
	case unicode.Ethiopic:
		for k := range ethiopicLangs {
			res = append(res, k)
		}
		return res
	case unicode.Arabic:
		for k := range arabicLangs {
			res = append(res, k)
		}
		return res
	case unicode.Han:
		res = append(res, Cmn)
		return res
	case unicode.Bengali:
		res = append(res, Ben)
		return res
	case unicode.Hangul:
		res = append(res, Kor)
		return res
	case unicode.Georgian:
		res = append(res, Kat)
		return res
	case unicode.Greek:
		res = append(res, Ell)
		return res
	case unicode.Kannada:
		res = append(res, Kan)
		return res
	case unicode.Tamil:
		res = append(res, Tam)
		return res
	case unicode.Thai:
		res = append(res, Tha)
		return res
	case unicode.Gujarati:
		res = append(res, Guj)
		return res
	case unicode.Gurmukhi:
		res = append(res, Pan)
		return res
	case unicode.Telugu:
		res = append(res, Tel)
		return res
	case unicode.Malayalam:
		res = append(res, Mal)
		return res
	case unicode.Oriya:
		res = append(res, Ori)
		return res
	case unicode.Myanmar:
		res = append(res, Mya)
		return res
	case unicode.Sinhala:
		res = append(res, Sin)
		return res
	case unicode.Khmer:
		res = append(res, Khm)
		return res
	case unicode.Katakana:
		res = append(res, Jpn)
		return res
	case unicode.Hiragana:
		res = append(res, Jpn)
		return res
	}
	return nil
}

Not detecting an Arabic word

The following Arabic word for "ordering" does not seem to be detected correctly:

يأمُر

Google is able to identify it correctly.

Struggling to detect English

I took your awesome lib and wrapped it in a little command line app. I also added a conversion table from ISO 639-3 to ISO 639-1.

▶ ./langdetect "Le candidat socialiste à l’élection présidentielle"
Language: fra Script: Latin
fr

Correct!

▶ ./langdetect "Mitt namn på svenska är Peter"
Language: swe Script: Latin
sv

Correct!

But....

▶ ./langdetect "testing in english"
Language: uig Script: Latin
ug

Not right.

▶ ./langdetect "wondering if it still works in English"
Language: nld Script: Latin
nl

Not right either.

Also, would it be possible to output a list of probabilities? That way my app, where I hope to use this, could throw warnings if the probabilities "aren't certain enough".
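For readers building a similar wrapper, the sketch below shows one way the ISO 639-3 to ISO 639-1 lookup and a low-confidence warning could be wired together. The table only covers the codes that appear in the examples above, and the helper names and the cutoff are hypothetical. The README's usage example does show a Confidence field on the detection result, which is the kind of value `needsWarning` would be fed.

```go
package main

import "fmt"

// iso3to1 covers only the codes seen in the examples above;
// a real table would need one entry per supported language.
var iso3to1 = map[string]string{
	"eng": "en",
	"fra": "fr",
	"nld": "nl",
	"swe": "sv",
	"uig": "ug",
}

// toISO6391 returns the two-letter code and whether one was found.
func toISO6391(code3 string) (string, bool) {
	code1, ok := iso3to1[code3]
	return code1, ok
}

// needsWarning reports whether a confidence value falls below a
// caller-chosen cutoff. The cutoff is an application decision.
func needsWarning(confidence, minConfidence float64) bool {
	return confidence < minConfidence
}

func main() {
	code, ok := toISO6391("swe")
	fmt.Println(code, ok)                // prints "sv true"
	fmt.Println(needsWarning(0.07, 0.5)) // prints "true"
}
```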

Non-deterministic output for certain language queries

I've found that

lang := whatlanggo.Detect("wondering if this works").Lang
fmt.Println(whatlanggo.LangToString(lang))

prints "eng" about 75% of the time but "nld" about 25% of the time.

Maybe the randomness comes from the use of maps, since in Go the iteration order over map keys is randomized.
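If that hypothesis is right, two languages with equal scores could come out in either order from one run to the next. A minimal sketch of a deterministic alternative is shown below: collect the candidates into a slice and sort with an explicit tie-breaker. The type, the function name, and the scores are made up for illustration and are not taken from the library.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate pairs a language code with its score (hypothetical type).
type candidate struct {
	lang  string
	score int
}

// rankCandidates turns a score map into a slice ordered by score
// (lower is better in this sketch), breaking ties by language code so
// the result never depends on map iteration order.
func rankCandidates(scores map[string]int) []candidate {
	out := make([]candidate, 0, len(scores))
	for lang, score := range scores {
		out = append(out, candidate{lang, score})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].score != out[j].score {
			return out[i].score < out[j].score
		}
		return out[i].lang < out[j].lang
	})
	return out
}

func main() {
	// "eng" and "nld" are tied on purpose.
	scores := map[string]int{"nld": 420, "eng": 420, "deu": 510}
	for i := 0; i < 3; i++ {
		fmt.Println(rankCandidates(scores)[0].lang) // always prints "eng"
	}
}
```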

What does a very negative confidence rate mean?

Hi,

Hope you are all well!

I get confidence rates such as -18.66532829205885 or -10.605926394815977. What does that mean?

Language: Yoruba  Script: Latin  Confidence:  -8.652592309409306
Language: Turkmen  Script: Latin  Confidence:  -5.528339197102301
Language: Yoruba  Script: Latin  Confidence:  -8.163311123289779
Language: Chewa  Script: Latin  Confidence:  -0.8738781333466048
Language: Yoruba  Script: Latin  Confidence:  -7.287061394685147
Language: Yoruba  Script: Latin  Confidence:  -9.46254452788719
Language: Mandarin  Script: Han  Confidence:  1
Language: English  Script: Latin  Confidence:  -18.66532829205885
Language: Yoruba  Script: Latin  Confidence:  -10.605926394815977

Cheers,
X

Undefined sort.SliceStable

# github.com/abadojack/whatlanggo

../../.gvm/pkgsets/go1.7.3/global/src/github.com/abadojack/whatlanggo/detect.go:120: undefined: sort.SliceStable
../../.gvm/pkgsets/go1.7.3/global/src/github.com/abadojack/whatlanggo/trigrams.go:31: undefined: sort.SliceStable

Required minimum version?

I tried to build this on 1.6.2 and got this error:

$ go get -u github.com/abadojack/whatlanggo
# github.com/abadojack/whatlanggo
gocode/src/github.com/abadojack/whatlanggo/detect.go:120: undefined: sort.SliceStable
gocode/src/github.com/abadojack/whatlanggo/trigrams.go:31: undefined: sort.SliceStable

$ go version
go version go1.6.2 linux/amd64
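The errors above occur because sort.SliceStable was added in Go 1.8. For anyone stuck on an older toolchain, a stable sort can still be written with sort.Stable and a small sort.Interface implementation, as in the sketch below. The type and field names are hypothetical and do not mirror the library's internals.

```go
package main

import (
	"fmt"
	"sort"
)

// langDistance is a hypothetical pair of language code and distance.
type langDistance struct {
	lang string
	dist int
}

// byDist implements sort.Interface, which predates sort.SliceStable.
type byDist []langDistance

func (s byDist) Len() int           { return len(s) }
func (s byDist) Less(i, j int) bool { return s[i].dist < s[j].dist }
func (s byDist) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// sortByDist sorts in place by ascending distance and keeps the
// original order of elements with equal distance.
func sortByDist(items []langDistance) {
	sort.Stable(byDist(items))
}

func main() {
	items := []langDistance{{"nld", 7}, {"eng", 3}, {"fra", 7}}
	sortByDist(items)
	fmt.Println(items) // prints "[{eng 3} {nld 7} {fra 7}]"
}
```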

Language detection issue

When using the Detect function, Arabic and English are not detected properly.

We expect the language to be English for "hi" and "hello".

But for "hi" we get the response below:
Language: Zulu Script: Latin Confidence: 0.005592493630771142

and for "hello" we get:
Language: Somali Script: Latin Confidence: 0.010694234025487925

How can we provide two default languages, such as Arabic and English? If Arabic is not detected, the result should fall back to English with a confidence value.

If we try to detect a string that contains two languages, "الأجهزة تحت testing type", we get neither English nor Arabic:

Language: Uyghur Script: Latin Confidence: 0.06648113790970933

Any idea how to handle this?
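The output for the mixed string reports Script: Latin, which suggests (this is an assumption, not something the source confirms) that the detector picks a single dominant script before choosing a language. A caller who wants to spot mixed input beforehand could count letters per script with the standard unicode package, as in this sketch. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptCounts counts the Arabic-script and Latin-script characters in
// text. Characters from other scripts, digits, and spaces are ignored.
func scriptCounts(text string) (arabic, latin int) {
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Arabic, r):
			arabic++
		case unicode.Is(unicode.Latin, r):
			latin++
		}
	}
	return arabic, latin
}

func main() {
	a, l := scriptCounts("الأجهزة تحت testing type")
	fmt.Println("Arabic:", a, "Latin:", l)
	if a > 0 && l > 0 {
		fmt.Println("mixed input: consider detecting each script run separately")
	}
}
```

Splitting the input into per-script runs and detecting each run on its own is one possible follow-up, at the cost of even shorter inputs per call.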

Detection problem for short text / training option

Hi,

Hope you are all well!

I have a problem detecting French on short sentences like the ones below.

Sentence               Language Detected   Real Language   Location
Ras.                   Esperanto           French          France
RAS bon.               Esperanto           French          France
PAS DE SOUCI.          Portuguese          French          France
Bien.                  Spanish             French          France
RIEN A SIGNALER.       Spanish             French          France
Nickel.                Polish              French          France
Pas assez de recul.    Portuguese          French          France
Je recommande.         Dutch               French          France

Is there a way to train the model with additional patterns or sentences in order to improve detection confidence?

By the way, I know the location of these sentences (they are all from France). Is there a way to influence the score with an additional parameter such as the location?

Thanks in advance for any insights or solutions!

Cheers,
X

Scientific paper

Does a scientific paper on this work exist?
Or can you provide any papers that you used for the development?
