pemistahl / lingua Goto Github PK

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

License: Apache License 2.0

Kotlin 100.00%

language-detection language-processing kotlin-library java-library nlp-library nlp nlp-machine-learning natural-language natural-language-processing android-library

lingua's Introduction

Hello, thank you for visiting my profile. 🖖🏻🤓

My name is Peter. There are actually more people than I thought who have the same first and family name, so I bother you with my middle name Michael as well. Were Type O Negative really so popular back then? I haven't got a clue...

I hold a Master's degree in computational linguistics from Saarland University in Saarbrücken, Germany. After my graduation in 2013, I decided against a research career because I like building things that help people now and not in the unforeseeable future.

Currently, I work for Riege, a leading provider of cloud-based software for the logistics industry. In my free time, I like working on open source projects in the fields of computational linguistics and string processing in general.

I have a special interest in modern programming languages and green computing. I believe that the software industry should make more significant contributions towards environmental protection. Great advances have been made to decrease energy consumption and emissions of hardware. However, those are often canceled out by poorly optimized software and resource-intensive runtime environments.

This is why I'm especially interested in the Rust programming language which allows writing performant and memory-safe applications without the need for a garbage collector or a virtual runtime environment, making use of modern syntax abstractions at the same time.

For those of you interested in how Rust and related technology can accomplish the goal of more eco-friendly software, I strongly recommend you to read the dissertation Energyware Engineering: Techniques and Tools for Green Software Development published in 2018 by Rui Pereira at the University of Minho in Portugal.

lingua's People

Contributors

Stargazers

Watchers

lingua's Issues

\p{IsLatin} is used and does not work on Android

LanguageDetector.kt line 412 uses \p{IsLatin}, which definitely ought to work on Android, but doesn't. Using \p{Latin} works, so I suggest you do that. The next two lines probably have the same issue.

Looks great BTW, and I love that you included Latin already.

Pull requests are ignored

The documentation mentions that pull requests are welcome but I have a pull request that's sitting for several months.

Some options:

Update the documentation to remove the statement which encourages contributions
OR approve and merge the pull request
OR add a comment explaining why the pull request is not desirable

How to implement?

Hello I am quite new in programming with android and I am currently working on Text To Speech App. I would like to implement Lingua so the TTS app can switch to the specified language according to the Lingua results. But I found some difficulties in implementing it. This is the process that I have done:

Adding implementation("com.github.pemistahl:lingua:0.6.1") to my Gradle
Putting import to my .kt file

import com.github.pemistahl.lingua.api.*
import com.github.pemistahl.lingua.api.Language.*

Putting the line in my code

val detector: LanguageDetector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build()
val detectedLanguage: Language = detector.detectLanguageOf(text = "languages are awesome")

I then would like to output the detectedLanguage in a toast.. but then everytime I clicked on the button that directs me to point number 3 the app closes down.

I would like to ask if can we detectLanguageOf from a variable. for example,

var checkLang =  "languages are awesome"
val detectedLanguage: Language = detector.detectLanguageOf(checkLang)

and is detectedLanguage a string?
Thank you for your help! this library is really awesome!

a priori reward for languages

I noticed that the software tends to confuse between similar languages like Dutch and Afrikaans. This is not strange, obviously.

Is it possible to add a boost for any language(s), which are more likely in advance.

This could be helpful even for better detection of segments inside a longer document.
For instance, in my scenario I need to identify portions of foreign text within a text which is written in a given language.
For instance in a Chinese text like the following:

毫 不 夸 张 地 说 ， 这是 最 让 人 感 动的 一 块 金 牌 。
Singhal ： 在 911 事件 發 生 時
那個 平等 是因 爲 他們 著 色 的行 中 。
随着 航 班 恢复 ， 俄罗斯 的 旅行 社 与 包 机 公司 必 将 松 一 口 气 。
Primark 的一 位发言 人 说 ： “Primark 已经知 道 了 8 月 9 日 （ 周 二 ） 发生 在我 们 Folkestone 店 内的 事情 。
OPEC 还在 月 报 中 表示 ， 原 油 价格 低 廉 已 促使 全球 炼 油 商 生产 更多 精 炼 油 品 ， 从而 加 重 了 市场 供应 过 剩 的程 度 。
这 名 试图 自 杀 的女 性 于 周 四 清 晨 在 高速公 路 上 被 多 辆 车 辗 过

Which are wrongly (in my opinion) detected as

zh
tl
zh
zh
eo
pt
zh

Personally I would label them all Chinese text.

A final question: would it be possible to spot the non-Chinese portions inside a segment?

Specifying languages at runtime

Is it possible to select a subset of the available languages, which detect the right language among, after loading all languages;
This could be useful to modify this subset for each single sentence to label, in the case segment-based external knowledge is provided.

IllegalStateException thrown for unusual case

I'm able to configure the LanguageDetector as follows:

LanguageDetector languageDetector =
    LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.UNKNOWN)
                           .build();

When trying to compute the probabilities of the languages for the content 그 가격으로는 최상, the following exception is thrown:

Exception in thread "main" java.lang.IllegalStateException: inputStream must not be null
	at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:346)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:353)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:72)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:336)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:312)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:299)
	at com.github.pemistahl.lingua.api.LanguageDetector.addNgramProbabilities$lingua(LanguageDetector.kt:164)
	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:116)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$3(LanguageDetectionTimeAnalysis.java:83)
	at java.util.ArrayList.forEach(ArrayList.java:1257)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$2(LanguageDetectionTimeAnalysis.java:81)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$TimedEvent.time(LanguageDetectionTimeAnalysis.java:107)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$1(LanguageDetectionTimeAnalysis.java:89)

This exception is not thrown for other clearly non-English content (e.g. 여보세요), although changing from Language.UNKNOWN to Language.GERMAN solves this issue.

If Language.UNKNOWN is not meant to be included in the fromLanguages collection, a suitable exception should be thrown to indicate this.

As a side note, my use case for including Language.ENGLISH and Language.UNKNOWN is that, for my use case, I only care to know whether or not the language is English so would prefer to maintain the ability to include Language.UNKNOWN.

Update dependencies

An update to the newest project dependencies is required for a stable release 1.0.0.

Add Swahili language support

As requested by @orpheusx in #34, language detection support for Swahili will be added.

Rules-based unique character detection too sensitive

This string is detected as KAZAKH:
Hello this is definitely an English sentence: 'ң is a letter of the Cyrillic script'

I think the same over-sensitivity could apply to:

any combination of { Language w/o uniqueCharacters } { 1 Language w/ uniqueCharacters}.
any combination of { Language**s** w/o uniqueCharacters } { 1 Language w/ uniqueCharacters}. , e.g. if our text is composed of these languages:

AFRIKAANS (AF, AFR, setOf(Alphabet.LATIN), ""),
BASQUE (EU, EUS, setOf(Alphabet.LATIN), ""),
ENGLISH (EN, ENG, setOf(Alphabet.LATIN), ""),
SPANISH (ES, SPA, setOf(Alphabet.LATIN), "¿¡"),

I made a quick test string:
¡hallo ek toets iets! kaixo zerbait probatzen ari naiz! hello im testing something!"

I expected Spanish due to ¡, but Lingua detects BASQUE (presumably as its the longest of the Latin scripts, so hits the most n-grams). Does Lingua fallback to n-gram detect if there's >2 possible languages then?

Add Tsonga language support

Language detection support for Tsonga will be added.

Detection of unknown characters throws NullPointerException

Thanks for a great work.
Bump into an issue with NPE.

Checked a problem with Unit Test:

import com.github.pemistahl.lingua.api.{Language, LanguageDetectorBuilder}
import org.scalatest._

class LanguageDetectTest extends FlatSpec with Matchers {

	val languageDetector = LanguageDetectorBuilder.
		fromLanguages(Language.ENGLISH,Language.GERMAN,Language.FRENCH).
		build()

	"A language detector" should "detect simple text" in {
		(languageDetector.detectLanguageOf("Hello").getIsoCode) should be("en")
	}

	it should "work without exception for weird texts and chars" in {
		noException shouldBe thrownBy {
			languageDetector.detectLanguageOf("""\_(ツ)_/¯""")
		}
	}
}

languageDetector.detectLanguageOf("""\_(ツ)_/¯""") - this is where NPE is occured.
This is a real text from user, actually :)

Callstack:

Caused by: kotlin.KotlinNullPointerException
	at com.github.pemistahl.lingua.api.LanguageDetector.getMostLikelyLanguage(LanguageDetector.kt:571)
	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:88)

Thanks

Rule-based engine doesn't respect specified languages

Rule-based engine is ignoring list of languages loaded in LanguageDetector.

LanguageDetectorBuilder
            .fromLanguages(ENGLISH, SPANISH).build()
            .detectLanguageOf(text = "你好，世界")

will return CHINESE, where I expect to receive UNKNOWN.

More benchmarks on language detection algorithms

Is Lingua truly the best algorithm in GitHub? We will see after these short messages 📺

https://github.com/wooorm/franc/tree/master/packages/franc-all (JS, 401 languages)
https://github.com/cloudmark/language-detect (Python, 271 languages)
https://github.com/kapsteur/franco (Golang, 175 languages)
https://github.com/patrickschur/language-detection (PHP, 110 languages)
https://github.com/richtr/guessLanguage.js (JS, 100 languages)
https://github.com/saffsd/langid.py (Python, 97 languages)
https://github.com/feedbackmine/language_detector (Ruby, 96 languages)
https://github.com/jonathansp/guess-language (Golang, 94 languages)
https://github.com/abadojack/whatlanggo (Golang, 84 languages)
https://github.com/chattylabs/language-detector (JS, 73 language)
https://github.com/optimaize/language-detector (Java, 71 languages)
https://github.com/endeveit/guesslanguage (Golang, 67 languages)
https://github.com/dsc/guess-language (Python, 64 languages)
https://github.com/decultured/Python-Language-Detector (Python, 58 languages)
https://github.com/Mimino666/langdetect (Python, 55 languages)
https://github.com/landrok/language-detector (PHP, 54 language)
https://github.com/shuyo/language-detection (Java, 53 languages)
https://github.com/newmsz/node-language-detection (JS, 53 languages)
https://github.com/pdonald/language-detection (C#, 53 languages)
https://github.com/malcolmgreaves/language-detection (Java, 53 languages)
https://github.com/FGRibreau/node-language-detect (JS, 52 languages)
https://github.com/webmil/text-language-detect (PHP, 52 languages)
https://github.com/pear/Text_LanguageDetect (PHP, 52 languages)
https://github.com/Imaginatio/langdetect (Java, 50 languages)

kotlin.jvm.KotlinReflectionNotSupportedError in beakerX jupyter notebook

Trying to use lingua on beakerX, a popular jupyter notebook kernel collection. I downloaded the lingua jar file with all dependencies (lingua-0.5.0-with-dependencies.jar) from bintray and added it to the classpath. But lingua only allows me to build a detector, but not detect languages, for example:
val det = LanguageDetectorBuilder.fromIsoCodes("en","de", "fr", "es", "it").build() works fine, however when I do:
det.detectLanguageOf("debate about the institutions").getIsoCode(), I get the following error:

kotlin.jvm.KotlinReflectionNotSupportedError: Kotlin reflection implementation is not found at runtime. Make sure you have kotlin-reflect.jar in the classpath
at kotlin.jvm.internal.ClassReference.error(ClassReference.kt:86)
at kotlin.jvm.internal.ClassReference.getSimpleName(ClassReference.kt:23)
at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:322)
at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:335)
at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:74)
at kotlin.SynchronizedLazyImpl.getValue(Lazy.kt:131)
at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:315)
at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:291)
at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:278)
at com.github.pemistahl.lingua.api.LanguageDetector.addNgramProbabilities$lingua(LanguageDetector.kt:169)
at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:113)
... 52 elided

Is kotlin-reflect.jar missing from lingua-0.5.0-with-dependencies.jar?

Mistake in Russian language model

Hello. Russian language alphabet not have this letter: "і", while Ukrainian or Belarusian have. This entries with it are mistake:

https://github.com/pemistahl/lingua/blob/master/src/main/resources/language-models/ru/bigrams.json

"кі","725/3944471"
"фі","5311/210102"

And I see a lot of issues like this in Russian folder. As you refer to Wortschatz corpora offered by Leipzig University, Germany - their data quality seems like a big question. At least alphabet easy to check I guess... I looked in unigrams too and there is:
"ћ","7836216/97387225"
This not Russian or even not Greek/Latin char (it could be something in math/science documents). Just curious what language will detect your library on this page article https://www.unian.ua/politics/10699929-tramp-doruchiv-u-travni-skasuvati-zaplanovaniy-vizit-na-inavguraciyu-zelenskogo-rozsekrechena-skarga.html (This is Ukrainian)

Support for Chinese Languages

Hello do you have a timeline for when support for Simplified Chinese, and Traditional Chinese will be added?

v0.6.1 seems better than v1.0.0

I compared version 0.6.1 vs version 1.0.0 on two private test sets

In the following table (see below) I reported results for both the version and their difference for all languages of my benchmark. Scores are the ratio of corrected classified segments with respect to a gold reference.
Actually, the sets contains real-world text from the web and from technical domains, and they are not manually checked. So it is possible, that they contain some wrongly classified sentences.

Nevertheless, v1.0.0 seems worse for than v0.6.1 for many languages.

One of the big difference relates with text including string in different languages:
For instance, Chinese segment having both Chinese and English, like the following ones, which are detected as English and Portuguese, instead of Chinese

Snapchat 并不 是唯一 一家 触 及 这些 文化 底 线 的公 司 
Gomes ： 我們 的目 標 是 提 升 搜 尋 服 務 的 品 質

or Greek segment with few western strings, like the following ones, which are detected as Danish and Italian, instead of Greek

Rasmussen μετά τη συνάντησή τους στο ΥΠΕΘΑ
γεγονός που, πέραν της σημασίας των σχετικών πολιτικών επαφών, εμπεριέχει και ιδιαίτερη συμβολική αξία, καθώς η επίσκεψη πραγματοποιήθηκε δύο μόλις έτη μετά την επίσκεψη του πρώην ιταλού Προέδρου, κ. Azeglio Ciampi, στην Αθήνα, στις 15-17.

or Arabic segment with few English strings, like the following ones, which are detected as English and Tagalog, instead of Arabic

أداة Google Scholar وضعت أبحاثًا متاحةً للجميع البحث عنها سهل والوصول إليها أسهل.
يارد "YARID" - اللاجئون الأفارقة الشباب للتنمية المتكاملة- بدأت كمحادثة داخل المُجتمع الكونغو

Several other examples, can be found even between languages with more similar alphabets.

It seems that v1.0.0 relies too much on Western alphabets to identify the language, without considering the amount of such Western characters.

set  lng Lingua Lingua100 diffs_vs_Lingua
setA ar  0.930   0.902  diff: -0.028
setA az  0.807   0.784  diff: -0.023
setA be  0.861   0.816  diff: -0.045
setA bg  0.801   0.734  diff: -0.067
setA bs  0.412   0.408  diff: -0.004
setA ca  0.760   0.762  diff: 0.002
setA cs  0.792   0.785  diff: -0.007
setA da  0.760   0.752  diff: -0.008
setA de  0.848   0.848  diff: 0
setA el  0.947   0.932  diff: -0.015
setA es  0.804   0.853  diff: 0.049
setA et  0.856   0.853  diff: -0.003
setA fi  0.865   0.864  diff: -0.001
setA fr  0.868   0.882  diff: 0.014
setA he  0.972   0.961  diff: -0.011
setA hi  0.790   0.733  diff: -0.057
setA hr  0.628   0.623  diff: -0.005
setA hu  0.858   0.848  diff: -0.01
setA hy  0.827   0.801  diff: -0.026
setA id  0.665   0.665  diff: 0
setA is  0.863   0.831  diff: -0.032
setA it  0.866   0.865  diff: -0.001
setA ja  0.758   0.752  diff: -0.006
setA ka  0.802   0.787  diff: -0.015
setA ko  0.887   0.827  diff: -0.06
setA lt  0.839   0.828  diff: -0.011
setA lv  0.882   0.869  diff: -0.013
setA mk  0.786   0.723  diff: -0.063
setA ms  0.801   0.809  diff: 0.008
setA nb  0.735   0.733  diff: -0.002
setA nl  0.799   0.835  diff: 0.036
setA nn  0.768   0.768  diff: 0
setA pl  0.879   0.881  diff: 0.002
setA pt  0.862   0.858  diff: -0.004
setA ro  0.765   0.751  diff: -0.014
setA ru  0.820   0.773  diff: -0.047
setA sk  0.783   0.766  diff: -0.017
setA sl  0.714   0.708  diff: -0.006
setA sq  0.829   0.826  diff: -0.003
setA sr  0.417   0.302  diff: -0.115
setA sv  0.833   0.830  diff: -0.003
setA th  0.940   0.927  diff: -0.013
setA tl  0.747   0.748  diff: 0.001
setA tr  0.901   0.895  diff: -0.006
setA uk  0.877   0.848  diff: -0.029
setA vi  0.920   0.877  diff: -0.043
setA zh  0.941   0.858  diff: -0.083

setB ar   0.996   0.988  diff: -0.008
setB bg   0.957   0.947  diff: -0.01
setB bs   0.495   0.494  diff: -0.001
setB ca   0.946   0.953  diff: 0.007
setB cs   0.993   0.992  diff: -0.001
setB da   0.947   0.946  diff: -0.001
setB de   0.996   0.996  diff: 0
setB el   0.996   0.992  diff: -0.004
setB en   0.964   0.966  diff: 0.002
setB es   0.897   0.920  diff: 0.023
setB et   0.978   0.974  diff: -0.004
setB fi   0.998   0.998  diff: 0
setB fr   0.962   0.971  diff: 0.009
setB he   1.000   0.999  diff: -0.001
setB hr   0.858   0.868  diff: 0.01
setB hu   0.988   0.988  diff: 0
setB id   0.765   0.765  diff: 0
setB is   0.979   0.971  diff: -0.008
setB it   0.939   0.937  diff: -0.002
setB ja   0.986   0.986  diff: 0
setB ko   0.998   0.998  diff: 0
setB lt   0.992   0.990  diff: -0.002
setB lv   0.990   0.983  diff: -0.007
setB mk   0.927   0.930  diff: 0.003
setB ms   0.927   0.927  diff: 0
setB nb   0.927   0.928  diff: 0.001
setB nl   0.921   0.949  diff: 0.028
setB nn   0.942   0.946  diff: 0.004
setB pl   0.993   0.992  diff: -0.001
setB pt   0.952   0.948  diff: -0.004
setB ro   0.964   0.958  diff: -0.006
setB ru   0.997   0.911  diff: -0.086
setB sk   0.977   0.975  diff: -0.002
setB sl   0.943   0.942  diff: -0.001
setB sq   0.983   0.983  diff: 0
setB sv   0.973   0.971  diff: -0.002
setB th   0.996   0.996  diff: 0
setB tr   0.993   0.990  diff: -0.003
setB uk   0.943   0.964  diff: 0.021
setB vi   0.994   0.954  diff: -0.04
setB zh  0.992   0.955  diff: -0.037

Shadow Gson 2.8.5 to avoid conflics with Hadoop

Hello,
Would it be possible to have com.google.code.gson:gson:2.8.5 shadowed.
The reason to do so is that when i try to use the package with spark in yarn there is a conflict with hadoop that uses a com.google.code.gson:gson:2.4.2 version of the library.

I would propose the change but my skills in kotling and gradle are quite poor or non existing.

Add Ganda language support

Language detection support for Ganda will be added.

Add Tswana language support

Language detection support for Tswana will be added.

training model for new languages

Is it possible to train models and hence a classifier for new languages, or update those of provided languages?

Provide memory usage of models + memory usage questions

Memory usage for languages would be quite handy to know - I was running a test and bumped into this:

Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.HashMap.resize(HashMap.java:705)
	at java.base/java.util.HashMap.putVal(HashMap.java:664)
	at java.base/java.util.HashMap.put(HashMap.java:613)
	at com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$Companion.fromJson(TrainingDataLanguageModel.kt:82)
	at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:351)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:357)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:72)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:340)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:316)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:303)
	at com.github.pemistahl.lingua.api.LanguageDetector.addNgramProbabilities$lingua(LanguageDetector.kt:164)
	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:116)

I'm creating & using my detector in a Kotlin class like:

  companion object {
       // always keep detector in memory
       private val detector = LanguageDetectorBuilder.fromAllBuiltInLanguages().build()

      fun detectLanguage(text: String): Language {
            // some processing, so this method isn't redundant
           return detector.detectLanguageOf(text)
}

Presumably I just blew up the static portion of memory (relatively small compared to total memory). I hadn't specified JVM arguments, and java -XX:+PrintFlagsFinal -version | grep HeapSize shows I default to 2GB Xmx.

Do you recommend building the detector every time when needed (is possible it gets GC'd and has to be rebuilt), or building it as a static object? Does the use-case depend on how many languages are loaded?

filterLanguagesByRules prevents correct language to be detected

As far as I can see, filterLanguagesByRules in LanguageDetector is used to narrow down the search space of candidate languages. It tries to find "typical" characters found only in certain languages, e.g. the character "ö" in ESTONIAN, FINNISH, GERMAN, HUNGARIAN, ICELANDIC, SWEDISH, TURKISH. However, given large English text containing only one such character makes it impossible to correctly detect the languages's text.

Is this the expected behaviour?

Thanks

Detect multiple languages in mixed-language text

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections of multiple languages, it will be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]

LanguageDetector and multithreading

I had a plan to use 'lingua' in a multi threaded Java environment, but, if I got it right, 'LanguageDetector' instance is not thread safe, i.e. if several threads will use it simultaneously, they may corrupt each other work. Am I right? New 'LanguageDetector' instance seems to be very expensive.

High memory consumption using just two languages

Hi @pemistahl I am trying to use lingua in my project. But the problem is it is taking too much heap memory for just prediction among two languages.

detector = LanguageDetectorBuilder.fromLanguages(Language.ENGLISH,
            Language.GERMAN).build();

This instanciation is taking about 22 MB. Now I am giving a sentence for prediction since lingua loads models lazily on first prediction now when I checked the memory using jv-mon I see a memory usage of 170MB and it keeps on increasing in idle state without any further sentence given for prediction. But the total language models size is just about 3 MB for each language.

So I would expect something around 22 MB + 6 MB total heap memory usage. Also when loading all languages it jumps to 1.5 GB whereas the total language models size is about 200 MB. Am I missing something why is this happening. Apologies for the naive issue.
Thank you.

Add ktlint support

In order to have a consistent code formatting and automatic code formatting checks, it would be useful to integrate the Kotlin linter and formatter ktlint in the project.

Add function to avoid ambiguous results

Hi,

while testing the library with some texts I encountered some ambiguous detection results. As far as I understand, the detectLanguageOf method always returns a language as soon as it has at least some possibility. However, there exist texts where this behaviour is probably not desired.

Imagine a text which leads to similar possibilities for two languages, with the first one just a little bit more likely. It would be nice to be able to detect such cases, or at least to ensure a certain distance between the possibilities of the most and the second most likely language (otherwise the method may return UNKNOWN). In our use-case we would prefer to have more detection as unknown rather than (a lot of) false-positives.

The following code snippet illustrates my idea:

@JvmOverloads
fun detectLanguageOf(text: String, requiredRelativeDistance: Double = 0.95): Language {
    
    [...]

    return getMostLikelyLanguage(allProbabilities, unigramCountsOfInputText, requiredRelativeDistance)
}
    
internal fun getMostLikelyLanguage(
    probabilities: List<Map<Language, Double>>,
    unigramCountsOfInputText: Map<Language, Int>,
    requiredRelativeDistance: Double = 0.95
): Language {
    
    [...]

    return when {
        filteredProbabilities.none() -> UNKNOWN
        filteredProbabilities.singleOrNull() != null -> filteredProbabilities.first().key
        else -> {
            val candidate = filteredProbabilities.maxBy { it.value }!!
            val second = filteredProbabilities.filter { it.key != candidate.key }.maxBy { it.value }!!
            if (second.value * requiredRelativeDistance < candidate.value) {
                candidate.key
            } else {
                UNKNOWN
            }
        }
    }
}

Feel free to copy the code if you want. I don't know whether this is a good approach for the problem or if there are better ways to do that. However, it would be really nice to have a solution for this in some way.

Thanks in advance!

An API improvement for detect result model

It is more like a proposal, but it would be probably better to transform detect model

enum class Language(
    val isoCode: String,
    internal val hasLatinAlphabet: Boolean,
    internal val hasCyrillicAlphabet: Boolean,
    internal val hasArabicAlphabet: Boolean,
    internal val hasUmlauts: Boolean,
    internal var isExcludedFromDetection: Boolean
)

to something like

enum class LanguageOptions(
    val hasLatinAlphabet: Boolean,
    val hasCyrillicAlphabet: Boolean,
    val hasArabicAlphabet: Boolean,
    val hasUmlauts: Boolean
)

enum class Language(
    val isoCode: String,
    val options : LanguageOptions,
    internal var isExcludedFromDetection: Boolean
)

and give access to those params to anyone who needs to.

I'm using them already like this (in Scala), but it is a bit ugly:

LanguageDetectionResult(
	isoCode = lo.getIsoCode,
	options = LanguageDetectionOptions(
		hasArabicAlphabet = lo.getHasArabicAlphabet$lingua,
		hasCyrillicAlphabet = lo.getHasCyrillicAlphabet$lingua,
		hasLatinAlphabet = lo.getHasLatinAlphabet$lingua,
		hasUmlauts = lo.getHasUmlauts$lingua
	)
)

Add Xhosa language support

Language detection support for Xhosa will be added.

Provide public API for language model and test data generation

It is necessary to come up with a public API for language model and test data generation, so that developers can extend Lingua with their own languages.

java.lang.ExceptionInInitializerError

Hi,

I tried your code in my Android app for a test. The first line of code throws the exception.
Here is the line of code that throws exception.

LanguageDetector detector = LanguageDetectorBuilder.fromAllBuiltInLanguages().build();

exception:
java.lang.ExceptionInInitializerError
at com.github.pemistahl.lingua.api.LanguageDetectorBuilder.build(LanguageDetectorBuilder.kt:29)

Thanks.

Add Shona language support

Language detection support for Shona will be added.

Add Zulu language support

As requested by @orpheusx in #34, language detection support for Zulu will be added.

Japanese recognized as Finnish

Using all built in languages, texts in japanese are almost always recognized as finnish.

It seems that the problem is due to the presence of "not-japanese" characters.

example:

ヴェダイヤモンド（0.34カラット）をあしらった18Kピンクゴールド製。

is detected as finnish, removing the "k" from 18k give the right result.

"dallas" is recognised as Spanish Language

When I load English and Spanish languages, "dallas" which is a city name is recognized as Spanish text.

Is there an override setting for these type of inputs?

Detection of CJK languages doesn't work with at least one non-CJK character

Hey.

I just wanted to let you know that detection of CJK languages doesn't work when there is at least one non-CJK character. For example, this will work:

languageDetector.detectLanguageOf("기모링");

While this won't work:

languageDetector.detectLanguageOf("기모링a");

languageDetector.detectLanguageOf("기모링~");

Instead of getting Language.KOREAN, we get Language.UNKNOWN.

Can you please confirm if this is the case on your end and if there is an easy fix? I quickly checked the method and it seems that check for CJK doesn't correctly pick it up and then "summedUpProbabilities" has value 0 for all languages. It seems that there are json models for korean etc so it should pick it up normally and give scores for all languages based on that but it doesn't.

If it can't be reliably fixed without throwing tens of hours, boolean argument to agressively check for CJK would be great (remove all a-zA-Z characters, all special characters, split input into more sentences and if one of them is detected as CJK then return that language etc).

ACCURACY_TABLE.md does not show languages that Lingua doesn't support

optimaize/language-detector#107 made me think how there are some comparison gaps between these Java language detectors in https://github.com/pemistahl/lingua/blob/master/ACCURACY_TABLE.md - it only shows languages Lingua supports.

Lingua currently supports 66 languages.
Optimaize supports 10 less languages than Lingua in this table - suggesting it supports ~56 languages
Optimaize actually supports 71 languages (though simp/trad Chinese count as 2, whereas Lingua counts them as 1 - but not the point).

It would benefit users to have a more complete comparison of e.g. Optimiaize vs Lingua - it could be very helpful to know which languages Optimaize supports which Lingua currently does not support.

Hopefully Lingua-unsupported-language test data can be found to help with this generate another accuracy table. If Optimaize users can easily see which languages Lingua doesn't support (which Optimaize does), it could help them voice a priority order.

Besides, this test data would still be useful when Lingua does support those extra languages!

Give more details about Lingua's internal workings

Is it possible to include more details about the algorithm lingua uses for language ID? For my 37-language corpus, I have found its accuracy exceeds that of the other language identification tools i have previously been using, and I would like to know about its internal workings. The README file only mentions it "draws on both rule-based and statistical methods but does not use any dictionaries of words". More details on these statistical/rule-based techniques would be great, including links to any journal articles

Detection Matching on languages not included in builder

Hi @pemistahl

I found this library recently and i want to thank you (first of all) for creating this. Its awesome and much better than what i've been using previously.

Found Issue :
I see that many times detectLanguageOf is detects a language that is not included when building the Detector.

e.g. text vegas returns the detected language as TAGALOG even though i didn't include this as one of the languages while building detector.
I am not sure if its a known issue. Or if i am doing something wrong while building the detector.

Anyways, i am will try to create a patch if i find the issue in code. Just wanted to get this known to you.

Thanks

Lingua not working in Jupyter Notebook with Scala Kernel

Trying to run this in jupyter notebook running scala (Almond) kernel. I am able to import the library easily. But when I attempt to build a LanguageDetector using LanguageDetectorBuilder it returns a strange error:

val lbld = LanguageDetectorBuilder.fromAllBuiltInSpokenLanguages.build()`
java.lang.reflect.InvocationTargetException
	sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	java.lang.reflect.Method.invoke(Method.java:498)
	ammonite.interpreter.Interpreter$$anonfun$process$1$$anonfun$apply$21$$anonfun$apply$24$$anonfun$apply$25$$anonfun$11.apply(Interpreter.scala:259)
java.lang.NoClassDefFoundError: Could not initialize class com.github.pemistahl.lingua.api.Language
	com.github.pemistahl.lingua.api.LanguageDetectorBuilder$Companion.fromAllBuiltInSpokenLanguages(LanguageDetectorBuilder.kt:42)
	com.github.pemistahl.lingua.api.LanguageDetectorBuilder.fromAllBuiltInSpokenLanguages(LanguageDetectorBuilder.kt)
	cmd11$$user$$anonfun$1.apply(Main.scala:336)
	cmd11$$user$$anonfun$1.apply(Main.scala:335)
	cmd11$$user.<init>(Main.scala:337)
	cmd11.<init>(Main.scala:341)
	cmd11$.<init>(Main.scala:290)
	cmd11$.<clinit>(Main.scala)
	cmd11$Main$.$main(Main.scala:285)
	cmd11$Main.$main(Main.scala)
	sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	java.lang.reflect.Method.invoke(Method.java:498)
	ammonite.interpreter.Interpreter$$anonfun$process$1$$anonfun$apply$21$$anonfun$apply$24$$anonfun$apply$25$$anonfun$11.apply(Interpreter.scala:259)

Confidence scoring

Hey @pemistahl

Awesome library it works really well, best performing calculations compared to all the other open source stuff.

I'm looking to replace google detection API with your project but the one thing it's missing is a confidence score similar to google's

{"language":"te","confidence":0.4294964}

Is this something you are thinking about adding in?

Add Yoruba language support

As requested by @orpheusx in #34, language detection support for Yoruba will be added.

How to reduce the size of the jar file by excluding language profiles?

I need to run this lib in a memory constrained environment: less than 200Mb for the unzipped package. How can I exclude rare language profiles from the library?

An alternative: can the memory size be significantly decreased by minifying the json files used for each language?

Note: I am using the maven build of the lingua.

MapDB support not working on Android

that came with me
Error: Method name '%%%verify$mapdb' in class 'org.mapdb.DB$Maker' cannot be represented in dex format.

How about performance?

I have already used another library for language detection (Optimaize) and it has achieved a great performance. I saw that "lingua" outperforms when it comes to accuracy, but what about performance?

Additional languages?

My company's provides reproductive health information in developing countries. We've tried a number of the alternatives and they really do not do a very good job with shorter bits of text. From our tests, Lingua is clearly way WAY better. We're looking for solutions for Swahili, Zulu and Yoruba. Is there a priority list for any language additions you're planning? Anything I could do to help?

Questions about detection performance

Lingua is definitely the best among the softwares I compared, as you stated,

Nevertheless I noticed that often on real-scenario texts like the following French sentence, Lingua fails to identify the right language. Lingua detect the text as "ca"
It is definitely strange, because a so long sentence is clearly written in French.
Recevez 2 000 éclats de super cristal héroïque, 250 éclats de cristal de héros 3 étoiles, 1 capsule d'énergie et 3 capsules de soin de niveau 1 à utiliser pour obtenir de nouveaux champions et tenter de devenir le champion Marvel ultime, pour seulement 2 € en payant avec ModernMT.

So how would like to know:

which are the most important problems which may cause errors in detecting the right language?
are there any tricks to improve performance?

Language Detector misclassifies English text block as Greek

When I build a detector from .fromAllBuiltInSpokenLanguages(), it detects the follow text as Greek instead of English:

Rooter: A Methodology for the Typical Unification
of Access Points and Redundancy
Jeremy Stribling, Daniel Aguayo and Maxwell Krohn
ABSTRACT
Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public-private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.
I. INTRODUCTION
Many scholars would agree that, had it not been for active networks, the simulation of Lamport clocks might never have occurred. The notion that end-users synchronize with the investigation of Markov models is rarely outdated. A theo-retical grand challenge in theory is the important unification of virtual machines and real-time theory. To what extent can web browsers be constructed to achieve this purpose? Certainly, the usual methods for the emulation of Smalltalk that paved the way for the investigation of rasterization do not apply in this area. In the opinions of many, despite the fact that conventional wisdom states that this grand challenge is continuously answered by the study of access points, we believe that a different solution is necessary. It should be noted that Rooter runs in Ω(log log n) time. Certainly, the shortcoming of this type of solution, however, is that compilers and superpages are mostly incompatible. Despite the fact that similar methodologies visualize XML, we surmount this issue without synthesizing distributed archetypes. We question the need for digital-to-analog converters. It should be noted that we allow DHCP to harness homoge-neous epistemologies without the evaluation of evolutionary programming [2], [12], [14]. Contrarily, the lookaside buffer might not be the panacea that end-users expected. However, this method is never considered confusing. Our approach turns the knowledge-base communication sledgehammer into a scalpel.
Our focus in our research is not on whether symmetric encryption and expert systems are largely incompatible, but rather on proposing new flexible symmetries (Rooter). Indeed, active networks and virtual machines have a long history of collaborating in this manner. The basic tenet of this solution is the refinement of Scheme. The disadvantage of this type of approach, however, is that public-private key pair and red-black trees are rarely incompatible. The usual methods for the visualization of RPCs do not apply in this area. Therefore, we see no reason not to use electronic modalities to measure the improvement of hierarchical databases.
The rest of this paper is organized as follows. For starters, we motivate the need for fiber-optic cables. We place our work in context with the prior work in this area. To ad-dress this obstacle, we disprove that even though the much-tauted autonomous algorithm for the construction of digital-to-analog converters by Jones [10] is NP-complete, object-oriented languages can be made signed, decentralized, and signed. Along these same lines, to accomplish this mission, we concentrate our efforts on showing that the famous ubiquitous algorithm for the exploration of robots by Sato et al. runs in Ω((n + log n)) time [22]. In the end, we conclude.
II. ARCHITECTURE
Our research is principled. Consider the early methodology by Martin and Smith; our model is similar, but will actually overcome this grand challenge. Despite the fact that such a claim at first glance seems unexpected, it is buffetted by previous work in the field. Any significant development of secure theory will clearly require that the acclaimed real-time algorithm for the refinement of write-ahead logging by Edward Feigenbaum et al. [15] is impossible; our application is no different. This may or may not actually hold in reality. We consider an application consisting of n access points. Next, the model for our heuristic consists of four independent components: simulated annealing, active networks, flexible modalities, and the study of reinforcement learning. We consider an algorithm consisting of n semaphores. Any unproven synthesis of introspective methodologies will clearly require that the well-known reliable algorithm for the investigation of randomized algorithms by Zheng is in Co-NP; our application is no different. The question is, will Rooter satisfy all of these assumptions? No.
Reality aside, we would like to deploy a methodology for how Rooter might behave in theory. Furthermore, consider the early architecture by Sato; our methodology is similar, but will actually achieve this goal. despite the results by Ken Thompson, we can disconfirm that expert systems can be made amphibious, highly-available, and linear-time. See our prior technical report [9] for details.
III. IMPLEMENTATION
Our implementation of our approach is low-energy, Bayesian, and introspective. Further, the 91 C files contains about 8969 lines of SmallTalk. Rooter requires root access in order to locate mobile communication. Despite the fact that we have not yet optimized for complexity, this should be simple once we finish designing the server daemon. Overall,
DNS
server
VPN
Client
A
NAT
Remote
server
Remote
firewall
Home
user
Bad
node
Server
A
Fig. 1. The relationship between our system and public-private key pair [18].
Rooter
Emulator Shell
Simulator
Kernel
Keyboard
Editor
Fig. 2. The schematic used by our methodology. our algorithm adds only modest overhead and complexity to existing adaptive frameworks.
IV. RESULTS
Our evaluation method represents a valuable research contri-bution in and of itself. Our overall evaluation seeks to prove three hypotheses: (1) that we can do a whole lot to adjust a framework’s seek time; (2) that von Neumann machines no longer affect performance; and finally (3) that the IBM PC Junior of yesteryear actually exhibits better energy than today’s hardware. We hope that this section sheds light on Juris Hartmanis ’s development of the UNIVAC computer in
1995.
2
4
2 4 8 16 32 64 128
w
or
k
fa
ct
or
(
#
C
P
U
s)
time since 1977 (teraflops)
Fig. 3. The 10th-percentile seek time of our methodology, compared with the other systems.
-20
0
20
40
60
80
100
-10 0 10 20 30 40 50 60 70 80 90
tim
e
si
nc
e
19
93
(
m
an
-h
ou
rs
)
sampling rate (MB/s)
topologically efficient algorithms
2-node
Fig. 4. These results were obtained by Dana S. Scott [16]; we reproduce them here for clarity.
A. Hardware and Software Configuration One must understand our network configuration to grasp the genesis of our results. We ran a deployment on the NSA’s planetary-scale overlay network to disprove the mutually large-scale behavior of exhaustive archetypes. First, we halved the effective optical drive space of our mobile telephones to better understand the median latency of our desktop machines. This step flies in the face of conventional wisdom, but is instrumental to our results. We halved the signal-to-noise ratio of our mobile telephones. We tripled the tape drive speed of DARPA’s 1000-node testbed. Further, we tripled the RAM space of our embedded testbed to prove the collectively secure behavior of lazily saturated, topologically noisy modalities. Similarly, we doubled the optical drive speed of our scalable cluster. Lastly, Japanese experts halved the effective hard disk throughput of Intel’s mobile telephones. Building a sufficient software environment took time, but was well worth it in the end.. We implemented our scat-ter/gather I/O server in Simula-67, augmented with oportunis-tically pipelined extensions. Our experiments soon proved that automating our parallel 5.25” floppy drives was more effective than autogenerating them, as previous work suggested. Simi-
35
40
45
50
55
60
65
70
36 38 40 42 44 46 48 50 52 54 56
si
gn
al
-t
o-
no
is
e
ra
tio
(
nm
)
latency (bytes)
Fig. 5. These results were obtained by Bhabha and Jackson [21]; we reproduce them here for clarity.
-40
-20
0
20
40
60
80
100
120
-40 -20 0 20 40 60 80 100
se
ek
ti
m
e
(c
yl
in
de
rs
)
latency (celcius)
millenium
hash tables
Fig. 6. The expected distance of Rooter, compared with the other applications.
larly, We note that other researchers have tried and failed to enable this functionality.
B. Experimental Results
Is it possible to justify the great pains we took in our implementation? It is. We ran four novel experiments: (1) we dogfooded our method on our own desktop machines, paying particular attention to USB key throughput; (2) we compared throughput on the Microsoft Windows Longhorn, Ultrix and Microsoft Windows 2000 operating systems; (3) we deployed 64 PDP 11s across the Internet network, and tested our Byzantine fault tolerance accordingly; and (4) we ran 18 trials with a simulated WHOIS workload, and compared results to our courseware simulation..
Now for the climactic analysis of the second half of our experiments. The curve in Figure 4 should look familiar; it is better known as gij(n) = n. Note how deploying 16 bit archi-tectures rather than emulating them in software produce less jagged, more reproducible results. Note that Figure 6 shows the median and not average exhaustive expected complexity. We next turn to experiments (3) and (4) enumerated above, shown in Figure 4. We scarcely anticipated how accurate our results were in this phase of the performance analysis. Next, the curve in Figure 3 should look familiar; it is better known
as H
′
(n) = n. On a similar note, the many discontinuities in the graphs point to muted block size introduced with our hardware upgrades.
Lastly, we discuss experiments (1) and (3) enumerated above. The many discontinuities in the graphs point to dupli-cated mean bandwidth introduced with our hardware upgrades. On a similar note, the curve in Figure 3 should look familiar;
it is better known as F
′
∗(n) = log 1.32
n. the data in Figure 6,
in particular, proves that four years of hard work were wasted on this project [12].
V. RELATED WORK
A number of related methodologies have simulated Bayesian information, either for the investigation of Moore’s Law [8] or for the improvement of the memory bus. A litany of related work supports our use of Lamport clocks [4]. Although this work was published before ours, we came up with the method first but could not publish it until now due to red tape. Continuing with this rationale, S. Suzuki originally articulated the need for modular information. Without using mobile symmetries, it is hard to imagine that the Turing machine and A* search are often incompatible. Along these same lines, Deborah Estrin et al. constructed several encrypted approaches [11], and reported that they have limited impact on the deployment of the Turing machine [22]. Without using the Turing machine, it is hard to imagine that superblocks and virtual machines [1] are usually incompatible. On the other hand, these solutions are entirely orthogonal to our efforts. Several ambimorphic and multimodal applications have been proposed in the literature. The much-tauted methodology by Gupta and Bose [17] does not learn rasterization as well as our approach. Karthik Lakshminarayanan et al. [5] developed a similar methodology, however we proved that Rooter is Turing complete. As a result, comparisons to this work are fair. Further, the seminal framework by Brown [4] does not request low-energy algorithms as well as our method [20]. Although this work was published before ours, we came up with the approach first but could not publish it until now due to red tape. Furthermore, the original approach to this riddle
[1] was adamantly opposed; contrarily, such a hypothesis did not completely fulfill this objective [13]. Lastly, note that Rooter refines A* search [7]; therefore, our framework is NP-complete [3].
The study of the Turing machine has been widely studied. The original method to this obstacle was promising; never-theless, this outcome did not completely fulfill this purpose. Though Smith also proposed this solution, we harnessed it independently and simultaneously [19]. As a result, if latency is a concern, Rooter has a clear advantage. Our approach to redundancy differs from that of Bose [6] as well.
VI. CONCLUSION
Here we motivated Rooter, an analysis of rasterization. We leave out a more thorough discussion due to resource constraints. Along these same lines, the characteristics of our heuristic, in relation to those of more little-known applications, are clearly more unfortunate. Next, our algorithm has set a precedent for Markov models, and we that expect theorists will harness Rooter for years to come. Clearly, our vision for the future of programming languages certainly includes our algorithm.
REFERENCES
[1] AGUAYO, D., AGUAYO, D., KROHN, M., STRIBLING, J., CORBATO,
F., HARRIS, U., SCHROEDINGER, E., AGUAYO, D., WILKINSON,
J., YAO, A., PATTERSON, D., WELSH, M., HAWKING, S., AND SCHROEDINGER, E. A case for 802.11b. Journal of Automated Reasoning 904 (Sept. 2003), 89–106.
[2] BOSE, T. Deconstructing public-private key pair with DewyProser. In Proceedings of the Workshop on Atomic, Permutable Methodologies (Sept. 1999).
[3] DAUBECHIES, I., AGUAYO, D., AND PATTERSON, D. A methodology for the synthesis of active networks. In Proceedings of OOPSLA (Mar. 1999).
[4] GAYSON, M. The impact of distributed symmetries on machine learning.
Journal of Lossless, Extensible Methodologies 6 (Aug. 2000), 1–13.
[5] HOARE, C. Moore’s Law considered harmful. Journal of Lossless Models 17 (Jan. 1999), 1–14.
[6] JOHNSON, J., AND JACKSON, Y. Red-black trees no longer considered harmful. TOCS 567 (Aug. 2001), 1–18.
[7] JONES, Q., KUMAR, Z., AND KAHAN, W. Deconstructing massive multiplayer online role-playing games. In Proceedings of VLDB (Nov. 2002).
[8] JONES, X., ZHAO, M., AND HARRIS, A. Hash tables considered harmful. Journal of Homogeneous, Ambimorphic Modalities 10 (Apr. 1995), 159–198.
[9] KAASHOEK, M. F., AGUAYO, D., AND LAMPORT, L. Synthesizing DNS using trainable configurations. In Proceedings of ECOOP (Dec. 2002).
[10] KROHN, M., AND KROHN, M. A refinement of Boolean logic with SoddyPort. In Proceedings of FOCS (Oct. 1999).
[11] LAMPORT, L., KOBAYASHI, P., STEARNS, R., AND STRIBLING, J. Dag: A methodology for the emulation of simulated annealing. In Proceedings of ASPLOS (Oct. 2002).
[12] LEARY, T. Decoupling I/O automata from access points in model checking. In Proceedings of PLDI (June 1994).
[13] MARTINEZ, N., MARUYAMA, A., AND MARUYAMA, M. Visualizing the World Wide Web and semaphores with ShoryElemi. In Proceedings of ASPLOS (Dec. 2005).
[14] MARUYAMA, F. The influence of secure symmetries on robotics. Journal of Replicated Models 56 (Mar. 2005), 87–105.
[15] MORRISON, R. T., AND MILNER, R. Architecting active networks and write-ahead logging using Poy. In Proceedings of the Workshop on Bayesian, Amphibious Modalities (Nov. 1999).
[16] NEEDHAM, R. Synthesizing kernels and extreme programming using Spece. Journal of Read-Write, Electronic Theory 1 (Apr. 1990), 78–95.
[17] RIVEST, R., SASAKI, I., AND TARJAN, R. Electronic, perfect archetypes for cache coherence. NTT Techincal Review 47 (Feb. 1993), 1–14.
[18] STRIBLING, J., AND GUPTA, P. Decoupling multicast applications from a* search in checksums. NTT Techincal Review 98 (May 1994), 47–53.
[19] STRIBLING, J., WATANABE, K., STRIBLING, J., AND LI, Y. A study of 32 bit architectures that made developing and possibly evaluating object-oriented languages a reality with Eburin. Journal of Introspective, Introspective Archetypes 1 (May 1994), 75–89.
[20] TAYLOR, J. A methodology for the synthesis of e-business. In Proceedings of ECOOP (Aug. 1997).
[21] ULLMAN, J., MILNER, R., SHASTRI, V., BROWN, G., PERLIS, A., AND SUZUKI, B. A visualization of the World Wide Web using FlaggyCold. In Proceedings of the USENIX Technical Conference (Feb. 1998).
[22] ZHOU, O. M., ZHAO, H., PAPADIMITRIOU, C., AND ZHENG, S. De-constructing vacuum tubes. NTT Techincal Review 26 (Feb. 2005), 20–
24.

For reference, this text was extracted from an academic paper with some greek letters in the formulas, but not enough to merit this text to be classified as greek.

Add Southern Sotho language support

Language detection support for Southern Sotho will be added.

pemistahl / lingua Goto Github PK

lingua's Introduction

Hello, thank you for visiting my profile. 🖖🏻🤓

lingua's People

Contributors

Stargazers

Watchers

Forkers

lingua's Issues

Recommend Projects

Recommend Topics

Recommend Org